Author Topic: Bit fade testing: memtest86 vs memtest86+ (Read 5457 times)

rhb · « **on:** March 14, 2021, 12:21:46 am »

Searches produced nothing recent or relevant, so I thought I'd try here.

I've got 3x 4 slot HP Z400s and 1x 6 slot Z400. The 6 slot machine has become unstable running Solaris 10 u8 after some 10 years of operation.

I know from experience that the issue is bit fade. But identifying the bad DIMM is proving very troublesome.

I compiled memtest86+ from source. Both the 6 slot and a 4 slot machine report copious bit fade errors. There have been no issues with the 4 slot machine. I should note that all Z400s use ECC memory. Whether compiled from source or a binary ISO download, both machines report essentially every location as bad using memtest86+. Both machines pass all the other tests in the suite with both versions.

The free version of memtest86 4.3.7 from the Passmark website reports no bit fade errors on either machine nor errors on any other tests.

I have been studying the source for memtest86+ and do not see any obvious errors. However, I had one run in which the value read was reported as the address. That led me to examine the source in hope that it was simply a failure to dereference a pointer, but that proved not to be the case.

I have a general idea of the addressing mode being used in memtest86+, but not good enough to spot an error.

If you have not encountered bit fade, it is the nastiest HW fault I know of. A freshly booted system will complete an operation such as "dd if=/etc/passwd of=/dev/tape" without a problem. But leave the system for a day and repeat the same command the next evening and the system will kernel panic. I spent about 2 weeks learning about that on a 3/60 for which I had built a custom kernel. I eventually simply went back to using the generic kernel which mapped the bad RAM to a device I did not have. What makes the problem so difficult is that it might take days for the bit to fade.

What is happening is that critical kernel data is being written at boot time, but the refresh rate is not adequate to maintain the correct values.

Any illuminati in the neighborhood? I'd like to modify memtest86+ to initialize memory and then read at intervals that double up to a user specified limit so it would locate the 2-3 day fade problems. But I've got to get it to work at all first.
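The doubling read schedule described above can be sketched in a few lines. This is only an illustration of the timing arithmetic, not memtest86+ code; the function name `fade_check_schedule` and its parameters are hypothetical.

```python
def fade_check_schedule(first_interval_s, limit_s):
    """Return the times (seconds after the pattern write) at which memory
    would be re-read. The gap between reads doubles each time and stops
    growing once it would exceed limit_s (the user specified limit)."""
    if first_interval_s <= 0 or limit_s < first_interval_s:
        raise ValueError("need 0 < first_interval_s <= limit_s")
    times = []
    elapsed = 0
    interval = first_interval_s
    while interval <= limit_s:
        elapsed += interval          # wait one more interval, then read
        times.append(elapsed)
        interval *= 2                # next wait is twice as long
    return times


# Hypothetical example: first read after 1 minute, longest wait capped at 3 days.
schedule = fade_check_schedule(60, 3 * 24 * 3600)
print(len(schedule), "reads, last one", schedule[-1] / 86400, "days after the write")
```

Because the waits double, a cap of a few days is reached after only about a dozen reads, so the cost of the extra checks is small compared with a single long wait.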

Reg

retiredfeline · « **Reply #1 on:** March 14, 2021, 12:35:23 am »

Sorry, can't help you but thanks for introducing me to a new technical term I can use.

"Sorry I forgot about that appointment, I've got bit fade."

rhb · « **Reply #2 on:** March 14, 2021, 12:43:06 am »

I just modified the source to use a millisecond sleep between writing a block of memory and reading it. This produced an example of the value read aka "error" being the address.

Unfortunately, there doesn't appear to be a functioning forum for memtest86+. I'd *really* like to solve this as it is the sort of problem that will drive a mission critical systems admin out of his mind.

Reg

golden_labels · « **Reply #3 on:** March 14, 2021, 01:58:18 am »

This is a very old bug that appeared back in 2013. Version 4.20, if you can get it, doesn’t exhibit the problem.

rhb · « **Reply #4 on:** March 14, 2021, 02:20:03 am »

Can you explain the bug? I'd like to fix it.

Reg

hamster_nz · « **Reply #5 on:** March 14, 2021, 02:52:12 am »

Does it use ECC? If so, can you not run HP's diagnostics to read the hardware logs?

rhb · « **Reply #6 on:** March 14, 2021, 03:31:22 am »

It is ECC. What diagnostics and what logs? I run Solaris 10 u8 on this. My experience to date with HP diagnostics is they are a waste of time for what I am doing.

It came with a Win 7 Pro license, but I use that on a Vista licensed Z400 I bought for $100.

Were it not for my long standing interest in the problems that bit fade causes, I'd just buy 4 GB DIMMs and do an upgrade from 12 to 24 GB. Actually, I'm going to do that anyway. But I do *not* lose arguments with mere machines of any type. And especially with this issue. Quite simply this is a "death match" and I expect to live long enough to win.

Reg

Monkeh · « **Reply #7 on:** March 14, 2021, 03:37:07 am »

If it is, as you believe, a RAM issue, you should be seeing uncorrectable errors from the memory controller. The Z400 should be recording these - accessing that I leave up to you and their documentation.

If the memory controller is not reporting errors, it seems likely that you're chasing your tail.

Nominal Animal · « **Reply #8 on:** March 14, 2021, 08:49:12 am »

Quote from: rhb on March 14, 2021, 12:21:46 am

I compiled memtest86+ from source.

Which version? 5.31b? 5.01? 4.20? 4.10? 5.01+5.01-3.1 Debian? Or perhaps the most maintained-looking Coreboot memtest86plus fork?

rhb · « **Reply #9 on:** March 14, 2021, 04:24:41 pm »

I compiled Memtest86+ 5.31b from http://www.memtest.org/

The Coreboot repository link:

https://review.coreboot.org/memtest86plus.git

returns "Not found". I can find the individual files, but no way to download a tarball.

Sad to say after dealing with SCCS, RCS, Subversion, Mercurial and a few other version control systems I sort of lost enthusiasm for learning new ones. I continue to use RCS as it suits my needs. I did set up a git repository once, but lost interest in that project and with it my knowledge of git.

Running Solaris 10 u8 on an HP Z400 is, shall we say, "not supported any longer" as both are 10 years old. Nor was it ever likely that the system would log ECC errors on Solaris. However, thanks for the tip. I shall look at the Z400 service manual to see if it offers any information.

The HP documentation from that era is very Windows centric, and generally of little use to me. So I have tended not to look at it much. Just knowing that there is a facility for logging ECC errors helps.

I discovered that there are real serial ports on the mother board. They require a level shifter kit which I found on ebay and ordered a pair. Once those arrive I should be able to run memtest86+ under gdb as a remote target. I'm sure that will be an adventure as I've not done that but a few times many years ago with an MSP430.

Memtest86+ 5.31b writes patterns to memory in slices. That makes sense for a lot of the other tests, but it is not optimal for a bit fade that takes 1-2 days to appear. So I'm going to see if I can puzzle out the code so that it writes a pattern to all memory above where it is located and then tests at ever longer intervals as I outlined earlier.
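A rough sketch of that write-once, check-repeatedly idea follows. It is a hosted simulation over a `bytearray`, not bare-metal code and not how memtest86+ is structured; the names `write_pattern`, `find_faded` and `fade_test` are hypothetical, and the `sleep` parameter exists only so the wait can be replaced in a dry run.

```python
import time


def write_pattern(region, pattern):
    """Fill every byte of the region with the same test pattern."""
    for offset in range(len(region)):
        region[offset] = pattern


def find_faded(region, pattern):
    """Return (offset, changed_bits) for every byte that no longer matches."""
    return [(offset, value ^ pattern)
            for offset, value in enumerate(region)
            if value != pattern]


def fade_test(region, pattern, delays_s, sleep=time.sleep):
    """Write the pattern once, then re-read after each delay without
    rewriting. Returns (delay, faded) for the first read that finds
    mismatches, or None if every read was clean."""
    write_pattern(region, pattern)
    for delay in delays_s:
        sleep(delay)
        faded = find_faded(region, pattern)
        if faded:
            return delay, faded
    return None


# Short dry run with sub-second delays; a real test would wait hours or days.
print(fade_test(bytearray(1024), 0xFF, [0.01, 0.02, 0.04]))
```

The key difference from a slice-by-slice test is that the pattern is never rewritten between reads, so a cell that only loses its value after a long idle period has the chance to do so.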

On to the HP Z400 service manual!

Have Fun!
Reg

Edit:

Well, as I feared, according to the Z400 service manual excess ECC errors "generate a local user alert". So only applicable to Windows.

Nominal Animal · « **Reply #10 on:** March 14, 2021, 04:53:07 pm »

Quote from: rhb on March 14, 2021, 04:24:41 pm

I can find the individual files, but no way to download a tarball.

Run
git clone https://review.coreboot.org/memtest86plus.git
and it'll create and download it into memtest86plus/ under the current working directory.
When in the memtest86plus/ directory, git pull will check and download any changes.

I would recommend trying this one, because it is the only tree/fork that seems to be maintained.

Monkeh · « **Reply #11 on:** March 14, 2021, 04:53:39 pm »

Quote from: rhb on March 14, 2021, 04:24:41 pm

The Coreboot repository link:

https://review.coreboot.org/memtest86plus.git

returns "Not found". I can find the individual files, but no way to download a tarball.

It's git, so you need to use git. It's not exactly hard to find instruction on this.

Quote from: rhb on March 14, 2021, 04:24:41 pm

Well, as I feared, according to the Z400 service manual excess ECC errors "generate a local user alert". So only applicable to Windows.

They have remote management support with an appropriate HP NIC which should expose such errors. Solaris should be able to log ECC failures with FMA, but don't ask me how to use it.

rhb · « **Reply #12 on:** March 14, 2021, 07:45:18 pm »

All the stuff I've gotten from github has had a zip or tar option. I've got the Coreboot code now and will do a diff with the other version.

The Z400 has a Broadcom NIC embedded in the board. The larger problem is what it wants to talk to and someplace to run it.

At the moment I'm trying to back up a 3 TB ZFS mirrored pool to a 12 TB USB disk which I just bought. I'm a bit worried that the WD USB disk might be going to sleep in the middle of the write operation. It wouldn't be the first time I ran into something that stupid.

It might not be the DIMMs. It could be the memory controller on the MB. Mostly I want a reliable test.

Because of the very serious problem that bit fade poses I expect this will eat a bunch of my time. The Z400 BIOS doesn't appear to offer any control of DRAM parameters. ECC is not foolproof. One can have multibit errors that are not detected. It's better than not having ECC, but by no means a complete solution to all problems.

Time to search for an HP remote support utility.

Reg

golden_labels · « **Reply #13 on:** March 15, 2021, 03:02:53 am »

Quote from: rhb on March 14, 2021, 02:20:03 am

Can you explain the bug? I'd like to fix it.

Unfortunately no.

ejeffrey · « **Reply #14 on:** March 15, 2021, 03:33:51 am »

It is possible to have multi bit errors that ECC can't correct but if that is happening there will also be lots of correctable errors as well. If you aren't seeing that, then either ECC isn't enabled, you aren't looking at the log, or the problem is not with the memory itself.

hamster_nz · « **Reply #15 on:** March 15, 2021, 04:02:33 am »

Quote from: ejeffrey on March 15, 2021, 03:33:51 am

It is possible to have multi bit errors that ECC can't correct but if that is happening there will also be lots of correctable errors as well. If you aren't seeing that, then either ECC isn't enabled, you aren't looking at the log, or the problem is not with the memory itself.

At best it takes 4 bit errors to silently corrupt ECC memory....

* A first bit flip for a single bit error

* Another bit flip for a multi-bit error

* Another bit flip for it to be a single bit error, which can be ECC-corrected to the wrong value

* And a last bit flip for it to become the wrong value, with valid ECC.

That's a big ask... not saying ECC is perfect, but it is pretty effective at finding memory errors.
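The four stages listed above can be reproduced with a toy single-error-correcting, double-error-detecting (SECDED) code. The sketch below uses an extended Hamming (8,4) code protecting a 4-bit value, chosen only because it is small. ECC DIMMs typically protect 64 data bits with 8 check bits, and the exact code used by the Z400's memory controller is not known here. All function names are hypothetical.

```python
DATA_POSITIONS = (3, 5, 6, 7)    # Hamming positions that hold the data bits
PARITY_POSITIONS = (1, 2, 4)     # Hamming positions that hold check bits
# Index 0 holds one extra overall-parity bit (the "DED" part of SECDED).


def syndrome(bits):
    """XOR of the positions (1..7) of all set bits; 0 for a valid word."""
    s = 0
    for pos in range(1, 8):
        if bits[pos]:
            s ^= pos
    return s


def encode(nibble):
    """Encode a 4-bit value into an 8-bit SECDED word (list of bits)."""
    bits = [0] * 8
    for shift, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (nibble >> shift) & 1
    syn = syndrome(bits)
    for p in PARITY_POSITIONS:
        if syn & p:
            bits[p] = 1              # choose check bits so the syndrome is 0
    bits[0] = sum(bits) & 1          # make the total number of 1s even
    return bits


def flip(bits, positions):
    """Return a copy of the word with the given bit positions inverted."""
    out = list(bits)
    for pos in positions:
        out[pos] ^= 1
    return out


def decode(bits):
    """Return (status, value). Odd overall parity is treated as a single
    bit error and corrected; even parity with a nonzero syndrome is
    reported as uncorrectable."""
    bits = list(bits)
    syn = syndrome(bits)
    if sum(bits) & 1:
        bits[syn] ^= 1               # syn == 0 means the overall bit itself
        status = "corrected"
    elif syn:
        return "uncorrectable", None
    else:
        status = "ok"
    value = sum(bits[pos] << shift for shift, pos in enumerate(DATA_POSITIONS))
    return status, value


word = encode(0b1011)
for positions in ((5,), (5, 6), (3, 5, 6), (0, 1, 2, 3)):
    print(len(positions), "flip(s):", decode(flip(word, positions)))
```

With this toy code, one flip is corrected back to the original value, two flips are flagged as uncorrectable, three flips are "corrected" to a wrong value, and the particular four flips shown produce a word that passes the check while holding a different value. Not every set of four flips does this; it needs a set that happens to land on another valid code word.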

ejeffrey · « **Reply #16 on:** March 15, 2021, 05:43:33 am »

Exactly. Basically my point is that if you have a faulty DIMM, whatever else happens you are going to have a lot of single-bit correctable errors that should be showing up in your logs. If you don't see those your memory may or may not be bad, but at least one other thing is wrong.

rhb · « **Reply #17 on:** March 15, 2021, 04:27:18 pm »

I have been unable to find any reference to a way to access a log of ECC memory errors on the Z400 despite numerous searches. There is nothing in the BIOS setup menus.

I went through the entire HP Z400 service and maintenance manual. There are some POST codes with reference to memory errors listed in the manual, but I've never seen one.

I did download a couple of Softpaqs which might help. One is for Windows and the other for Linux, but both are .exe files so I have to transfer them to a Windows system to unpack them. At the moment I'm running Debian on that system.

At present the problem system is booted from an OI LiveImage disk and has been sending a zfs filesystem image to a USB drive for 18 hours. Not sure how much longer that will take.

At this point few, if any, of the DIMMs are in the slots they previously occupied and they have been inserted and removed 3-4 times. The recent issue where it became too unstable to complete the transfer of this zfs pool to the USB drive appears to have been caused by a bad connection that the multiple insertions have corrected. Up until the system became too unstable to finish the zfs send operation it had never had the DIMMs touched and had run close to 24x7 for 10 years. The exceptions being instances where the heat buildup from running 3 Z400s was such I had to take something down because it was 80 F in my lab/office.

Once I get the system back together I'll do a scrub, leave the system idle for a day or two and repeat the scrub. My expectation is that the first scrub will go fine and the 2nd will kernel panic.

Simple questions:

If you write to DDR3 ECC RAM and you never access that memory location again, will the system detect an error in that section of memory?

If so, how does this work and where is this documented?

I could find lots of general explanations of what ECC memory is, but no specifics of operation.

Reg

Monkeh · « **Reply #18 on:** March 15, 2021, 05:06:11 pm »

https://docs.oracle.com/cd/E18752_01/html/816-5166/fmadm-1m.html
https://docs.oracle.com/cd/E18752_01/html/816-5166/fmdump-1m.html
https://docs.oracle.com/cd/E18752_01/html/816-5166/fmstat-1m.html

rhb · « **Reply #19 on:** March 15, 2021, 06:07:50 pm »

Oh, so cool! A huge thanks!!!

I made the transition to Solaris 10 at home when we were still running 8 at work and once they fired the really good admins I no longer had any admins to have lunch with and thus never learned the fault management system. One of the guys they fired made a practice of reading all the man pages once a year. He was a real gem.

Before my introduction to Unix, I was a grad student admin for a MicroVAX II and had to hike across campus to study the "grey wall" as I did not have a manual set.

I recently loaded all the Solaris manuals onto a 12.9" iPad Pro and think it is really neat that I have tens of thousands of pages of manuals in an 8.5" x 11" x 0.25" form factor. More documentation than I could physically lift in paper form presented in precisely the same format.

You have really made my day!

Have Fun!
Reg

rhb · « **Reply #20 on:** March 16, 2021, 04:51:21 pm »

It seems that according to the fault management system there are no memory faults at all. In fact there are no events other than zfs events in the 10 years since I set the system up.

This leaves me rather baffled as to how scrubs started immediately after a reboot complete successfully. But if I start the same operation a day or two after booting the system, it consistently crashes. After the reboot, immediate scrubs consistently succeed.

So time to investigate the crash dump. It crashed and rebooted after spending 25 hours transferring a zfs pool snapshot to a 12 TB USB drive. However, as I was running the LiveImage from DVD the crash dump didn't get written to disk.

Reg

Monkeh · « **Reply #21 on:** March 16, 2021, 04:56:20 pm »

Well, that's all under the assumption it has a driver for that memory controller and that it's loaded, and that any logs are persistent - but if there is bit fade it's statistically unlikely that you wouldn't encounter some correctable errors at the least for it to log before an outright failure.

Again, it's very possible you're chasing your tail when the issue could be a bad USB controller or device, or a power supply, or something else. If the tool won't agree with your diagnosis you need to question both.

rhb · « **Reply #22 on:** March 16, 2021, 05:10:00 pm »

The fact that different versions of Memtest86 produce different results, but both versions produce the same result on 2 different machines is very difficult to understand. Unless it's a bug in the version I've compiled and neither machine has a memory fault.

The error bits matching the address with a 1 ms sleep is also weird.

Something is not right, but what is still a mystery. Probably time to use dtrace to determine where the zfs process is in memory and periodically read the structure that gets read when you start a scrub.

Reg

rhb · « **Reply #23 on:** March 16, 2021, 05:46:05 pm »

If you write to memory and don't read it again, can the system detect an ECC error? If so, how? Is there sufficient logic in the refresh that it can detect ECC errors?

I looked for the DDR3 specifications but all I found were ads and "how to"s. No technical document that describes the architecture and operation. I learned that Samsung developed the specs, but that was all I learned.

Reg

rhb · « **Reply #24 on:** March 16, 2021, 11:10:33 pm »

I've had verification that ECCs are *only* computed on a read. So a function pointer table initialized at boot which faded would not generate an ECC error because it caused a kernel panic. This is exactly what I have assumed was the case.
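The behaviour described above can be illustrated with a small model. It assumes, as the post states, that stored words are checked only when they are read. It uses a single parity bit per word in place of real ECC to keep the sketch short, and the class `CheckedMemory` and its methods are hypothetical.

```python
class CheckedMemory:
    """Toy memory: a check bit is stored on every write and compared only
    when the word is read. Nothing inspects words that are left idle."""

    def __init__(self, size):
        self.data = [0] * size
        self.check = [0] * size
        self.error_log = []          # addresses where a read found a mismatch

    @staticmethod
    def _parity(value):
        return bin(value).count("1") & 1

    def write(self, addr, value):
        self.data[addr] = value
        self.check[addr] = self._parity(value)

    def fade(self, addr, bit):
        """Simulate a cell losing its value: flip a stored bit silently."""
        self.data[addr] ^= 1 << bit

    def read(self, addr):
        value = self.data[addr]
        if self._parity(value) != self.check[addr]:
            self.error_log.append(addr)   # error is only noticed here
        return value


mem = CheckedMemory(4)
mem.write(2, 0xA5)        # e.g. a table entry written once at boot
mem.fade(2, 0)            # the bit fades while the word sits unread
print("log before read:", mem.error_log)
mem.read(2)               # first access after the fade
print("log after read: ", mem.error_log)
```

In this model the log stays empty for as long as the faded word is not accessed, which matches the pattern of a clean fault log followed by a failure on first use.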

I booted the system, completed a scrub of all the pools with no errors. I'm leaving the system idle and late tomorrow I'll repeat the scrub. I expect a kernel panic and this time I'll take a look at the core dump to see if I can divine from it which DIMM is bad. I suspect that I cannot, but I'll try. Mostly I'm looking for a pointer dereference fault. Dtrace might let me get more detailed, but I'll likely need some guidance on that.

Reg

