Thanks for posting this one, Dave, and hopefully it is educational for many.
I work in the hundreds and thousands of amps, mucking around with data centres, switchboards, large UPS systems, HVAC, power quality etc. Tin whiskers do not really enter my sphere of operation, but their closely related cousin 'Zinc Whiskers' kept me well occupied for quite a few years.
I first ran into zinc whiskers around 25 years ago. I'd been called out to a small data centre where they were experiencing an extraordinary number of failures of Sun servers. A few a week were giving up the ghost and the management of the centre were not happy campers. The data centre was typical: a raised floor computer room supporting racks of equipment which were cooled by airflow that came from under the floor. I immediately assumed that there was a power issue on the site and installed a power disturbance monitor, a BMI 8800 for those playing at home, and left this for a week or two to gather data. Came back after the monitoring period to be told that a few more servers had served their last. Went back to the office to begin the examination of data.
The BMI 8800 printed out its reports on thermal paper. It would print out daily status reports, and any time it had captured an event it would print out a list of parameters for the event as well as an oscillogram of it. So even a short investigation like this could generate quite a ream of paper. The good old days weren’t. I trudged through this ream of data, spotting an ‘out of parameter’ event now and then, but nothing that could be pointed to as a smoking gun that could have caused any server failures.
Back to site and lots of crawling under the floor, jerking on cables, taking photos, all the while not realising that I was exacerbating the problem. I convinced the client to install some high speed surge suppression (silicon avalanche) devices within one section of the data centre to try to narrow the problem down. Still, no resolution. Servers were still going down without rhyme or reason and it didn’t matter where I installed monitoring or suppression. At this point the data centre management were not the only ones losing hair.
The funny thing is, all these years later, I have no memory of how I discovered what the issue was. Whether it was through a desperate internet search (very unlikely, as the internet was still nascent), a technical paper I came across, or a wiser older colleague, I really have no memory of the aha! moment. Turns out that it was the floor tiles. The bottom of them was zinc plated and from this plating zinc whiskers could quickly grow. Lifting a tile and shining a torch along it would readily show the whiskers being produced. The whiskers were being disturbed, floating around and eventually landing upon some delicate part of a server.
Data centre management were even unhappier when I told them the problem: “Your room is killing your computers”. Very expensive to mitigate. Shut everything down and pull it out of the room. Have it extensively cleaned. In the room, pull up the floor and have the whole room ‘clean room’ cleaned. Replace the flooring tiles with a new version that does not contain zinc. Replace equipment, start up again and then pray.
Once I got through that first one I then found it popping up regularly for the next ten years or so. I’d get called in to a data centre to resolve a mysterious problem. “We turn on our air-conditioning and our computers start blowing up” would be a common refrain. I’d grab a floor tile lifter, bring up a tile, look at the base and then shake my head in a sorrowful manner. This was a very expensive problem to fix. It got to the stage where I’d get phone calls and I wouldn’t even bother going to site. I’d spend ten minutes on the phone and diagnose it then and there and then hang up before the sobbing and wailing became too extreme. The industry learnt, adapted and moved on.
Continued in part 2...