Larry Osterman's WebLog

Confessions of an Old Fogey
My favorite hardware bug

Adi Oltean asks: What's your favorite Bug?

My personal favorite was on the ICL PWS-400.  The ICL PWS-400 was a custom hardware design built by ICL. I was on the team of 5 (two from Microsoft, three from ICL) whose job it was to port MS-DOS 4.1 to this new hardware.  The cool thing about the PWS-400 was that it had some custom hardware that allowed real mode applications to access bank switched memory in 4K pages.  This allowed apps to run in the background without impacting running applications.

Since the five of us were the entire development team, we also did a lot of ad-hoc testing.  One of my personal favorites was running a game that Valorie had brought me from school.  I'm not sure which game it was now, but every time I played it, when it got to a specific spot, the machine would spontaneously reboot.

We put the machine under an ICE (in-circuit emulator, a hardware tool that lets you see what's going on inside and outside the CPU) and discovered that the CPU was being externally reset.  That ruled out some weird software bug.

The hardware guys took the game and the machine and started looking.

After a couple of days, they came back to me and announced they'd found the problem.  It turns out that the trace on the motherboard for the PC speaker was too close to the trace on the motherboard for the CPU reset line.  When you played a specific sound on the PC speaker, EMF emissions from the speaker trace would cause the CPU reset to go high, which caused the CPU to reboot.

 

Gotta love working with hardware :)

 

  • My favorite hardware bug: In 1984 I was working on a CPU upgrade for my employer's main product. We were replacing an 8051 design with a shiny new 8 MHz 80186. We had exactly one (wire-wrapped!) prototype board and I was writing the boot code using a pair of LEDs as my output device. I had a strange hanging bug that I binary searched down to a single instruction: an IDIV. The code always hung at this IDIV. Did we have a bad CPU chip? Did Intel make a broken IDIV instruction? (Of course, that would be impossible!)

    I sweated over this problem for about a week. Then I invited the hardware engineer, Scott (who was also my boss), over to look at it. His first act was to put a meter on the DC power. It read 4.4VDC instead of ~5VDC. I replaced the power supply and POOF! no more IDIV hangs.

    Lesson Learned: Check power and ground.
  • I have two that I have come across that drove me nuts, and they were both related, though separated by 5 or so years. The first happened while I was working for Progressive Strategies (now The Edison Group) in New York City. Attached to one of the PCs was a mouse that would go haywire when it crossed a certain portion of the desktop. But it didn’t happen every time. After a while we narrowed it down to a certain time period during the day that it would go crazy. And the place on the desk where it would do its thing would move as well. After spending way too much time trying to figure out the problem we discovered the answer. But before I tell you what it is, let me tell you about the other bug as they are somewhat related.

    A few years later I moved to Sunnyvale, California. Before I arrived I found a 1988 SAAB 900 SPG for sale in Petaluma. When I picked it up, the previous owners told me that the radio had a weird tendency to spit out the CD every now and then. They had taken it into the shop a few times but nothing seemed to be wrong, so the shop simply swapped out the radio for the same model from Clarion. But the problem persisted. After a few months of driving around with this car I discovered that it only spat out a disc when I was driving in a certain direction…and then only at certain times of the day.

    So what was the problem? Well both of them were caused by the sun. The mouse had a space between the two buttons that, when the sun would shine on it at the right angle, would make the mouse get confused about where the mouse ball was rolling. Similarly, when the sun would shine through the sunroof of my Saab and into the CD slot of my car stereo, the sensor that recognized the presence of a CD got confused and spat out the CD.

    So that is my favorite bug….same bug, different devices.

  • Matt, that's a brilliant bug.

    One that happened to me a long time ago: I had a 386SX16 with 1MB of RAM on the mobo. Added 2MB in SIMMs (remember those?). The system then refused to boot. I took the entire system to the store. They opened it up, reseated the SIMMs and started the computer. It booted up just fine.

    I returned home and started the system. Same problem as before. Turns out that the problem was caused by the power supply cables being near the SIMMs and causing interference. At the store, they had taken out the power supply to gain access to the SIMMs, moving the cables out of the way and allowing the system to boot.

    Yet another hardware problem that can be solved with duct tape.
  • Surely some (but not all) Intel processor errata are due to the same kind of EMI problems.

    At a former employer one of many problems was EMI from a board, an ordinary product purchased in the marketplace, putting noise on an IDE cable. Of course it took quite a while to track down the cause of Windows 2000 hanging. Some kinds of absurd operations seemed to be causing hangs, such as a mouse click on a menu item to refresh a view. It finally turned out that when Windows 2000 detected an error in a signal from a CD-ROM drive, Windows 2000 insisted on writing an event to the log file. If the hard drive was using the same IDE cable then the system never came back. Windows NT4, no problem. Windows 2000 with separate IDE cables, no problem (but there were log entries, which started to give a hint).

    By the way what does EMF stand for? I learned EMI as electromagnetic interference, not always caused by static on vinyl records ^_^ And then here was Matt Williams blaming the sun, but I thought you guys were only supposed to blame linux, 'cause if you blame sun then who's next, sco? ^_^
  • EMF can be several things depending on what you're talking about. Usually it's either ElectroMotive Force (i.e. voltage) or ElectroMagnetic Field. If you're talking about interference, it's probably the latter.
  • Many years ago one of our installations with PDP computers would reboot during the night. After weeks of trying to track down the problem, we discovered that the night operator would pull the PDP away from the wall and then go to sleep behind the computer. In his sleep he would knock the plug out of the wall.

    :P
  • I changed the plugs and wires in my '97 Cadillac Deville. After about a week, I noticed that the fan motor was making strange noises, and the noises would change relative to the speed of the engine.

    It stumped everyone, until the fan motor burned out and we bought a new one, which came bundled with a foil EMI shield. It turns out that when I changed the spark plug wires, I left one of the wires too close to the fan motor. The motor's controller used small voltage variations to change the fan's speed, and pulses from the spark plug wire caused the strange fan noises and burned out the motor!

  • Back in the dark ages of the 1980s I fancied myself an EE. Once frequencies on the boards started to equal that of, say, WIBA-FM, all I seemed able to create was EMI radio stations. I knew at that moment that my future lay in software.
  • I'm not sure I'd call it my favourite bug.

    In the MSN division, I seemed to be the nettoyeur for all the ugly bugs.

    We were close to shipping.

    We had this memory corruption happening during printing. Using the debugger, the memory corruption turned out to be a buffer overflow in memory sent down to the USB printer.

    So using one of the tricks I picked up in the Exchange/NT group, I allocated memory with an uncommitted guard page after the valid memory, rebuilt the WinCE+product image, and started up the debugger.

    Voila.

    Nada. No memory corruption. The allocated memory was being filled with valid data, but any overwrite into the guard page wasn't raising an exception.

    If I committed memory for the guard page and followed the repro steps, then I'd get a buffer overflow.

    Conclusion: The hardware was overwriting the buffer by a small amount of memory, ~8 bytes.

    Solution: Round up any memory passed to the USB driver so the valid size was at least 8 bytes more than needed.


    For the next version of the product, they backed out my 'fix'. Lo and behold - buffer overflow occurs. Since we were months away from shipping this time, the assigned dev got hold of an ICE and discovered the same thing I did with a user-mode debugger - the USB chip was writing beyond the buffer. Back in goes my fix. The assigned dev sent off some email to the hardware vendors.

    It turns out the USB chip in question had a bug. They sent some instructions on how to work around it.
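The guard-page trick in the comment above can be pictured as a layout calculation: the buffer is placed so that its last byte sits immediately before an uncommitted page, and any write past the end by software faults at once. The following is only a sketch under assumptions, not the commenter's code: the 4096-byte page size and the names `guarded_layout` and `padded_size` are hypothetical, and a real implementation would reserve and commit the pages through the operating system (for example `VirtualAlloc` on Windows) rather than just compute offsets.

```python
def guarded_layout(size, page_size=4096):
    """Compute where a buffer of `size` bytes should start so that it
    ends exactly at the start of a trailing guard page.

    Assumption: 4096-byte pages; offsets are relative to the start of
    the reserved region.
    """
    if size <= 0 or page_size <= 0:
        raise ValueError("size and page_size must be positive")
    data_pages = -(-size // page_size)          # ceiling division
    guard_offset = data_pages * page_size       # first byte of the guard page
    return {
        "reserve_bytes": guard_offset + page_size,  # data pages + one guard page
        "buffer_offset": guard_offset - size,       # buffer ends at the guard page
        "guard_offset": guard_offset,
    }


def padded_size(size, slack=8):
    """The workaround from the comment: hand the driver a buffer with
    `slack` extra bytes so a small hardware overrun lands in padding."""
    if size < 0 or slack < 0:
        raise ValueError("size and slack must be non-negative")
    return size + slack


layout = guarded_layout(100)
print(layout["buffer_offset"], layout["guard_offset"], padded_size(100))
```

A guard page only traps accesses made by the CPU. A device writing straight into memory does not go through the same page protection, which fits the comment's observation that no exception was raised even though the overrun was real.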
  • We were working on an in-house designed newspaper reporting workstation in 1982. (Later to be replaced by a PC with custom software.)

    I was working on the "high" speed communications uplink... a 9600 baud synchronous modem. It was only getting about 12% of the rated speed. We tracked it down to dropped characters and retransmission timeouts but were stumped by the frequent data corruption errors.

    Until we got out the circuit diagrams. The comm jack was on a small board that plugged into the motherboard. In the socket the pins for send and receive clocks had been reversed, so we were timing the received data against the send clock. This worked great most of the time, but about every 20 seconds the drift from the clocks being slightly out of sync would corrupt a dozen bits or so!
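As a rough back-of-the-envelope check on the numbers in the comment above, one can estimate how far apart the two clocks would have to be. This is a sketch under a loud assumption that the comment does not state: that each corruption burst corresponds to the two clocks slipping by one full bit period relative to each other. The function name `clock_offset_ppm` is hypothetical.

```python
def clock_offset_ppm(baud, slip_period_s):
    """Relative frequency offset, in parts per million, at which two
    bit clocks drift apart by one bit period every `slip_period_s` seconds.

    Assumption: one corruption burst per full one-bit slip.
    """
    if baud <= 0 or slip_period_s <= 0:
        raise ValueError("baud and slip_period_s must be positive")
    bits_per_slip = baud * slip_period_s   # bits clocked between slips
    return 1e6 / bits_per_slip             # one bit of drift over that many bits


# 9600 baud, one slip roughly every 20 seconds: about 5.2 ppm
print(round(clock_offset_ppm(9600, 20), 2))
```

Under that assumption the clocks differ by only a few parts per million, which would explain why sampling against the wrong clock "worked great most of the time".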
  • The best one I came across was an old VAX which was being used in a steelworks for keeping track of material as it was being transported around the site. Every day, at just after 5pm, the machine would disappear from the network for 5 minutes, and then reappear having logged a reboot. When someone sat in the room with it, nothing happened, so we left it alone again and, sure enough, it rebooted again. Eventually we realised that the cleaner had been going in, unplugging the power from the wall to plug in her vacuum cleaner, and then plugging it back in when she left.
  • I once dropped a bead of sweat into a big new HP server I was tasked with (forced by HP) assembling. On power up, the video was terribly corrupted with divots and colored squares, etc. Not realizing the sweat issue, I opted to go to lunch. About 1-1/2 hours later, it booted fine and has worked flawlessly since.
  • That's funny, Wound, the same thing happened to Kimberly Tripp.

    http://www.lazycoder.com/weblog/index.php/archives/2005/08/05/beware-urban-legends/