Hector Martin

I think I never told this story here... how I fixed a server with a very precisely placed piece of tape.

So at Euskal Encounter we got shiny new servers a few years back, and they worked great except one of them developed a peculiar problem. It would not shut down.

When told to shut down, it would either hang, or boot back up, or power back up but then fail to boot. This was a problem, because we normally relied on servers shutting down and staying down during our shutdown procedure. Having to have someone babysitting the machine to yank the power is not great. Plus it meant that if we ever got into that state, we couldn't fix it remotely (and some events are run remotely). Once the problem happened, no amount of shutdown/power up/reboot commands to the BMC would fix it (eventually it would start logging power control errors).

So we pulled the server out after an event, and sent it for RMA. It came back saying the techs couldn't reproduce the problem. And indeed, we powered it up on the bench, and it seemed fine.

Stuck it back in the rack at the next event, and it stopped working again.

At this point I was thinking this must be some kind of electrical issue caused by mechanical stress, so we tore it apart and jiggled all the cables and made sure all the connections were tight.

No dice.

This whole thing took several years, since we could only really work on the machine during events (and I kind of live halfway across the world). It just kept on limping on with that bug since we couldn't find time to dive deep into the issue.

At one point I started thinking... What's the difference between the server being in the rack and not? That all the cables are plugged in, particularly USB and Ethernet cables.

Could it be Wake-on-LAN? So I checked the WoL settings, but it was indeed switched off on all the Ethernet interfaces. And besides, we had two identical servers and only one had the issue. I sniffed the network looking for stuff that might pass as a WoL magic packet, but came up empty.
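
For reference, a standard WoL magic packet payload is six 0xFF bytes followed by the target MAC address repeated 16 times, so scanning a captured payload for one can be sketched roughly as below. This is only an illustration, not the tooling used at the event: the function names are hypothetical, and getting the raw payload bytes out of a capture is assumed to happen elsewhere.

```python
from __future__ import annotations


def build_magic_packet(mac: str) -> bytes:
    """Build the standard WoL payload: 6 x 0xFF, then the MAC 16 times."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(mac_bytes) != 6:
        raise ValueError("expected a 6-byte MAC address")
    return b"\xff" * 6 + mac_bytes * 16


def find_magic_packet_targets(payload: bytes) -> list[str]:
    """Return the MAC addressed by each magic-packet pattern in payload.

    Hypothetical helper: looks for the 6-byte 0xFF sync run anywhere in
    the payload and checks that the next 96 bytes are one MAC, 16 times.
    """
    targets = []
    sync = b"\xff" * 6
    start = payload.find(sync)
    while start != -1:
        mac = payload[start + 6 : start + 12]
        body = payload[start + 6 : start + 6 + 96]
        if len(mac) == 6 and body == mac * 16:
            targets.append(mac.hex(":"))
        start = payload.find(sync, start + 1)
    return targets
```

Feeding it `build_magic_packet(...)` wrapped in arbitrary bytes should return that MAC, while ordinary traffic should return an empty list. A card raising spurious wake signals with no such packet on the wire would pass this check cleanly, which is consistent with the sniffing coming up empty.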

Still, I couldn't find another explanation, so I did the logical test. Unplugged the Ethernet cables, and tested it. It worked fine. Plugged the cables in. The problem reappeared.

Oooookay.

In particular, it was the 4 cables connected to the add-on PCIe network card.

So I swapped the cards on both servers and guess which other server started having buggy shutdowns!

Just in case, I tried upgrading the firmware on the card, but that didn't help.

At this point I'm starting to think about RMAing the card, but that would take time and it'd be hard to explain what the problem is. Buying another card would be an extra expense, and cause us to have different configurations on both servers (which is less desirable).

And then I thought... I'm never going to use this feature, ever. These are servers with BMCs, we can turn them on over IPMI. So this Ethernet card is sending broken/random wake signals to the PCIe slot when it has an Ethernet link? Okay.

I asked for some tape and scissors, pulled the server out again, took the card out, carefully cut out a small sliver of tape, and placed it over the WAKE# pin on the PCIe edge connector. Put it all back together and tested it again.

Problem fixed.

42 comments
aburka 🫣

@marcan Now waiting for the war story from your successor "you'll never guess what I found when I took apart this one server that wouldn't wake-on-lan"

Christian Gudrian

@marcan This reminds me of one of my French lessons where our teacher used to play back tapes with French dialogues he recorded off the TV at home.

Once during a break we wrapped one connector of the power plug with transparent tape.

Problem fixed.

Phil Thane ✅

@cgudrian @marcan Ah, techie tricks on teachers. This one took two weeks. We had an old-fashioned class room with glass in the door and brass door knobs. Week 1 smeared the doorknob with transparent glue (Bostik) and enjoyed teacher's expression through the glass. Week 2 teacher arrives, grins through the glass at us and carefully touches the doorknob with a finger tip. By the time he picked himself off the floor the HUGE and thoroughly charged capacitor had disappeared.

vince

@marcan very kind of you to not just rip off the whole pad... i wonder if the tape will stick in place after the rack unit is decommissioned and thrown wholesale into the shredder 😆

Hector Martin

@0x56 The rack is a custom pallet rack made of wood, I'm sure if it ever gets decom'd we'll find it a worthy home ;)

Si Dawson

@marcan Ok, THAT is some advanced diagnostics+fixing. I'm impressed.

0x10f

@marcan IIRC, you could unlock the multiplier on some Pentium II CPUs by covering one of the pins (SEL66/100#) with tape. (Some mobos also let you change the setting in BIOS.)

Apicultor 🐝

@0x10f @marcan Holy shit, what a memory you just brought back.

I feel old now.

Asshole. 😂

CTHULHU KAWAII

@marcan The "Support will never understand it so I'll fix it by myself" is one of those little things that makes anyone excellent but management and recruiters fully ignore.

Drew Mayo

@marcan @asie truly awesome example of lingering hidden tech debt.

Matthieu

@marcan Nice story there. Thanks for sharing 😆

Rook

@marcan Certainly quite the inventive solution. I'm definitely not that resourceful as I likely wouldn't've thought of doing that. Pretty neat.

KasTas

@marcan reminds me of the early-teen wonder of having some random decommissioned motherboard on the table and discovering that you can short a couple of PCI pins to turn the PC on and off....

BlueDotAZ

@marcan That’s some fancy troubleshooting there. Well done!

chrismarget

@marcan You've just reminded me that add-on NICs used to have a wake-on-LAN *cable* for the same purpose. Been a long time since I've thought about that.

wb x64

@chrismarget @marcan always fun to wire the nic activity led pins to the front panel power led so you have network and hard drive activity indicators

Jim Luther

@marcan That’s a good story. It reminds me that I need to write down some of my stories from my two careers - first as an electronics tech/field engineer, and then as a software engineer.

karolherbst 🐧 🦀

@marcan back in my PowerMac days, I installed an AGP x8 GPU into my AGP x4 machine and it required two pins to be cut off/taped over :) fun times

Kermode

@marcan I love hardware war stories :-)
I was on a course once and I said to a guy how I was never on a course with anyone from my area. The guy from Sperry said they do it on purpose, because it spreads all our war stories in the bar after the day. I don't think that's true, but I do know I often learned more talking to other techs in the bar than in the course.
Software bugs are a dime a dozen - I can fat finger any language into breaking before first coffee, but there's something enormously satisfying about tracking down a pesky, intermittent hardware bug! :-)

curtmack

@marcan The cheapo 3D printer control board I have tries to power on when I plug it into my Raspberry Pi's USB, so I have a piece of tape covering the +5V pin on the USB A connector.

(This was also how I learned that a +5V connection is unnecessary if you only want data and both devices are powered independently during real use.)

Scott Laird

@marcan ahh, this reminds me of a different server Ethernet power problem I once had. Around 2000, we'd just received a new batch of test servers for the lab, before changing our new POP buildouts. These were using Intel's first generation of IPMI motherboards. I racked them up at the end of the day on Friday, then came back on Monday and went to install them, and 0 of 12 would even POST. I hit the power button, LEDs lit up, but no video, and nothing happened.

Scott Laird

@marcan I unracked one and moved it to the bench. Plugged it in, booted fine. WTF?

It took me a shockingly long time to figure out that I'd racked them and installed Ethernet and KVM cables (test lab...) but had totally failed to finish the *power* cables on Friday before leaving.

But they were able to pull enough current over the *Ethernet* cable somehow to power the BMC, which controlled the front panel LEDs, so the power switch actually "worked".

Hector Martin

@laird I bet it was the KVM cables. There's no way for Ethernet to power anything other than via PoE (which needs explicit hardware on the receiving end), but HDMI/VGA cables have a convenient 5 volt pin. It's supposed to be from the machine to the monitor, but I wouldn't at all be surprised if some random KVMs get that backwards and backfeed power into the hosts...

Scott Laird

@marcan that's pretty much what I'd expected at the time, but the problem followed the Ethernet cable. I could repro with nothing else plugged in.

This was (iirc) an Intel L440BX+ motherboard and an early Cisco 4006 switch. I'm pretty confident that whatever was happening, it was *way* out of spec.

Intel's first couple server boards (the T440BX and L440BX+) were *weird*. Like, the T440BX powered the CPU from the PSU's 3.3V line, drawing 2x the amps most early ATX PSUs provided.

Scott Laird

@marcan ah, L440GX+, not BX+. That was the one where the AGP bus was bridged into an extra 64-bit PCI bus.

Hector Martin

@laird That... that must really have been out of spec. If it was not PoE I have no idea how you managed to power a BMC through a transformer-coupled Ethernet link!

Mia Luna Tearmoon

@marcan This is how I temporarily fixed an HDMI hot plug detect issue on a Bravia telly before, tape over the pin.

Of course, it only lasted for a bit and I had to go for a more involved approach, soldering the HPD wire to +VCC through a 1 kOhm resistor.

Michael Miller :blobrdm: 🦆

@marcan same way I fix the 3.3vdc HDD issues for my NAS 😎

Erik S

@marcan I had to mask out the SMBus pins on an older 4-port card so that I could use it in a more recent thin-client that I'm using as a firewall. Otherwise the machine gave a memory error during POST.
