195 comments
purple 👊✊💨

@thegibson@hackers.town have you not SEEN the crap i post??

i post it because it's a learning experience -- i just hope someone else is able to learn along with me

Arcans

@thegibson Failed so far at being a well-adjusted to society "normal" adult with "normal" goals and expectations in life.

Not sure that's such a bad thing though.

CatButtes :verified_coffee:

@TheGibson I knock out our SharePoint farm on a fairly regular basis. 90% of the time it’s because I forgot to rate limit a script and just told it to move/delete 10k files in the space of a few seconds. It will happily accept the requests and start processing in the background until it can’t any more. Then it goes splat…
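
A minimal sketch of the kind of rate limiting described above, assuming the bulk job can be expressed as one call per file. The names `run_throttled`, `batch_size` and `pause_seconds` are hypothetical and nothing here is SharePoint-specific; the idea is only to pause after each batch so the backend can drain its queue.

```python
import time


def run_throttled(items, action, batch_size=100, pause_seconds=1.0, sleep=time.sleep):
    """Apply `action` to every item, pausing after each full batch.

    `sleep` is injectable so the pacing can be tested without waiting.
    Returns the number of items processed.
    """
    processed = 0
    for item in items:
        action(item)
        processed += 1
        # Pause once a full batch has been submitted.
        if processed % batch_size == 0:
            sleep(pause_seconds)
    return processed
```

Sensible values for the batch size and pause depend entirely on what the server can absorb, so they would need tuning against the real system.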

CMDR Yojimbosan UTC+(12|13)

@thegibson Small business customer in a remote site, 15 years ago. Had a linux server driving their network for email and filestore. RAID card failed, and no replacement cards were available, so we loaned them a spare box & copied from the one working drive. All good so far ...
Then a few weeks later we got the replacement card, so the old server was reinstalled & I took it to them (a 4 hour drive, the company's owner took me there).
In a small cramped space, set up the new server, and ran an rsync from the old one to copy the last few weeks data across.
It ran really fast.
Really really fast.
And didn't list any files ...
Because there weren't any.
I'd sync'd the new blank machine onto the old "full of customer data" one.
There were no backups of course, because the customer didn't have a backup facility, they had RAID cards ...
The owner drove me back home that evening, another 4 hours. It felt much much longer ...
@lightweight might know if I was ever forgiven ... I was working for him at the time, it was his customer ...

Dave Lane 🇳🇿

@yojimbo @TheGibson hah! It's been a weird day of unexpected memories. This one was pretty memorable, for sure. I felt very bad for you, as I think you did take a lot of precautions.. and I felt very bad for the customer. We continued working with them until I sold the business quite a few years later. I don't think it was discussed much subsequently, but I don't think it affected them too badly. :). I hope it hasn't troubled you overmuch since.

CMDR Yojimbosan UTC+(12|13)

@lightweight @thegibson Luckily it was off-season for them, so there were no customer enquiries to track; and the only thing they'd been using email for was organising new uniforms from a supplier; so the supplier would have had all the copies of correspondence. So they were as mellow as possible under the circumstances!

In a technical interview a few years later I was asked "what's the worst thing you've done to a customer?" so I told the story. "What would you do differently?" was the next question, so I said I'd change the prompt on the old and new server using something like the MAC address to check (as that's the only unique identifier). The unix beard nodded, and said "yes, that's what I did when I had a job like that; but I got the prompts the wrong way around ..."

So I got that job :-)

I keep that question for interviews with people who have had sysadmin responsibility, it's a good one.
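
One possible extra guard against syncing in the wrong direction, sketched under the assumption that both trees are reachable as local paths. `count_files` and `check_sync_direction` are hypothetical helpers, not part of rsync; the check simply refuses to continue when the source holds fewer files than the destination.

```python
import os


def count_files(root):
    """Count regular directory entries reported as files under `root`."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total


def check_sync_direction(source, destination):
    """Raise RuntimeError if `source` looks emptier than `destination`.

    Returns (source_count, destination_count) when the direction looks sane.
    """
    src_count = count_files(source)
    dst_count = count_files(destination)
    if src_count == 0:
        raise RuntimeError(f"source {source!r} is empty, refusing to sync")
    if src_count < dst_count:
        raise RuntimeError(
            f"source has {src_count} files but destination has {dst_count}, "
            "check the direction before syncing"
        )
    return src_count, dst_count
```

A check like this is a heuristic only. Running rsync with `--dry-run` first and reading the file list is another way to catch the same mistake.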

Martin, Regulus Tithonium

@TheGibson one of my first projects at Amazon (in 2000) was to put a wishlist button everywhere there was an add to cart button. I reversed the “add to cart” v “preorder” logic on music search, probably cost ~$100k in sales over 24hr. More than twice my annual salary at the time.

SmittyHalibut

@tithonium @thegibson I too have caused a NA Order Drop!

Me: *clicks button*
TOS: Order drop!
Me: …was that me?
Coworkers: Nope. We do this all the time. It’s good.
Me: Ok… *continues*
TOS: *continues debugging order drop*
Me, 15 minutes later: *finishes, clicks button again*
TOS: Order recovery!
Me: …
*30 minutes pass*
Me: I need to click the button again.
Coworkers: It’s cool, wasn’t you.
Me: *clicks button*
TOS: Order drop!
Me: *clicks button again, steps away*

Alexander Bochmann

@TheGibson Building a datacenter network on Juniper Virtual Chassis, after initial testing seemed promising.

It saved us a ton of work at a time where network automation was not yet a thing, and brought us into the 10G (EX) and 40G (QFX) era... But after half an hour of downtime during one of the "non-stop software upgrade" procedures (that was the final straw, there were problems every time before), no one dared touch the things for another upgrade for years until the platform was replaced.

Alexander Bochmann

@TheGibson I got a job offer after spending some (too much) time on writing a Perl script that did traffic accounting from TIS FWTK log files, 1995-ish. It worked quite well, until one day it didn't, and all the output was garbage.

I didn't know about integer overflows.

Luckily for me, Math::BigInt showed up somewhere around the time that happened.
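
A small illustration of how a running total can turn into garbage, assuming a signed 32-bit counter. That width is an assumption made for the example; the post does not say how the original Perl script stored its totals. Python integers never overflow, so the wraparound is simulated by hand in the hypothetical helper `wrap_int32`.

```python
def wrap_int32(value):
    """Reduce `value` to what a signed 32-bit two's-complement integer would hold."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def total_bytes(counts, wrap=False):
    """Sum byte counts, optionally wrapping the running total at 32 bits."""
    total = 0
    for count in counts:
        total += count
        if wrap:
            total = wrap_int32(total)
    return total
```

With two transfers of 2,000,000,000 bytes each, the unwrapped total is 4,000,000,000, while the wrapped total comes out negative, which is the sort of output that stops making sense in a traffic report.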

Dorian Gray’s Papaya

@thegibson updated around 4 million records in production with bad data. We have robust rollback mechanisms, so it only took about a day to fix.

D. Düsentrieb

@TheGibson In the first few months at my first job (at an online payment provider) I accidentally sent around 110000€ to the wrong people

I think we got like 107000€ of them back

kemonine

@thegibson i have 2

once i broke diagnosis history tracking in a large medical record system and the bug made it to production ; they lost 3 months of diagnostic history summaries due to the bug (this is v. bad for those not in healthcare)

one day i got a panicked phone call from a customer because the 'fix' i had just deployed to production broke /all/ medication dispense data entry in the system ; the entire nursing staff had to cancel their appointments for the day

f.rift :fire_blue:

@thegibson
Froze backups across my entire section, because our workstations were connected as a Beowulf cluster and nobody told me that saving data continually would keep the disks active, preventing backup. Only found out because I had no cubicle so was in the main room when the very frazzled admin came in trying to solve it.

f.rift :fire_blue:

@thegibson
My biggest fail is that I have so few fails because despite the passage of time I have relatively little experience.

f.rift :fire_blue:

@thegibson
Outside of tech... The entire recent situation. All of it. Worst mistakes of my life contained therein.

Rolli

@thegibson First day on a gig, I broke some libraries that a bit over a hundred people were depending on because I didn’t understand their dev process. Then went to orientation for the next two hours and was unreachable while they tried to figure out what just happened :). Unbroke it in five minutes when I found out at least

XorOwl

@TheGibson fucked up a raid and lost a week's worth of accounting data for three companies.
I still have my job and have had the chance to redeem myself

SmittyHalibut

@xorowl @thegibson the sign of a good company.

Good decisions come from experience. Experience comes from bad decisions.

You just gained a bit of experience. :-)

XorOwl

@smitty @TheGibson exactly! i've helped to cultivate exactly such a culture at my work.

Chrisshy Keygen

@thegibson

source local.env
bash scripts/replace_db_with_dumpfile.sh

"Huh... it's taking longer than usual"
...
"Huh, the local env still has the old data... wtf?"

export ENVIRONMENT=production

"Well, that's probably not great..."
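
A sketch of a guard a destructive script could call before touching a database, assuming the target is selected by an `ENVIRONMENT` variable as in the post above. The function name and the allowlist are hypothetical. The check fails closed: anything not explicitly listed as safe, including an unset variable, is refused.

```python
import os

# Assumption: these are the only environments where wiping data is acceptable.
SAFE_ENVIRONMENTS = frozenset({"local", "development", "test"})


def require_safe_environment(env=None, variable="ENVIRONMENT"):
    """Return the environment name, or raise RuntimeError if it is not allowlisted."""
    env = os.environ if env is None else env
    value = env.get(variable, "").strip().lower()
    if value not in SAFE_ENVIRONMENTS:
        raise RuntimeError(
            f"{variable}={value!r} is not in the safe list, refusing to continue"
        )
    return value
```

An allowlist is used rather than a check for the word "production" so that a typo or a missing variable stops the script instead of letting it through.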

χaʀⱱøs

@thegibson i directly edited database on production server and gave the client an unexpected april’s fool

David Chartier

@TheGibson Early in my writing career at a moderately popular Apple news site (TUAW) I wrote about what I thought was a silly little piece of shareware that opened the CD drive on Mac towers.

Turns out it also ran a script that moved ALL your calendar events back by two weeks. Readers were not happy.

The person who made it never intended for it to get picked up in the news. They gave me a script readers could use to fix it all.

Ninja :blobninja: :nsbadgeblue:

@chartier @thegibson

Used to check TUAW daily when I had my first Mac. (Aluminum PowerBook G4) :blobthumbsup:

Max "Max Sweaty" Eddy

@chartier I spent way too much time reading TUAW back in the day. Really appreciated all yall's work.

David Chartier

@maxeddy Hey thanks for reading and the kind words.

abortretryfail

@thegibson

Learned the hard way that `killall` on AIX is not the same as `killall` on Linux. I use `pkill` now instead...

Taggart: ~# :idle:

@thegibson Brought down a school's wireless infrastructure for 2 days because I did not yet understand VLANs. Did not leave campus until every switchport was manually repatched, labeled, and tagged/trunked.

megabyteGhost / beatMage / G

@thegibson One time, at my $dayJob, I completely wiped out a website without backing it up first.

Luckily, the host does a backup every 24 hours so we were able to get it back up in just a couple hours.

Why didn't I back it up before I did what I did to wipe it? I don't know.

Overconfidence in my abilities I guess.

Matthias Schmidt

@thegibson Nearly 20y ago, I was the new guy at Openwall/GNU/Linux. Wanted to commit my first packages and documentation, however, my CVS experience was quite low. I ended up creating directories in the main repo and committing crap that the mighty Solar Designer then had to revert. He even explained CVS basics to me. I was so embarrassed...

Your friendly 'net denizen

@thegibson Maybe too esoteric, but I once spent weeks (maybe a month) trying to track down a bug in my verilog code for an FPGA on a new board bringup. I was stressed, the team was stressed. I was rewriting entire modules and theorizing strange metastability bugs. In the end, the problem was that I had neglected to check a particular box in the configuration for the project so that the generated bitstream would kick the FPGA out of its initialization sequence after it loaded the bitstream. :/

Your friendly 'net denizen

@thegibson I also once walked across a carpeted lab, reached out my hand to push a reset button on an expensive electronic board I was working on, and a nice static discharge arced from my finger before I could hit the button. ALL the smoke started to come out of that board. 😆

Jack McKracken

@thegibson not my worst one, but the funniest one:

Fucked up the computers of a thousand+ of our lanparty guests, since we set DHCP option 46 to only use WINS and not broadcast for NetBIOS name resolution (p-node).

Windows set that flag permanently and people were not able to find other computers back at home anymore.

The solution was of course editing something in the registry by hand to restore the original behavior. Took us a week to figure that out, people were pretty pissed 😅

SeTec Astronomy :blobbandage:

@thegibson

Kind of tech related:

Coworker had spent days trying to make replacement radio transmitter work. The receivers were all acting like they were expecting PL (sub-audible tone), but he was insistent that the old transmitter wasn't configured for PL. Boss sends me down. I read all the docs. I try *everything.* Days pass. No progress made. Boss finally arrives, pulls out service monitor, in a matter of seconds "Look, there's PL on the carrier, right there"

Apparently co-worker was looking in the wrong spot in the programming of the old transmitter and I just assumed he knew WTF he was talking about. Lesson learned.

SeTec Astronomy :blobbandage:

@thegibson

Not tech related:

Roommate/homeowner wants to tile his countertops. We measure everything carefully. Note down the sizes of each section in inches. Add them all up. OK, this tile is sold in square feet. No problem, just divide by 12...

We had a LOT of extra tile.
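
The arithmetic behind the extra tile, as a short sketch. Areas measured in inches come out in square inches, and a square foot is 12 × 12 = 144 square inches, so dividing by 12 overestimates by a factor of 12. The section sizes in the usage note are made up for illustration.

```python
SQ_INCHES_PER_SQ_FOOT = 12 * 12  # 144, not 12


def square_feet(sections_in_inches):
    """Convert a list of (width, length) sections in inches to total square feet."""
    total_sq_inches = sum(width * length for width, length in sections_in_inches)
    return total_sq_inches / SQ_INCHES_PER_SQ_FOOT
```

For two hypothetical sections of 24 × 96 and 24 × 48 inches, the total is 3456 square inches, which is 24 square feet. Dividing by 12 instead would suggest buying 288.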

notptr

@TheGibson I ran svn up on our production server during working hours.

Coyote of the Brown Grass

@thegibson IBM cash registers of old used a version of Token Ring called Store Loop. I was replacing an IBM cash register base on Black Friday and instead of unplugging it from the IBM data connector on the wall (which would self-short to prevent opening the loop) I unplugged it from the back of the register.

Every cash register in the store immediately stopped. Hard.

All the cashiers went "Hey, did your register stop?" I said "Yes, not sure what happened." as I hastily plugged it back in.

Coyote of the Brown Grass

@ColinTheMathmo @thegibson Huh, interesting stuff!

I wonder if a backpressure mechanism would work too. Fibre Channel had a backpressure mechanism that would say "stop, I'm full" when the buffers hit max, because FC was a lossless protocol.

Colin the Mathmo

@gedvondur The problem in this case was that the comms system couldn't handle the traffic being put in the front end. So somehow either the processing capacity needed to be increased -- which wasn't possible -- or the input needed to be reduced. Slowing it gently across the entire store worked well and was implemented elegantly.

@TheGibson

Colin the Mathmo

@gedvondur The underlying user-facing problem was that flow of data through the system has to be continuous and couldn't be seen to stutter at any point. So making the tills run a bit slower seems the only thing to do.

@TheGibson

kitns & tamsyn & tails, oh my!

@ColinTheMathmo @gedvondur @TheGibson that made me think of garbage collection, and the similar trade-offs in play there

starfall

@thegibson DNS on the wrong network interface.

I think it took 2 days for the university to get back online...

notptr

@TheGibson another one I thought of. I dd my home partition on my computer and didn't have up to date backups for it.

pamela :flan_butterfly:

@thegibson I botched DNA extraction on samples that had been hand-retrieved from an endangered species.

NixFREAK -

@pamela @thegibson Holy shit ! That beats everything. outch .. What species ?

hacknorris

@thegibson me? right now today is a mistake. i made a poll post for peeps to decide what i should do in my free time... included coding and playing games...
just that they dont know what i was gonna code and in which language... and the worst - poll is mostly answered as coding...
;-;
there :
mastodon.social/@hacknorris/10
any help ?

johanna, at the cafe counter

@TheGibson not 100% on me, but I ran the script without fully understanding someone used a fixed string length instead of a delimiter on a big DB insert, and wound up garbling an entire database full of medical record numbers once the character strings switched from 7 to 8 in length.

This is why we no longer let regional field people wing it on customer requests to “correct” database contents with their own scripts.
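
A sketch of the failure mode described above, assuming records where an ID is followed directly by other data. The record layout, the `|` delimiter and both function names are hypothetical. The fixed-width parser works until IDs grow from 7 to 8 characters, at which point the extra digit leaks into the next field.

```python
def parse_fixed_width(record, id_width=7):
    """Split a record by assuming the ID always occupies `id_width` characters."""
    return record[:id_width], record[id_width:]


def parse_delimited(record, delimiter="|"):
    """Split a record at the first delimiter, whatever the ID length."""
    record_id, _, rest = record.partition(delimiter)
    return record_id, rest
```

For example, `parse_fixed_width("12345678SMITH")` returns `("1234567", "8SMITH")`, while `parse_delimited("12345678|SMITH")` keeps the full ID intact.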

Max "Max Sweaty" Eddy

@TheGibson At one of the last Moscone Google I/O events, I was standing around with a few other journos. A Google PR person introduces someone to the group and I start chatting and asking some very small talk questions. Everyone else starts acting very weird. It's clear this dude is someone important and I haven't recognized him. I shut up.

Later, my colleague filled me in on who he was: Larry Page.

Max "Max Sweaty" Eddy

@TheGibson Names/faces are not my strong suit. I am constantly re-introducing myself to people. 😬

dctrud

@thegibson spent til 2am on phone with vendor for a 2.5PB HPC storage system, reading kernel panics and module code. Clients and servers had gotten to a state where the metadata servers just boot-looped, blocking all work.

Eventually found complex workaround to use til bug was fixed by vendor. Got nice amount of recognition for my efforts and skills...

Next day approved & applied a change that obviously & trivially brought it down again, costing 100s of scientists 10s of K CPU-hours of work.

dctrud

@thegibson foolishly I stayed in that HPC job for a while longer.

Somehow (despite general lack of networking background) was made responsible for overseeing a wholesale re-wire of the Infiniband Network for expansion of a cluster.

Rats nests of cables... my oversight failed to catch a label with '6' being read as '9' by a colleague. Didn't have a downtime impact. But took literally weeks for someone else to find and correct the performance issues that resulted. Unmanaged IB switches suck.

Qyv (;* Semi-Tootergeist *;) 🍍

@TheGibson
In my first ever computing-related job I once moved the entirety of the source code for our main application off the backup server onto my work computer. Didn't admit to it at the time, but subsequently found out that the system admin had managed to recover everything from a prior backup, and decided to grant me higher user permissions than anyone else in the office since he thought I'd learnt my lesson and would never do anything as stupid again (:*

Jonathan Lamothe

I had a job where I was doing both hardware and software. I was testing a new controller we were developing. The testing involved having the board exposed while under live power, so I could test voltage levels in various states of operation.

My brain failed me and I disconnected the leads from the 120VAC input before unplugging it from the outlet. Line meets neutral. Sparks fly. The room goes dark.

The servers were on UPSes, but the network switch between them wasn't. They were in the middle of some sort of critical transaction in some third-party application and it corrupted the database.

We had backups, but it was not a good day.

Mrs Mouse :verified: :queer:

@thegibson was helping a business team out just because I ran the MFT server... they asked me to send this file- just a one off- and I put it in the outbound queue manually, bypassing some technological controls that I had implemented myself (i.e. file name had to have the client name (that got stripped out) in it, in the same client folder, etc).

Mrs Mouse :verified: :queer:

@thegibson The next evening a client asked what this unexpected file was. I investigated that night. I didn't sleep well. I didn't eat dinner. I didn't eat breakfast. I camped by my laptop in the morning waiting for my sr mgr to log in. then I messaged him "We need to talk, now. It's important".

Mrs Mouse :verified: :queer:

@thegibson Obviously to the reader, I dropped it into the wrong queue. It's work I wasn't supposed to be doing, just helping out coworkers at a deadline, but it technically became a data breach. The executive partnership incident response team had to be notified and the client had to be notified. We were lucky, it wasn't a PUBLIC breach, and the data was... "hard" to make sense of without a few years of context.

nomad :verified_pride:

@thegibson I have so many I can't remember them. I'm certain that I've done the "rm -rf / path/that/I/meant/to/delete" classic. I've cost users time and email. I've lost data to bad RAID configs or mis-remembering which was target drive and which was source drive.

I've had over 35 years to make mistakes. The thing I hope is that I don't make the same one twice.

Zack Weinberg

@thegibson there's been too many over the many years but the one i'm still currently cringing about is screwing up a formula in a spreadsheet and thus assigning 200 undergraduates incorrect grades

had to get the registrar's office to go into the database and delete them all

Chartodon

@ColinTheMathmo

Your chart is ready, and can be found here:

solipsys.co.uk/Chartodon/10829

Things may have changed since I started compiling that, and some things may have been inaccessible.

The chart will eventually be deleted, so if you'd like to keep it, make sure you download a copy.

evilchili

@thegibson there isn't a character limit high enough in the world to capture my failures. Here's a crucible moment though, from my first or second year in the business:

Working at a regional ISP in the 90s. I zeroed out /etc/passwd on the mail server. predated virtual users so every customer had a system account. Thought I had a) lost my job and b) killed the business. Boss said to me, "Well, now you're a real sysadmin. Go get the install CDs."

evilchili

@steffen @thegibson Without the ability to log into the box restoring the backup was impossible. I don't recall now if we yanked the drive, mounted it and manually repaired /etc/passwd or did an in-place OS install and then restore a backup or what. 25+ years ago :D

evilchili

@steffen @thegibson although I do recall we had backups on tape because I remember having to drag the scsi tape drive out of storage from time to time

kitns & tamsyn & tails, oh my!

@TheGibson i've mentioned mine before. three databases needed upgrading. i was working from home, and (um) may have had a glass of wine as i worked. i backed up the databases. i dropped the databases. i restored the first two databases without issue. i went to restore the third, the smallest and least important (but still pretty important, as it linked customer names to directories) - and...

...i couldn't find the backup.

gradually it dawned on me that i must not have made a backup.

we managed to restore most of them, but for several weeks afterwards we had customers ringing up asking where their websites (it was an ISP) had gone

lessons learned:

1. don't drink and data
2. it's not backed up if you can't restore it
3. i habitually overestimate my executive function, even taking point 3 into account

Mrs Mouse :verified: :queer:

@kitsune @thegibson
If you didn't just verify the backup, there is no backup.

If you haven't tested backups,
There is no backup.

Colin the Mathmo

@TheGibson Porting the NAG serial ForTran library to the Parsys Supernode, I had a directory with a lot of files.

$ "ls -al" ...

It took a while, then a while longer, then people started to raise their heads and wonder what was happening.

Turned out that "ls" was compiled with a fixed size table that I'd just overrun. By a lot. And by coincidence, smashed the stack and overwrote some important OS code.

It wiped the main disk, and everyone's work.

After that when anyone saw me coming ...

Colin the Mathmo

@TheGibson ... they would back up their work across the network[*] and logoff.

[*] The networking could only transfer files, so remote logins weren't possible. That had the benefit that you could see everyone else who was using the machine.

I always had the machine to myself after that, and I paid others the courtesy of letting people know when I'd be working on it so we could coordinate timings.

It really wasn't my fault, but it remains an interesting story.

woozong

@thegibson It was the early nineties while I was in university and I wanted to try out linux on my dad's 486dx2-50. Dual boot on a separate partition went well, and I even set up a mount point for the dos/win partition. After a few weeks Dad needed more diskspace on the dos/win, so I decided to wipe the linux partition... as a first step I decided `sudo rm -rf /` was a pretty good move.

murph

@woozong @thegibson Sounds like you would have plenty of space immediately after.

murph

@thegibson Early in my Windows admin days. Using group policy editor to remove a policy from the admins group. "Are you sure you want to do this?" Of course, remove the policy from this group. Nope, it removed the group, and all of the accounts within it. All in the admin group, gone, including me. Makes perfect sense that the GP editor would remove the group and all members. Luckily a few people in the network group had the rights to re-create one of us to bootstrap back. Shame on me.

bird

@thegibson completely destroyed production Kafka cluster, more senior Ops engineers managed to fix it. But the very next day I did it again

Trysdyn, Jolteon Aspect

@thegibson I crippled my org's entire infrastructure by writing a script to do user cleanup on every server by checking a centralized source and upon a mismatch, killing that user's processes then removing the user.

My central source did not include root.

I learned about the protections a modern linux box has against some major foot bullets that day, but we still had to reboot every box in our infrastructure because I chopped half the load bearing services off at the knees.

That was my first major mistake in a decision-making IT role and I'm incredibly fortunate my boss didn't make it my last.

It's uncanny because my major task right now at a different org is centralized user management. I think about this mistake often.
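
A sketch of one way a cleanup like this could protect system accounts, assuming local users are available as a name-to-UID mapping. The function name, the protected set and the UID threshold of 1000 are assumptions for illustration (1000 is a common Linux convention for the first regular user, not a universal rule).

```python
PROTECTED_USERS = frozenset({"root"})  # never remove these, whatever the central source says
MIN_REGULAR_UID = 1000  # assumption: accounts below this are system accounts


def users_to_remove(local_users, central_users,
                    protected=PROTECTED_USERS, min_uid=MIN_REGULAR_UID):
    """Return sorted names of local users that are safe candidates for removal.

    `local_users` maps user name to UID; `central_users` is the set of names
    in the central source.
    """
    return sorted(
        name
        for name, uid in local_users.items()
        if name not in central_users
        and name not in protected
        and uid >= min_uid
    )
```

On a real host the mapping could be built from `pwd.getpwall()`. Printing the candidate list and reviewing it before killing or removing anything adds one more chance to notice that root is on it.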

Csepp 🌢

@thegibson Was trying different motherboards in a metal ATX PC case to see which combo worked with the drive docks I wanted to use. Because I was impatient I didn't secure the board to the case and just let it sit on it, without even the standoffs in the way.
Then I turned the machine on, heard lots of beeping. Did a quick hard reset but it no longer booted.

No magic smoke, but that day I learned that metal is in fact a conductor, even when you can't see it touching. 🙃

Csepp 🌢

@thegibson Luckily this was an unused old board that I just had lying around in my room, not anything used by anyone else, or even me.

Still, I feel bad for wasting a working motherboard like that with my carelessness.

Emacsen

@thegibson

rm -rf / *

...as root

...at my first real job

...at NASA

ovelny

@TheGibson Was managing a terrible HR application and got overconfident while applying patches, went for the next one without a snapshot and it failed miserably (it didn't in our test env).

The HR department (hundreds of employees) was left unable to do any work for half a day. We didn't know at the time if it was going to be an easy fix or something that would let us struggle with our backups for *days*.

Been treating prod environments like unexploded bombs since then, which is what they are.

🤖 totally not a bot

@thegibson wired about 10 lightbulb sockets backwards because at the time I thought 110/120v AC was done the same way that 220/240v AC was done, so wouldn't matter which direction I hooked them up, so connected them randomly.

🤖 totally not a bot

@qwazix @thegibson there's a neutral line and a hot line. the hot one bounces and the neutral just sits. 220/240 is done by having two hot lines in opposite phase.

qwazix

@epoch @TheGibson hm, then how come the tester doesn't light up in both leads?

🤖 totally not a bot

@qwazix @thegibson I was talking about how they do 220/240 in the USA, I figure where you're at they do the 110/120 style, but just with a higher voltage.

qwazix

@epoch @TheGibson ah, that makes sense. However we can still wire the light bulbs either way.

🤖 totally not a bot

@qwazix @thegibson I know that a lot of things work in either direction, but for some reason, either the sockets or the specific type of bulb, these didn't light when hooked up backwards. I'm thinking it might be they were LED bulbs that wouldn't work backwards. I haven't tested the bulbs and the sockets independently to figure it out exactly. Guess I could find a lamp that doesn't use a polarized plug and an LED bulb to test it real fast.

Andrew (R.S Admin)

@thegibson We spent probably 20 minutes writing a query to carefully select a subset of production servers upon which to execute a handful of *very* resource intensive commands, and then to disable a bunch of critical services.

We stored that data in a variable.

We referenced a different variable when we ran that command in our production datacenter.

It ran for about 10 minutes before I said "Gee, this is taking a really long time" and then my co-worker went in to a panic and reverted our changes.

Went from multiple million active users to less than 500k in about 5 minutes, back up to multiple millions in less than 3.

Happened so fast that it didn't even trigger an alert.

Parade du Grotesque 💀

@TheGibson

Deleted all shared libs on a production Solaris 10 system.

Was able to rebuild the config using the equivalent of Linux "ldconfig" and rebooted the machine before anyone realized. Solaris ran fine after that.

I also did an "rm -rfv .", not realizing I had changed to ~ instead of a subdir. Lost quite a lot of important files that way. Yes, no backups.

Deckjockey BLU3

@thegibson Hm, can't think of any truly earth-shattering ones off the top of my head. I once enabled auto-update on a prod *nix box, but never went back to clean up old packages. Came in one day to an unhappy environment when the system had downloaded so many kernels, its /boot partition was 100% full and it failed to start up 😅 Thank goodness we had a semi-mature backups system in place. Restored from an image only a handful of hours before with minimal data loss or business impact, and immediately set to work making sure that never happened again.

XorOwl

@nchprgmng @TheGibson ive filled my /boot so many times that expanding it to a gig or two has been baked into my setup process.
It's not impossible to fix by hand but it is a pain in the ass.

Hieronymous Smash

@nchprgmng @TheGibson this was hilarious to stumble on, this exact thing happened in my environment a few months ago (in my defense, I inherited it that way and will fix it I swear)

Simplicity

@thegibson Long ago, I created a website with a webmaster email address on it that forwarded to my boss. Then one day my boss went on vacation, so he set up an autoresponder. Someone mailed the webmaster, it forwarded to the boss, which replied to the webmaster... Which forwarded to the boss. And I brought down the company mail for a while.

Not my worst fail though. My worst was owning the component that brought down healthcare.gov when it launched. Oops.

Mina

@thegibson because of the way I had configured drbd, I once lost at least a week's worth of customer emails.
and at least one (our biggest) customer to a split brain.

anyway, don't put cluster things on only two nodes.

sudo beep

@TheGibson

1.

Phys Ed Teacher: run around the perimeter of the court, passing and bouncing the ball

Me: If i track the painted line in my peripheral vision I can watch only the ball, and stay on the edge!

Steel pole: is on the painted line

Skull, nose, teeth: *crack* great alignment there! 🤕

sudo beep

@TheGibson

2.

I'm gonna leave lots of bad fails out!

sudo beep

@TheGibson

3.

Not defanging email on a test environment. 250K people received my test email. Lesson learned!

mhoye

@thegibson without going into the details, there are plenty of easy ways to learn that “killall” means very different things to different unixes, but I did not choose any of them.

Lord Kusuriya ​:tower:​

@thegibson @nchprgmng One Time I was patching a cluster by hand, got lost in my RDP sessions, shut the host down. Had to call the customer at 3am to go to the DC to turn it back on

Craig Maloney ☕

@TheGibson More than once I have pushed a change only to watch in horror as it did what I said and not what I intended. More than once there have been .01, .02, ... .09 releases of the same release.

Craig Maloney ☕

@TheGibson I played PWEI's "Not Now James, We're Busy" on college radio.

It's got a very non-FCC-approved line in it.

That sailed out onto the airwaves.

I thought i was a goner when the phone rang. It was the station manager. He asked if the song had a "No" through it. It did not.

He then instructed me to mark that track as a "no" and that was the end of it.

Abbie

@thegibson I left an okay job with excellent insurance and benefits for a seemingly better one with weaker benefits but better pay. I showed up for my first day and the equipment they promised didn’t exist. I used my own equipment for the first three years. o_O

Ninja :blobninja: :nsbadgeblue:
@thegibson Early in my career, I worked on a DIY warehouse management system.

The main item barcode scanning UI had a bug: when items were scanned too fast, the scans wouldn't register in the db.

I failed to do adequate onsite testing to uncover the bugs until several months after go-live. At the end, the bugs were fixed but for months the warehouse team had to deal with the chaos to find the many "lost" items and reconcile the picks.
sad hat

@thegibson In my pre-tech life, I worked inventory for a (now defunct) department store.

At some point, some bad Excel calculations (and general inexperience) led to me accidentally ordering millions of dollars worth of washing machines from one of our suppliers. Way more than we could ever sell in even a year. And the supplier refused to take them back.

Months later, when we were visiting one of our warehouses, I saw them: the endless towers of washers representing my mistake, taking up otherwise valuable floor space. My manager made sure that I noticed.

That manager sucked, the company went out of business (not because of me, but I'd like to think I had a small part), and I switched careers to programming. I've been doing far better now.

stn 🇺🇦

@thegibson I wanted to change where the coax line from the cable company entered the house, so I drilled a new hole into the basement. Suddenly, POP!!! The drill pops back. The drill bit was forcibly ejected. It is burning, somehow. The lights in the house are out.

I drilled into the main power line going from the meter to breaker box. I am both incredibly lucky and unbelievably stupid.

YAb0

@thegibson - in 2002/3 ish I was performing a security assessment of a bank when they insisted I perform the internal Nessus scan during business hours. 30 seconds after the scan started their AIX server crashed taking all business with it. Noticed I had accidentally clicked “enable dangerous plugins” in Nessus.

Yop Solo

@thegibson
I moved IO.SYS and MSDOS.SYS out of C:\ on my dad-in-law's XT computer to unclutter the root. After more thinking, I put them back in place, but the computer didn't boot anymore after that.

How *must* a file stay on some specific sector? 🤔

YAb0

@thegibson - While soldering a PAL/NTSC switch onto my Amiga 500 I accidentally sneezed and soldered together several pins of the Agnus chip.

It took a few days, and antihistamines before I worked up the nerve to fix it.

Amazingly everything worked out, and I no longer needed to use software to flip between PAL and NTSC modes, making it a lot easier to play games and demos from anywhere in the world. (at the time it was a big deal)

Azure

@thegibson The first time I sent code upstream as part of a job, it didn't even occur to me to think of whether the variable names might be leaking information about other stuff we were working on. I had the good fortune to have a boss who believes in teachable moments, however.

Deleting all of /usr due to a typo.

Long ago, while I was working on some PHP and SQL while knowing nothing about SQL, someone asked me to add group permissions. So I added a 'group' table containing one prime number per group, and the permissions field on resources or the membership field on a user was just the product of all the relevant groups.
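
For anyone who hasn't met this trick: a minimal sketch of the prime-product scheme described above. The group names and the `GROUP_PRIMES` table are made up for illustration; the post does not say which primes or groups were used.

```python
from math import prod

# Hypothetical group table: one distinct prime per group, as in the post.
GROUP_PRIMES = {"admins": 2, "editors": 3, "viewers": 5}


def encode(groups):
    """Return the product of the primes for the given group names."""
    return prod(GROUP_PRIMES[g] for g in groups)


def is_member(encoded, group):
    """A group is present exactly when its prime divides the product."""
    return encoded % GROUP_PRIMES[group] == 0


user = encode(["admins", "viewers"])   # 2 * 5
print(user)                            # 10
print(is_member(user, "admins"))       # True
print(is_member(user, "editors"))      # False
```

It does work, but the stored number grows quickly as groups are added and every membership check needs a modulo, which is presumably why it counts as a fail; a plain many-to-many join table avoids both problems.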

Probably the worst mistake I've ever made in any domain was when I was having psychiatric issues and insisted, against everyone's advice, that I had to jump back in to the thick of things right then and there to restore my honor instead of stepping back and taking a few months to sort things out.

Chris [list of seasonal emoji]

@TheGibson

I once told a customer, "You don't need to be fascist about it" when he suggested using DRM for development tools.

I once nuked a machine with a script containing `rm -rf $PARENT/$CHILD` where the variables were unset. (The machine was running Windows 95 and MKS tools. I killed the script before it wiped the machine but it had taken out the registry by that point.)
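
A rough sketch of why that command is so dangerous, written in Python rather than shell. An unset shell variable expands to an empty string, so `$PARENT/$CHILD` collapses to `/`. The function names here are hypothetical, and the guard is only one possible way to refuse empty values.

```python
import os


def build_target(env=os.environ):
    """Join PARENT and CHILD the way the shell would expand "$PARENT/$CHILD"."""
    return env.get("PARENT", "") + "/" + env.get("CHILD", "")


def checked_target(env=os.environ):
    """Same join, but refuse to continue if either variable is unset or empty."""
    for name in ("PARENT", "CHILD"):
        if not env.get(name):
            raise ValueError(f"{name} is unset or empty; refusing to build a path")
    return env["PARENT"] + "/" + env["CHILD"]


print(build_target({}))                                          # /
print(checked_target({"PARENT": "/tmp/build", "CHILD": "out"}))  # /tmp/build/out
```

In a POSIX shell the same kind of guard can be written as `${PARENT:?}`, which makes the script stop with an error instead of expanding to nothing.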

paul

@thegibson it's very easy to fail in music tech. Admittedly, the stakes aren't as big compared to other fields. But they are still failures.

I was on a team that developed one of the first musical instruments for the Leap Motion. It was in collaboration with the music artist BT, and they had a ton of promotion for it. Thing collapsed under its own weight. It was terribly built. It never worked.

Speaking of things not working, there was the time I tried to use an OSC controlled instrument I built using WiFi on stage as an opening act featuring Nona Hendryx. It didn't work at all, and I had to stand onstage pretending to perform like an idiot for 20 minutes. I've never felt so helpless and humiliated onstage as a musician. I've sworn off OSC entirely since then.

Then there was the music tech startup I co-founded. It's truly amazing how little got accomplished in such a long period of time...

mhoye

@thegibson Oh, it's been a while since I've hit this one, but that thing where if you type "cp *" accidentally in a directory with only two files in it, and shell expansion says "yes obviously you mean just overwrite that file with a copy of this one" is always chef's-kiss delightful.
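
A small sketch of what happens there, using Python's `glob` as a stand-in for shell expansion. The file names are invented, and `expand_cp_star` is a hypothetical helper that only builds the argument list; it never runs `cp`.

```python
import glob
import os
import tempfile


def expand_cp_star(names):
    """Create the named files in a scratch directory and expand `cp *` there."""
    with tempfile.TemporaryDirectory() as workdir:
        for name in names:
            with open(os.path.join(workdir, name), "w"):
                pass
        # The shell replaces "*" with the sorted matching names before cp runs.
        matches = glob.glob(os.path.join(workdir, "*"))
        return ["cp"] + sorted(os.path.basename(p) for p in matches)


print(expand_cp_star(["report.txt", "notes.txt"]))
# ['cp', 'notes.txt', 'report.txt']
```

With exactly two names, `cp` sees a source and a destination, so the second file is silently overwritten with the first.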

Stefan Sperling

@thegibson I was once taken to a costume party along with a bunch of friends. The host was dressed like a pirate, including differently shaded irises in their eyes which appeared like an intentional effect with contact lenses as part of their costume and I thought it looked amazing and cool. So when we left I said something like "I really like your costume and what you've done to your eyes" (very poor choice of phrasing on my part). The host just looked at me and didn't say anything. On the way back I was asked by one of my friends whether I knew what had happened. There had been a self-inflicted accident involving an air gun some months before which had required eye surgery to fix things up... :flan_despair:

Wilhelm Fitzpatrick

@thegibson okay, I can now real-time this sucker.

I’m the technical lead on a fairly significant app project that has been going for two and a half years now. We release to the public on Monday (less than 36 hours from now).

This morning, I get an email that we’ve managed to miss that analytics aren’t hooked up on the iOS build of the app that is staged to go out the door. Embarrassing, but we’ll ship a post launch update fairly quickly, and nobody is inconvenienced but marketing.

While putting in the fix for _that_ we stumble over the fact even though folks have been frantically doing all sorts of testing on this app for weeks on end, _nobody_ has thought to test new account creation recently on the iOS build, and it is _broken_ — did I mention this is staged to go out the door in 36 hours?

We’ve now got a fix for this, but of course there is no telling how long it’ll take Apple review to get this through (and certainly not til sometime on Monday AT BEST)

It’s gonna be a spicy release day, and as the technical leader, I’m ultimately responsible that this is happening.

The Doctor

@thegibson Accidentally crashed an IBM s/390 mainframe with nmap my first week on the job. Took down nearly the entire country infrastructure for a week. Still don't know why my boss didn't fire me. Or defenestrate me.

Killed a mail server by commenting out the tcpserver line in /etc/inetd.conf and took mail down for two days.

Misconfigured mod_proxy on a production box, half of China decided to download porn through it. Found out the hard way that the campus firewall had not, in fact, been installed a year previously. Even though I had a paper trail saying this was bad, I got punished for it because I was only a contractor.

Accidentally used in-house deployment software to uninstall the in-house deployment software in a production environment.

Jeff

@thegibson

I received approval from our CEO to proceed with building a prototype of some testing equipment that I had designed and presented.

I had assured him that, although it was complex, it would be well within my skillset to complete on time.

I designed and etched the required PCBs and completed the assembly on time and it came time to power it on for the first time.

There was a huge boom as a large capacitor exploded. I had soldered it in with the wrong polarity. 😅

Stefan Midjich ꙮ҄

@thegibson I once had my personal laptop hacked and my "nickname" posted on the hacker's wall of shame. All because I re-used a password from one server on an irc server with a weakly encrypted db. The hacker was so close to thousands of client records for my business at the time, and even credit card info. They should just have used ls -la so I guess this is a double fail.

Rob

@thegibson Today I found out I had a mistake in a Terraform config and it's going to cost several thousand dollars in AWS fees as a result. 😔

swagg boi

@thegibson I told the editors of the 2019 film adaptation of Cats, 'No, leave the CGI buttholes in there... That's what the people want you fools!!'

Mark Shane Hayden

@thegibson I had a bug in a PLC program that caused a blowdown event during the startup of a natural gas plant once. That was fun.

Anyways I bet there is a Canuck Schmuck or a few somewhere at Rogers Communications experiencing some pretty Epic Fails today in comparison.

Polychrome :clockworkheart:
@thegibson if you want to see a fail put me at a party and wait for ~39 seconds.
🕷🐜:lattentacle:

@thegibson spent years remembering one piece of case law as standing for the exact opposite of what it actually was, only corrected after finally running into a situation where the other side had set up their entire system based on the opposite (correct) interpretation and going with my version would have meant a complete shitshow for them

Buttered Jorts

@TheGibson I would simply choose not to fail. More seriously, I can’t remember any, which is frustrating because I know they happened.

Ellie 🐍☃️ :blobsnowedin:​

@thegibson I spent a not-insubstantial amount of time this evening producing new PRs when I was supposed to be off the clock, because my boss and I had hotfixed something to production to beat a deadline but forgot to merge the updated file to dev, and now the older version that we had merged to dev had made it up the pipeline and was causing merge conflicts trying to get it into production...

Guy Montag

@TheGibson I spend half my time at work making more knowledgeable programmers help me with problems that I could have figured out myself.

I spend the other half driving myself crazy trying to solve a problem myself before asking the more knowledgeable programmers, who explain that it's a simple thing that I couldn't have reasonably known about beforehand.

mhr :donor:

@TheGibson i teach developers security and try to help them with their problems.

during this, i always realize what a bad dev i was. and i made “asking dumb questions” a thing cuz i’m doin it all the time 🙃

William, the 🍵 drinker™

@thegibson i accidentally left a character off an SQL query that made its way into production. my fix was a 1 character change.

mrpieceofwork

@thegibson does failing to survive in this capitalistic hellhole count? asking for a friend

Solace-10

@thegibson Years ago, shortly after I had joined a company I committed about 150k files into our version control system, SVN. Just 14gb, small potatoes in my previous company. And really if the commit fails, these operations are atomic.

...

Are they really?

Commit fails. Workspace gets corrupted. Any operation against the server returns garbage errors.

ANY operation. Not just mine. I somehow corrupted the corporate database. Server had to be taken down for the day as IT had to restore it from the daily backups.

device ⛧

@TheGibson yesterday I was up late to help develop a custom Google App Script for a customer, and to test if it works as expected I set it up to execute every minute. I'm battling a cold right now, so went to sleep and forgot about this almost in an instant.

Today I woke up to hundreds of e-mails about the failures of exceeding the quota 🥲 and also running out of space on my google drive

Vortec Space

@TheGibson Kept my backups mounted on `/home/user/backups` so I could have a fully automatic backup script.

Accidentally ran `rm -rf /home/user`. Deleting all my files, and the backups too.

New rule written in blood: The backup drive is disconnected when I'm not doing backups. I figure this protects me from malware, too?

Annika Backstrom

@TheGibson I separated a bunch of lvm commands with `;` instead of `&&`. One of the commands failed and I destroyed a whole partition of stuff.
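
A hedged illustration of the difference, in Python rather than shell. The `step` helper is a hypothetical stand-in for a real command: it just spawns a process that exits with a chosen status. `run_chained` behaves like `a && b && c`, and `run_sequential` behaves like `a ; b ; c`.

```python
import subprocess
import sys


def step(exit_code):
    """Build a command that simply exits with the given status code."""
    return [sys.executable, "-c", f"import sys; sys.exit({exit_code})"]


def run_chained(steps):
    """Run steps in order like `a && b && c`; return how many steps ran."""
    ran = 0
    for cmd in steps:
        ran += 1
        if subprocess.run(cmd).returncode != 0:
            break  # a failure stops the chain, so later steps never start
    return ran


def run_sequential(steps):
    """Run steps in order like `a ; b ; c`; every step runs regardless."""
    for cmd in steps:
        subprocess.run(cmd)
    return len(steps)


steps = [step(0), step(1), step(0)]
print(run_chained(steps))     # 2
print(run_sequential(steps))  # 3
```

With `;`, the third step still runs after the second one fails, which is presumably how a later destructive command got to run on top of a failed earlier one.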

mrpieceofwork

@thegibson
Ohhh... that kind of fail! Well, tied to the aforementioned failure to survive, I once, FINALLY, cleaned out my email acct by downloading all 12 years of it into Mac's mail app, on the Mac Pro I had for abt 5 years, that is, until I had it stolen from me in a storage auction (no thanks to my ex nor my bff, neither allowing me to bring it into either house I should have insisted fml)

Keen

@TheGibson I cluelessly mixed all of my different modular PSU cables together in a big box o' cables. I used one to plug in 5x4TB SATA drives, but it had the wrong PSU side pinout. Didn't lose any data and PSU protected the system, but I did lose the magic smoke.

XorOwl 🎃

@thegibson set a cron to * 12 * instead of 0 12 * and sent an email to all my clients 60 times in an hour
OTL
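
To see where the 60 comes from, here is a tiny sketch that counts how often a job fires per day for a given minute and hour field. It is a deliberate simplification: `field_matches` only understands `*` and plain integers, not the ranges, lists or steps that real cron accepts.

```python
def field_matches(field, value):
    """Match one cron field; only "*" and plain integers are handled here."""
    return field == "*" or int(field) == value


def runs_per_day(minute_field, hour_field):
    """Count the minutes in one day at which a job with these fields fires."""
    return sum(
        1
        for hour in range(24)
        for minute in range(60)
        if field_matches(minute_field, minute) and field_matches(hour_field, hour)
    )


print(runs_per_day("*", "12"))  # 60
print(runs_per_day("0", "12"))  # 1
```

A `*` in the minute field means "every minute of the matching hour", so the job runs at 12:00, 12:01, and so on through 12:59.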

Jeremy Worrells

@TheGibson Reading through the replies brought a couple of my own to mind:
I wrote a script on a Sun workstation in our lab that would play a cuckoo sound on the hour. Something was off in my logic and I came in late the next day to the box incessantly cuckooing and my coworkers very irritated.
Even earlier in my life, my boss wanted some free space on the main work computer. This was 1992/1993 timeframe. Turns out that Windows really dislikes it when you delete all of System32.

Nils 'Kojote' Hitze

@thegibson I once double booked 200+ credit cards of our customers - that was after I accidentally deleted all stored "valid until" dates from our customers credit cards. Magic
