Post your fails.
Pwn your imposter.
195 comments
@thegibson Failed so far at being a well-adjusted-to-society "normal" adult with "normal" goals and expectations in life. Not sure that's such a bad thing though.
@TheGibson I knock out our SharePoint farm on a fairly regular basis. 90% of the time it’s because I forgot to rate limit a script and just told it to move/delete 10k files in the space of a few seconds. It will happily accept the requests and start processing in the background until it can’t any more. Then it goes splat…
@yojimbo @TheGibson hah! It's been a weird day of unexpected memories. This one was pretty memorable, for sure. I felt very bad for you, as I think you did take a lot of precautions... and I felt very bad for the customer. We continued working with them until I sold the business quite a few years later. I don't think it was discussed much subsequently, but I don't think it affected them too badly. :) I hope it hasn't troubled you overmuch since.
@TheGibson one of my first projects at Amazon (in 2000) was to put a wishlist button everywhere there was an add to cart button. I reversed the “add to cart” v “preorder” logic on music search, probably cost ~$100k in sales over 24hr. More than twice my annual salary at the time.
@tithonium @thegibson I too have caused a NA Order Drop! Me: *clicks button*
@TheGibson Building a datacenter network on Juniper Virtual Chassis, after initial testing seemed promising. It saved us a ton of work at a time when network automation was not yet a thing, and brought us into the 10G (EX) and 40G (QFX) era... But after half an hour of downtime during one of the "non-stop software upgrade" procedures (that was the final straw, there were problems every time before), no one dared touch the things for another upgrade for years until the platform was replaced.
@TheGibson I got a job offer after spending some (too much) time on writing a Perl script that did traffic accounting from TIS FWTK log files, 1995-ish.
It worked quite well, until one day it didn't, and all the output was garbage. I didn't know about integer overflows. Luckily for me, Math::BigInt showed up somewhere around the time that happened.
@thegibson updated around 4 million records in production with bad data. We have robust rollback mechanisms, so it only took about a day to fix.
@TheGibson In the first few months at my first job (at an online payment provider) I accidentally sent around 110000€ to the wrong people. I think we got like 107000€ of it back.
@thegibson i have 2
@thegibson First day on a gig, I broke some libraries that a bit over a hundred people were depending on because I didn’t understand their dev process. Then went to orientation for the next two hours and was unreachable while they tried to figure out what just happened :). Unbroke it in five minutes when I found out, at least.
@TheGibson fucked up a RAID and lost a week's worth of accounting data for three companies.
@xorowl @thegibson the sign of a good company. Good decisions come from experience. Experience comes from bad decisions. You just gained a bit of experience. :-)
source local.env "Huh... it's taking longer than usual" export "Well, that's probably not great..."
@thegibson i directly edited the database on a production server and gave the client an unexpected april fool’s
@TheGibson Early in my writing career at a moderately popular Apple news site (TUAW) I wrote about what I thought was a silly little piece of shareware that opened the CD drive on Mac towers. Turns out it also ran a script that moved ALL your calendar events back by two weeks. Readers were not happy. The person who made it never intended for it to get picked up in the news. They gave me a script readers could use to fix it all.
@chartier @thegibson
Used to check TUAW daily when I had my first Mac. (Aluminum PowerBook G4) :blobthumbsup:
@chartier I spent way too much time reading TUAW back in the day. Really appreciated all y'all's work.
Learned the hard way that `killall` on AIX is not the same as `killall` on Linux. I use `pkill` now instead...
@thegibson Brought down a school's wireless infrastructure for 2 days because I did not yet understand VLANs. Did not leave campus until every switchport was manually repatched, labeled, and tagged/trunked.
@thegibson One time, at my $dayJob, I completely wiped out a website without backing it up first. Luckily, the host does a backup every 24 hours so we were able to get it back up in just a couple hours. Why didn't I back it up before I did what I did to wipe it? I don't know. Overconfidence in my abilities I guess.
@thegibson Nearly 20y ago, I was the new guy at Openwall/GNU/Linux. Wanted to commit my first packages and documentation, however, my CVS experience was quite low. I ended up creating directories in the main repo and committing crap that the mighty Solar Designer then had to revert. He even explained CVS basics to me. I was so embarrassed...
@thegibson Maybe too esoteric, but I once spent weeks (maybe a month) trying to track down a bug in my Verilog code for an FPGA on a new board bringup. I was stressed, the team was stressed. I was rewriting entire modules and theorizing strange metastability bugs. In the end, the problem was that I had neglected to check a particular box in the configuration for the project so that the generated bitstream would kick the FPGA out of its initialization sequence after it loaded the bitstream. :/
@thegibson I also once walked across a carpeted lab, reached out my hand to push a reset button on an expensive electronic board I was working on, and a nice static discharge arced from my finger before I could hit the button. ALL the smoke started to come out of that board.
😆
@thegibson not my worst one, but the funniest one: Fucked up the computers of a thousand+ of our lanparty guests, since we set DHCP option 46 to only use WINS and not broadcast for NetBIOS name resolution (p-node). Windows set that flag permanently and people were not able to find other computers back at home anymore. The solution was of course editing something in the registry by hand to restore the original behavior. Took us a week to figure that out, people were pretty pissed 😅
Not tech related: Roommate/homeowner wants to tile his countertops. We measure everything carefully. Note down the sizes of each section in inches. Add them all up. OK, this tile is sold in square feet. No problem, just divide by 12... We had a LOT of extra tile.
@thegibson IBM cash registers of old used a version of Token Ring called Store Loop. I was replacing an IBM cash register base on Black Friday and instead of unplugging it from the IBM data connector on the wall (which would self-short to prevent opening the loop) I unplugged it from the back of the register. Every cash register in the store immediately stopped. Hard. All the cashiers went "Hey, did your register stop?" I said "Yes, not sure what happened." as I hastily plugged it back in.
@gedvondur You might be interested in the story in this post: https://www.solipsys.co.uk/new/GracefulDegradation.html?ve13mn
@ColinTheMathmo @thegibson Huh, interesting stuff! I wonder if a backpressure mechanism would work too. Fibre Channel had a backpressure mechanism that would say "stop, I'm full" when the buffers hit max, because FC was a lossless protocol.
@gedvondur The problem in this case was that the comms system couldn't handle the traffic being put in the front end. So somehow either the processing capacity needed to be increased -- which wasn't possible -- or the input needed to be reduced. Slowing it gently across the entire store worked well and was implemented elegantly.
@gedvondur The underlying user-facing problem was that the flow of data through the system had to be continuous and couldn't be seen to stutter at any point. So making the tills run a bit slower seemed the only thing to do.
@ColinTheMathmo @gedvondur @TheGibson that made me think of garbage collection, and the similar trade-offs in play there
@thegibson DNS on the wrong network interface. I think it took 2 days for the university to get back online...
@TheGibson another one I thought of. I dd'd my home partition on my computer and didn't have up to date backups for it.
@thegibson I botched DNA extraction on samples that had been hand-retrieved from an endangered species.
@thegibson me? right now today is a mistake. i made a poll post for peeps to decide what i should do in my free time... included coding and playing games...
@TheGibson not 100% on me, but I ran the script without fully understanding someone used a fixed string length instead of a delimiter on a big DB insert, and wound up garbling an entire database full of medical record numbers once the character strings switched from 7 to 8 in length. This is why we no longer let regional field people wing it on customer requests to “correct” database contents with their own scripts.
@TheGibson At one of the last Moscone Google I/O events, I was standing around with a few other journos. A Google PR person introduces someone to the group and I start chatting and asking some very small talk questions. Everyone else starts acting very weird. It's clear this dude is someone important and I haven't recognized him. I shut up. Later, my colleague filled me in on who he was: Larry Page.
@TheGibson Names/faces are not my strong suit. I am constantly re-introducing myself to people. 😬
@thegibson spent til 2am on phone with vendor for a 2.5PB HPC storage system, reading kernel panics and module code. Clients and servers had gotten to a state where the metadata servers just boot-looped, blocking all work.
Eventually found complex workaround to use til bug was fixed by vendor. Got nice amount of recognition for my efforts and skills... Next day approved & applied a change that obviously & trivially brought it down again, costing 100s of scientists 10s of K CPU-hours of work.
@thegibson foolishly I stayed in that HPC job for a while longer. Somehow (despite general lack of networking background) was made responsible for overseeing a wholesale re-wire of the InfiniBand network for expansion of a cluster. Rats' nests of cables... my oversight failed to catch a label with '6' being read as '9' by a colleague. Didn't have a downtime impact. But took literally weeks for someone else to find and correct the performance issues that resulted. Unmanaged IB switches suck.
@thegibson was helping a business team out just because I ran the MFT server... they asked me to send this file, just a one off, and I put it in the outbound queue manually, bypassing some technological controls that I had implemented myself (i.e. file name had to have the client name (that got stripped out) in it, in the same client folder, etc).
@thegibson The next evening a client asked what this unexpected file was. I investigated that night. I didn't sleep well. I didn't eat dinner. I didn't eat breakfast. I camped by my laptop in the morning waiting for my sr mgr to log in. Then I messaged him "We need to talk, now. It's important".
@thegibson Obviously to the reader, I dropped it into the wrong queue. It's work I wasn't supposed to be doing, just helping out coworkers at a deadline, but it technically became a data breach. The executive partnership incident response team had to be notified and the client had to be notified. We were lucky, it wasn't a PUBLIC breach, and the data was... "hard" to make sense of without a few years of context.
@thegibson I have so many I can't remember them. I'm certain that I've done the "rm -rf / path/that/I/meant/to/delete" classic.
I've cost users time and email. I've lost data to bad RAID configs or mis-remembering which was target drive and which was source drive. I've had over 35 years to make mistakes. The thing I hope is that I don't make the same one twice.
@thegibson there's been too many over the many years but the one i'm still currently cringing about is screwing up a formula in a spreadsheet and thus assigning 200 undergraduates incorrect grades. had to get the registrar's office to go into the database and delete them all
Your chart is ready, and can be found here: https://www.solipsys.co.uk/Chartodon/108295593183482710.svg Things may have changed since I started compiling that, and some things may have been inaccessible. The chart will eventually be deleted, so if you'd like to keep it, make sure you download a copy.
@thegibson there isn't a character limit high enough in the world to capture my failures. Here's a crucible moment though, from my first or second year in the business: Working at a regional ISP in the 90s. I zeroed out /etc/passwd on the mail server. This predated virtual users so every customer had a system account. Thought I had a) lost my job and b) killed the business. Boss said to me, "Well, now you're a real sysadmin. Go get the install CDs."
@steffen @thegibson Without the ability to log into the box, restoring the backup was impossible. I don't recall now if we yanked the drive, mounted it and manually repaired /etc/passwd or did an in-place OS install and then restored a backup or what. 25+ years ago :D
@steffen @thegibson although I do recall we had backups on tape because I remember having to drag the SCSI tape drive out of storage from time to time
@kitsune @thegibson If you haven't tested backups,
@TheGibson Porting the NAG serial Fortran library to the Parsys Supernode, I had a directory with a lot of files. $ "ls -al" ... It took a while, then a while longer, then people started to raise their heads and wonder what was happening.
Turned out that "ls" was compiled with a fixed size table that I'd just overrun. By a lot. And by coincidence, smashed the stack and overwrote some important OS code. It wiped the main disk, and everyone's work. After that when anyone saw me coming ...
@TheGibson ... they would back up their work across the network[*] and log off. [*] The networking could only transfer files, so remote logins weren't possible. That had the benefit that you could see everyone else who was using the machine. I always had the machine to myself after that, and I paid others the courtesy of letting people know when I'd be working on it so we could coordinate timings. It really wasn't my fault, but it remains an interesting story.
@thegibson It was the early nineties while I was in university and I wanted to try out Linux on my dad's 486dx2-50. Dual boot on a separate partition went well, and I even set up a mount point for the dos/win partition. After a few weeks Dad needed more disk space on the dos/win, so I decided to wipe the Linux partition... as a first step I decided `sudo rm -rf /` was a pretty good move.
@thegibson Early in my Windows admin days. Using group policy editor to remove a policy from the admins group. "Are you sure you want to do this?" Of course, remove the policy from this group. Nope, it removed the group, and all of the accounts within it. All in the admin group, gone, including me. Makes perfect sense that the GP editor would remove the group and all members. Luckily a few people in the network group had the rights to re-create one of us to bootstrap back. Shame on me.
@thegibson completely destroyed production Kafka cluster, more senior Ops engineers managed to fix it. But the very next day I did it again
@thegibson Luckily this was an unused old board that I just had lying around in my room, not anything used by anyone else, or even me. Still, I feel bad for wasting a working motherboard like that with my carelessness.
@emacsen@emacsen.net @thegibson@hackers.town oh so you're the reason we have --no-preserve-root 🤣
@TheGibson Was managing a terrible HR application and got overconfident while applying patches, went for the next one without a snapshot and it failed miserably (it didn't in our test env). The HR department (hundreds of employees) was left unable to do any work for half a day. We didn't know at the time if it was going to be an easy fix or something that would let us struggle with our backups for *days*. Been treating prod environments like unexploded bombs since then, which is what they are.
@thegibson wired about 10 lightbulb sockets backwards because at the time I thought 110/120v AC was done the same way that 220/240v AC was done, so wouldn't matter which direction I hooked them up, so connected them randomly.
@qwazix @thegibson there's a neutral line and a hot line. the hot one bounces and the neutral just sits. 220/240 is done by having two hot lines in opposite phase.
@qwazix @thegibson I was talking about how they do 220/240 in the USA, I figure where you're at they do the 110/120 style, but just with a higher voltage.
@epoch @TheGibson ah, that makes sense. However we can still wire the light bulbs either way.
@qwazix @thegibson I know that a lot of things work in either direction, but for some reason, either the sockets or the specific type of bulb, these didn't light when hooked up backwards. I'm thinking it might be they were LED bulbs that wouldn't work backwards. I haven't tested the bulbs and the sockets independently to figure it out exactly. Guess I could find a lamp that doesn't use a polarized plug and an LED bulb to test it real fast.
Deleted all shared libs on a production Solaris 10 system. Was able to rebuild the config using the equivalent of Linux "ldconfig" and rebooted the machine before anyone realized. Solaris ran fine after that. I also did an "rm -rfv .", not realizing I had changed to ~ instead of a subdir.
Lost quite a lot of important files that way. Yes, no backups.
@nchprgmng @TheGibson I've filled my /boot so many times that expanding it to a gig or two has been baked into my setup process.
@nchprgmng @TheGibson this was hilarious to stumble on, this exact thing happened in my environment a few months ago (in my defense, I inherited it that way and will fix it I swear)
@thegibson Long ago, I created a website with a webmaster email address on it that forwarded to my boss. Then one day my boss went on vacation, so he set up an autoresponder. Someone mailed the webmaster, it forwarded to the boss, which replied to the webmaster... Which forwarded to the boss. And I brought down the company mail for a while. Not my worst fail though. My worst was owning the component that brought down healthcare.gov when it launched. Oops.
@thegibson because of the way I had configured DRBD, I once lost at least a week's worth of customer emails. anyway, don't put cluster things on only two nodes.
1. Phys Ed Teacher: run around the perimeter of the court, passing and bouncing the ball Me: If i track the painted line in my peripheral vision I can watch only the ball, and stay on the edge! Steel pole: is on the painted line Skull, nose, teeth: *crack* great alignment there! 🤕
3. Not defanging email on a test environment. 250K people received my test email. Lesson learned!
@thegibson without going into the details, there are plenty of easy ways to learn that “killall” means very different things to different unixes, but I did not choose any of them.
@thegibson @nchprgmng One time I was patching a cluster by hand, got lost in my RDP sessions, shut the host down. Had to call the customer at 3am to go to the DC to turn it back on
@TheGibson More than once I have pushed a change only to watch in horror as it did what I said and not what I intended. More than once there have been .01, .02, ... .09 releases of the same release.
@TheGibson I played PWEI's "Not Now James, We're Busy" on college radio. It's got a very non-FCC-approved line in it. That sailed out onto the airwaves. I thought I was a goner when the phone rang. It was the station manager. He asked if the song had a "No" through it. It did not. He then instructed me to mark that track as a "no" and that was the end of it.
@thegibson I left an okay job with excellent insurance and benefits for a seemingly better one with weaker benefits but better pay. I showed up for my first day and the equipment they promised didn’t exist. I used my own equipment for the first three years. o_O
@thegibson I wanted to change where the coax line from the cable company entered the house, so I drilled a new hole into the basement. Suddenly, POP!!! The drill pops back. The drill bit was forcibly ejected. It is burning, somehow. The lights in the house are out. I drilled into the main power line going from the meter to breaker box. I am both incredibly lucky and unbelievably stupid.
@thegibson - in 2002/3 ish I was performing a security assessment of a bank when they insisted I perform the internal Nessus scan during business hours. 30 seconds after the scan started their AIX server crashed, taking all business with it. Noticed I had accidentally clicked “enable dangerous plugins” in Nessus.
@thegibson How a file *must* stay on some specific sector? 🤔
@thegibson - While soldering a PAL/NTSC switch onto my Amiga 500 I accidentally sneezed and soldered together several pins of the Agnus chip. It took a few days, and antihistamines, before I worked up the nerve to fix it. Amazingly everything worked out, and I no longer needed to use software to flip between PAL and NTSC modes, making it a lot easier to play games and demos from anywhere in the world. (at the time it was a big deal)
I once told a customer, "You don't need to be fascist about it" when he suggested using DRM for development tools.
I once nuked a machine with a script containing `rm -rf $PARENT/$CHILD` where the variables were unset. (The machine was running Windows 95 and MKS tools. I killed the script before it wiped the machine but it had taken out the registry by that point.)
@thegibson Oh, it's been a while since I've hit this one, but that thing where if you type "cp *" accidentally in a directory with only two files in it, and shell expansion says "yes obviously you mean just overwrite that file with a copy of this one" is always chef's-kiss delightful.
I received approval from our CEO to proceed with building a prototype of some testing equipment that I had designed and presented. I had assured him that, although it was complex, it would be well within my skillset to complete on time. I designed and etched the required PCBs and completed the assembly on time and it came time to power it on for the first time. There was a huge boom as a large capacitor exploded. I had soldered it in with the wrong polarity. 😅
@thegibson I once had my personal laptop hacked and my "nickname" posted on the hacker's wall of shame. All because I re-used a password from one server on an irc server with a weakly encrypted db. The hacker was so close to thousands of client records for my business at the time, and even credit card info. They should just have used ls -la so I guess this is a double fail.
@thegibson Today I found out I had a mistake in a Terraform config and it's going to cost several thousand dollars in AWS fees as a result. 😔
@thegibson I told the editors of the 2019 film adaptation of Cats, 'No, leave the CGI buttholes in there... That's what the people want you fools!!'
@thegibson I had a bug in a PLC program that caused a blowdown event during the startup of a natural gas plant once. That was fun. Anyways I bet there is a Canuck Schmuck or a few somewhere at Rogers Communications experiencing some pretty Epic Fails today in comparison.
@thegibson if you want to see a fail put me at a party and wait for ~39 seconds.
@thegibson spent years remembering one piece of case law as standing for the exact opposite of what it actually was, only corrected after finally running into a situation where the other side had set up their entire system based on the opposite (correct) interpretation and going with my version would have meant a complete shitshow for them
@TheGibson I would simply choose not to fail. More seriously, I can’t remember any, which is frustrating because I know they happened.
@thegibson I spent a not-insubstantial amount of time this evening producing new PRs when I was supposed to be off the clock, because my boss and I had hotfixed something to production to beat a deadline but forgot to merge the updated file to dev, and now the older version that we had merged to dev had made it up the pipeline and was causing merge conflicts trying to get it into production...
@TheGibson I spend half my time at work making more knowledgeable programmers help me with problems that I could have figured out myself. I spend the other half driving myself crazy trying to solve a problem myself before asking the more knowledgeable programmers, who explain that it's a simple thing that I couldn't have reasonably known about beforehand.
@TheGibson i teach developers security and try to help them with their problems. during this, i always realize what a bad dev i was. and i made “asking dumb questions” a thing cuz i’m doin it all the time 🙃
@thegibson i accidentally left a character off an SQL query that made its way into production. my fix was a 1 character change.
@thegibson does failing to survive in this capitalistic hellhole count? asking for a friend
@TheGibson yesterday I was up late to help develop a custom Google Apps Script for a customer, and to test if it works as expected I set it up to execute every minute. I'm battling a cold right now, so went to sleep and forgot about this almost in an instant.
Today I woke up to hundreds of e-mails about the failures of exceeding the quota 🥲 and also running out of space on my Google Drive
@TheGibson Kept my backups mounted on `/home/user/backups` so I could have a fully automatic backup script. Accidentally ran `rm -rf /home/user`. Deleting all my files, and the backups too. New rule written in blood: The backup drive is disconnected when I'm not doing backups. I figure this protects me from malware, too?
@TheGibson I separated a bunch of lvm commands with `;` instead of `&&`. One of the commands failed and I destroyed a whole partition of stuff.
@TheGibson I cluelessly mixed all of my different modular PSU cables together in a big box o' cables. I used one to plug in 5x4TB SATA drives, but it had the wrong PSU side pinout. Didn't lose any data and PSU protected the system, but I did lose the magic smoke.
@thegibson set a cron to * 12 * instead of 0 12 * and sent an email to all my clients 60 times in an hour
@TheGibson Reading through the replies brought a couple of my own to mind:
@thegibson I once double booked 200+ credit cards of our customers - that was after I accidentally deleted all stored "valid until" dates from our customers' credit cards. Magic |
@thegibson@hackers.town have you not SEEN the crap i post??
i post it because it's a learning experience -- i just hope someone else is able to learn along with me