Why HN was down

1049 points by pg 13 years ago

Hacker News was down all last night. The problem was not due to the new server. In fact the cause was embarrassingly stupid.

On a comment thread, a new user had posted some replies as siblings instead of children. I posted a comment explaining how HN worked. But then I decided to just fix it for him by doing some surgery in the repl. Unfortunately I used the wrong id for one of the comments and created a loop in the comment tree; I caused an item to be its own grandchild. After which, when anyone tried to view the thread, the server would try to generate an infinitely long page. The story in question was on the frontpage, so this happened a lot.
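The failure mode generalizes beyond HN: any hand edit of parent pointers can close a loop that naive rendering follows forever. A minimal sketch in Python (the ids and structure are invented, not HN's actual Arc code):

```python
# Hypothetical comment store: each item keeps a pointer to its parent.
items = {
    1: {"parent": None},  # the story
    2: {"parent": 1},     # a top-level comment
    3: {"parent": 2},     # a reply to it
}

# The botched surgery: a wrong id re-parents item 2 under its own child,
# so the chain becomes 2 -> 3 -> 2 -> 3 -> ... (2 is its own grandchild).
items[2]["parent"] = 3

def in_cycle(item_id):
    """Walk up the parent chain, remembering ids already seen."""
    seen = set()
    while item_id is not None:
        if item_id in seen:
            return True
        seen.add(item_id)
        item_id = items[item_id]["parent"]
    return False

print(in_cycle(2))  # True: any page that renders this subtree never finishes
```

A quick walk up the parent chain like this, run after any manual surgery, would have flagged the loop before the thread was viewed again.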

For some reason I didn't check the comments after the surgery to see if they were in the right place. I must have been distracted by something. So I didn't notice anything was wrong till a bit later when the server seemed to be swamped.

When I tailed the logs to see what was going on, the pattern looked a lot like what happens when HN runs short of memory and starts GCing too much. Whether it was that or something else, such problems can usually be fixed by restarting HN. So that's what I did. But first, since I had been writing code that day, I pushed the latest version to the server. As long as I was going to have to restart HN, I might as well get a fresh version.

After I restarted HN, the problem was still there. So I guessed the problem must be due to something in the code I'd written that day, and tried reverting to the previous version, and restarting the server again. But the problem was still there. Then we (because by this point I'd managed to get hold of Nick Sivo, YC's hacker in residence) tried reverting to the version of HN that was on the old server, and that didn't work either. We knew that code had worked fine, so we figured the problem must be with the new server. So we tried to switch back to the old server. I don't know if Nick succeeded, because in the middle of this I gave up and went to bed.

When I woke up this morning, Rtm had HN running on the new server. The bad thread was still there, but it had been pushed off the frontpage by newer stuff. So HN as a whole wasn't dying, but there were still signs something was amiss, e.g. that /threads?id=pg didn't work, because of the comment I made on the thread with the loop in it.

Eventually Rtm noticed that the problem seemed to be related to a certain item id. When I looked at the item on disk I realized what must have happened.

So I did some more surgery in the repl, this time more carefully, and everything seems fine now.

Sorry about that.

DanielBMarkham 13 years ago

Amazing that such a large percentage of debugging involves determining exactly what you are debugging. The definition of the problem, many times, is the solution.

Might be a good time to mention Rubber Duck Debugging: http://en.wikipedia.org/wiki/Rubber_duck_debugging

  • furyofantares 13 years ago

    There is a line from Futurama that perfectly applies to a lot of debugging.

    Farnsworth: My God, is it really possible?

    Fry: It must be possible, it's happening.

    Fry: By the way, what's happening?

    • banachtarski 13 years ago

      Is this the forward time machine episode??

      I love futurama more than any man could love any tv show.

      • short_circut 13 years ago

        I feel like your name reflects that fact. It seems to be a reference to the Banach-Tarski duplashrinker

        • ztravis 13 years ago

          More likely it is a reference to its eponym, the Banach-Tarski paradox.

      • furyofantares 13 years ago

        It is -- it's when they are observing the second big bang.

    • mhurron 13 years ago

      Extremely appropriate as Fry is his own grandfather and the site software can't handle that relationship.

      • Aardwolf 13 years ago

        That's one of my favorite lines from Futurama, 'Ohh, a lesson in not changing history from Mr. "I'm My Own Grandfather"!'

  • davesims 13 years ago

    A few times a month, I'll look up at one of my colleagues and say, "hey, got a sec? I need to talk to the duck," and they know this means I'm going to talk to their head but they can basically keep doing what they're doing and nod occasionally.

    This serves several purposes:

    (1) It's less insane-sounding than actually talking to an inanimate object in an open work environment.

    (2) It actually feels better and forces me to think more clearly when I'm talking to an actual person -- the cognitive focus is higher when the object of conversation can actually, in theory, think and talk back (YMMV).

    (3) And finally, although it does require some focus on the part of the other coder, it's not nearly as taxing to them as actually helping me solve the problem or pairing up with me.

    So it's a good compromise somewhere between pair programming and talking to an actual rubber duck. Again, YMMV. Maybe I'll call it "Pair Ducking."

    • kamjam 13 years ago

      This is so true. Probably about 90% of the time, my colleagues call me over about a problem they're facing, explain in detail what the problem is, and then - eureka! Most of the time it would actually take me much longer to figure out the exact issue, since I don't know the ins/outs and subtleties of the code, but it's exactly as you say.

      Of course, it makes me look really good cos I just "helped" them solve their issue :)

      • cientifico 13 years ago

        Maybe because the solution is in asking the good questions.

        • kamjam 13 years ago

          Yup, for sure it is; sometimes just having someone else there and having to run through all the steps for them points out the obvious. There was another user further down who said he solved a lot of his own problems just by typing them out in enough detail to post on StackOverflow. Same principle.

    • Swizec 13 years ago

      Working from home I tend to just write out my thoughts on a piece of paper. It works perfectly.

      • pseut 13 years ago

        I send myself an email for the same purpose and set up an alias for /dev/null. It makes keeping notes a lot easier since I have the copy in sent mail and can reply to it as needed in the future. (this approach seemed less crazy before I typed it out...)

      • timv 13 years ago

        When I worked from home (and I did it exclusively for 8 months of last year) I ended up talking to my wife (my 1 yo daughter wouldn't stand still for long enough).

        My wife is a social worker by training, so it was pretty rare (though not unheard of) for her to be able to give me real input, but over the years I've trained her well enough to follow most of what I'm saying and nod at the right points :)

    • jsmeaton 13 years ago

      On occasion, I'll write a question on stackoverflow and re-read it a few times before hitting submit just in case I get that eureka moment. I think I've written way more non-submitted questions than submitted questions.

      • kyrias 13 years ago

        You should consider submitting the question and answering it yourself, might help someone else.

      • Brandon0 13 years ago

        I do this a lot too. In fact more often than not I end up not posting the question because either I solve the problem or I think of a possible solution I should go try first. I think we should start calling it Digital Rubber Ducking.

        • hfsktr 13 years ago

          I am looking for an excuse to make digitalrubberduck(y/ie).com

          If I was a bit more clever I feel like there is a use there.

          • tekromancr 13 years ago

            Just make a programming-themed chatterbot, drop in some ads (if you are so inclined), and you are golden! Someone posted a plugin for IntelliJ that allows you to do just that. I would use a web based one if it existed.

            • danohuiginn 13 years ago

              I have occasionally used eliza for this (the basic chatbot in emacs and elsewhere). I'm sure with a bit of tuning you could make a debugging-centred variant of it.

              • tekromancr 13 years ago

                What do the Unit Tests say?

                Have you tried running a debugger and stepping through the code?

                Hmm, go on.

                Wait a sec, can you reexplain that last bit?

    • confluence 13 years ago

      I call it the House method :)

      You bring a detailed problem and break it down, and talk about it to someone else (who often isn't qualified to answer your questions due to knowledge/time constraints) - and in doing so - resolve the problem by challenging one's own assumptions.

      This was effectively how every House episode was resolved.

      • bbrizzi 13 years ago

        When I'm stuck on a problem for way too long, I start typing it out in Stack Overflow. Usually by the time I'm done describing it, I've already solved it.

      • troebr 13 years ago

        Haha, we called it the House method too, we ended up making a cardboard humanoid for when nobody was available.

    • enjo 13 years ago

      I have a whiteboard in a closet I use for this... seriously... I go in there and start jotting down notes and talking to myself.

      It fascinates me how quickly I usually find the answer.

  • RyanMcGreal 13 years ago

    I'm not sure if rubber duck debugging would have helped here. The problem was in the data, not the code. (I know, I know: in Lisp code is data.)

    • DanielBMarkham 13 years ago

      Yep. I thought this through as I was typing my comment.

      (There must be some joke involving the use of a meta-duck, but I can't come up with it. :) (Same principle applies, of course, just LISP makes the determining of "what" a bit more tricky. (insert discussion here about the general differences between debugging imperative and functional code)))

    • chernevik 13 years ago

      That's exactly the sort of thing a duck will tell you. "So what has changed? Let's see, new code, new server, and I fixed the commenter's comment. That was simple I just hard-hacked the comment id and . . . excuse me a second."

      • RyanMcGreal 13 years ago

        Good point. I was thinking in terms of going through the code line by line, which if anything would lead you away from the trail.

    • cube13 13 years ago

      Rubber duck debugging may have actually been the distraction that caused pg to make the mistake, too!

  • z-e-r-o 13 years ago

    For me, this should be called stackoverflow debugging. I genuinely solved a lot of my problems by trying to write a _good_ question on SO about my problem. The problem seems really difficult when I try to ask it in one sentence, straight out of my head. However, once I try to describe the background - what I'm trying to achieve, what I'm using, when the problem happens, simplified down to sub-cases - usually by the time I'm 80% done writing the question, I realize the answer.

    • aroman 13 years ago

      Likewise for me, but with IRC. Though I suppose I should try asking on SO first to save myself the semi-public embarrassment ;)

    • Semaphor 13 years ago

      That happens to me a lot. Most of the time I just formulate the question I have in my head into something coherent and by that point I either have the solution, know what to search for or, in case it's not a question but a comment, I realize it's not worth saying.

    • elcodedocle 13 years ago

      Same here: I've been working on a couple of projects by myself for most of last year, and when even the duck failed, I could usually figure out an answer just by trying to find the words to post my problem on SO in a way somebody would take the time to read and be able to answer. I don't recommend it as a first approach, though, since it's quite time consuming (or maybe I should blame that on not being a native speaker...)

    • robotic 13 years ago

      I've made several posts to SO and then realized the answer moments later. I usually just self-answer.

    • s_baby 13 years ago

      Yup, the incentive is there to state your problem as clearly as possible to get back a good response. By doing this I answer my own question half of the time.

    • confluence 13 years ago

      I'm a serial SO self-answerer. I write really in depth, complicated questions for complicated problems, with code, data and testable cases - and by the time I've finished the question and posted it - I've figured out the solution - or I'll have it a few hours later.

      I usually just leave the question/answer online so that others can benefit from it.

  • biot 13 years ago

    People at work are amazed when I successfully debug an issue over the phone. In reality, it amounts to 50% experience plus another 50% of Sherlock Holmes: "When you have eliminated the impossible, whatever remains, however improbable, must be the truth". Once you've identified what you're dealing with via a few strategic questions, it becomes simple quite rapidly.

    • 65a 13 years ago

      Debugging is often best accomplished as a binary search guided by familiarity/experience. Once you can put bounds on the search, it becomes possible to get the answer in just a few questions.

      Totally agree.
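      The bounded search can be sketched concretely; `git bisect` automates the same idea over commit history. A toy version (the bad commit number is invented for illustration):

```python
def first_bad(is_bad, lo, hi):
    """Binary-search [lo, hi] for the first index where is_bad is True,
    assuming everything before that point is good and everything after is bad."""
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(mid):
            hi = mid      # bug is at mid or earlier
        else:
            lo = mid + 1  # bug is after mid
    return lo

# Hypothetical history: commits 0-99, with a bug introduced at commit 42.
print(first_bad(lambda c: c >= 42, 0, 99))  # 42
```

      Seven probes instead of a hundred is the whole trick: each question halves the search space.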

      • hobs 13 years ago

        This is the best way to work through debugging/troubleshooting as far as I can tell. Amazingly, it's a skill many people lack, while others understand it intuitively without ever thinking about it. That is one of the big divisions between hackers and everyone else in my mind.

        • chii 13 years ago

          i think perhaps some people just don't see the world in a hierarchical way, so in their frame of mind, the problem is intractable.

    • kamjam 13 years ago

      One of my favourite debugging tips has always been "give everyone full access to the folder/service" and see if the problem is "fixed". If so, revert and apply the correct permissions. I've seen this come up so many times, although my superiors always complained "it's not the right way to do it". Whilst I agree "everyone full access" is bad, this was for debugging purposes only!

    • snowwrestler 13 years ago

      It's amazing how often I am able to fix problems by simply trying all the possible solutions--often while colleagues are saying things like "stop wasting time, it can't possibly be that." But of course often it is "that".

  • mvzink 13 years ago

    This is also why pair programming is so great.

    • ryen 13 years ago

      Not easy on a Sunday while at home.

  • amorphid 13 years ago

    This. Even with the best test coverage in the world, you still bump into edge cases that you couldn't have predicted. As a former QA Engineer, I used to say there's still room for QA in a test driven environment. Now I say there's no replacement for a sharp mind with enough knowledge, curiosity, and good judgement.

  • Someone 13 years ago

    Amazing? For anyone who has read Polya's "How to Solve It" (http://en.wikipedia.org/wiki/How_to_Solve_It), that is hardly surprising.

    If you don't understand your problem, you can't make a plan. If you can't make a plan, you can't execute it.

    Another interesting lesson from that book is that one should spend time on evaluation (how did this come about? Could we have fixed this sooner? How are we going to prevent it in the future?)

lifeisstillgood 13 years ago

There are a number of comments that add up to "what steps will you take to ensure this does not happen again" - akin to an incident review. As speculation that's fine; as advice, I don't think it should be listened to.

I am reminded of a long-in-the-tooth sysadmin of my acquaintance who logged in everywhere as root. His theory: "they are my boxes. I screw it up, I fix it." I eventually realised that typing sudo every time he touched a box was no defence against doing the wrong thing.

An awful lot of sites at 1.2m views would have outsourced the running and development of the whole thing - there are entrepreneurs who say it's not even worth our time to code up the MVP. I find this approach sensible from a business point of view, but still it does not sit right with me.

I am supposed to have a nice website with lots of good content to attract inbound marketing - so I tried getting someone on textbroker to write an article for me. It read like a High School essay - no life, no anime. And so I will probably write my own CMS and my own content.

And pg sits there and writes his site in his own language, with his own moderation tools. Apart from the hilarious idea that he could find a ten person ruby shop to outsource to, it's nice to see someone taking the time to play again. It's why I like to see jgc on here too.

I am not entirely sure those thoughts are joined up (I am procrasting like crazy) but if they come to mean anything it's that we are playing in pg's sandbox. If the sand leaks it's his sand, and the only company this is mission critical to is YC.

  • stcredzero 13 years ago

    > It read like a High School essay - no life, no anime. ... I am not entirely sure those thoughts are joined up (I am procrasting like crazy)

    Your procrasting like crazy has much anime.

  • luser001 13 years ago

    > I eventually realised that typing sudo every time he touched a box was no defence against doing the wrong thing.

    IIRC, sudo logs all commands to syslog, which might come in handy. Yes, root commands will be logged by bash to .bash_history, but there are limits on the number of commands kept, complications when you're logged in multiple times to the same account, etc.

    Anyway, that's why I like sudo.

    • UnoriginalGuy 13 years ago

      Plus security. With root login disabled a remote attacker won't have a known username to attack.

    • wpietri 13 years ago

      Beyond the logging, which I love, I use it for the differentiation of states. I'm just a little more attentive when I type "sudo" before something.

      At work, a relatively young engineer accidentally typed a command meant for a test database server into a production window. There was a big rush to restore from backups, and there was a small amount of data loss.

      One thing that came out of the retrospective, requested by the engineer in question, was the production hat. Before you opened up a connection to a production machine, you had to put on the large pirate hat. You could only put it back when you had closed the connections. I didn't really need it, but it was a great way for people to learn the necessary caution.

      It also ended up being a nice exclusive lock on futzing with production, and seeing it in use led to some good discussions that otherwise might not have happened. But the main thing was developing a strong differentiation of states in everybody's heads.

      • darkarmani 13 years ago

        Another cheap solution is colorizing the prompt red for dangerous consoles and green for dev machines. This makes it very easy to notice when you selected the wrong terminal.

      • Spoom 13 years ago

        I use a different background color for the terminal in question; in this case, I use a dark red to signify a production system, and a dark blue to signify a development system. I find it quite useful. You can do this through Xterm profiles (Edit -> Profiles, Terminal -> Change Profile) in Linux and OS X, and I'm sure there's a way to do it in Windows / PuTTY.

      • lifeisstillgood 13 years ago

        I have tried the red / green console but never a pirate hat.

        My son now definitely thinks work is like his school :-)

    • kisielk 13 years ago

      I find that most of the time I'm using sudo I don't want to type it before every command, so I use sudo -i, which pretty much negates the benefit of the logging, other than to show that I became root at some point.

  • lmm 13 years ago

    Typing sudo won't save you, but using a higher-level interface will. Everyone I've ever known to change something in the database by hand, everyone at all, even on a hobby project that they know like the back of their hand, has screwed it up sooner or later. At some point the pain tells you you should stop doing that, and you create an admin tool that lets you do what you need to repeatably and safely.

    • 3pt14159 13 years ago

      I've never screwed it up on a live database, but I do take about 5 mins: first reviewing the keys, the types, and whether or not something can be null, then checking whether critical columns have duplicates:

          SELECT column_name, COUNT(*) FROM table_name  -- placeholders for the column/table under review
          GROUP BY column_name HAVING COUNT(*) > 1;
      

      To make sure that there isn't an underlying uniqueness assumption.

      Sure I could do it in 10 seconds and save myself 290 seconds (a 97% savings!) but then one day I'd have to scramble like crazy in the middle of the night trying to figure out what I screwed up for hours on end.

      I'm not saying don't build an admin tool, obviously those are needed for things like banning users, but just get in there and carefully fix the data if something is wrong.

      • mafro 13 years ago

        This. Back in the day when I was in more of an analyst role, I ended up /having/ to hack on the live DB frequently (reasons for this were myriad).

        1. Always, always make a backup just before the hack.

        2. Write a small set of queries like 3pt14159's to check uniqueness and other pertinent properties.

        3. Write a SELECT query to show the data you are going to change.

        4. Borrow the WHERE clause from 3, and write your UPDATE statement.

        5. Run 4, and then run 3 again to see that you successfully fixed it.

        6. When it goes wrong, restore the backup from 1 :0
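        Steps 2-5 can be sketched end to end. A minimal version using Python's sqlite3 (the table, values, and bad-data condition are all invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE line_items (id INTEGER PRIMARY KEY, quantity INTEGER)")
conn.executemany("INSERT INTO line_items VALUES (?, ?)", [(1, -2), (2, 5), (3, 0)])

where = "quantity < 1"  # the same clause drives steps 3, 4, and 5

# Step 3: preview exactly the rows the UPDATE will touch.
to_fix = conn.execute(f"SELECT id FROM line_items WHERE {where}").fetchall()
print(to_fix)  # [(1,), (3,)]

# Step 4: run the UPDATE inside a transaction (commits on success,
# rolls back automatically if an exception is raised).
with conn:
    conn.execute(f"UPDATE line_items SET quantity = 1 WHERE {where}")

# Step 5: re-run the SELECT to confirm nothing still matches.
remaining = conn.execute(f"SELECT id FROM line_items WHERE {where}").fetchall()
print(remaining)  # []
```

        The point is that one WHERE clause drives the preview, the UPDATE, and the re-check, so you cannot fix a different set of rows than the one you previewed.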

        • jonob7 13 years ago

          This is pretty much exactly how I do it.

          I still sometimes get that sinking feeling in the stomach that I have screwed something up, usually just after I hit the 'execute' button. And I really don't want to have to take the site down to run the restoration.

        • Swannie 13 years ago

          Is there any other way??? :)

          I thought everyone did this - well... for small datasets, skip the backup and use a transaction. Roll back if your step 5 failed and try again.

        • gbog 13 years ago

          6. Should be ROLLBACK, a life saver. Works in postgres.

        • MartinCron 13 years ago

          This reminds me of a feature that I wish that database systems supported: Make it impossible to execute DELETE or UPDATE statements without a WHERE clause.

        • stusmith1977 13 years ago

          I'm going to add step 5b - save what SQL you executed (against what server, and for what reason), ideally in source control, as an audit trail.

          Otherwise, I end up having this conversation (which actually happened):

          Him: <Big Client> is having troubles! Features X, Y, and Z aren't working!

          Me: Hmm, has anything changed? It was all OK on Friday.

          Him: No, nothing's changed.

          Me: Really?

          Him: Well I ran a bunch of scripts on Saturday while I was visiting them.

          Me: OK, so what exactly did you run?

          Him: Just a bunch of scripts.

        • mkubler 13 years ago

          As a tip - Also backup your staging database and have all backups using something like Rsnapshot or maybe even in a version control system, something which does point in time backups.

          I learnt this after I inherited a project which had been written by some Romanians and it was pretty horrible. There was no MVC framework, it was a hacked together mess.

          Somehow the live site started using the staging database instead of the production database, both were on the same server. Every time we (the devs) pushed to staging a script would grab the latest version of the live database and overwrite (drop tables) the staging database. The assumption being that the staging database is a bit like a demo server, changes made to it are temporary and just for testing, but that it should look as similar to the main website (but updated) as possible. The production database was backed up in about 5 different ways, but the staging database wasn't backed up at all.

          After about a week of vanishing books - books which authors had uploaded to the self-publishing portal with descriptions and other information - we realised what was wrong. Their files stayed but their accounts and book details were wiped.

          In another epic fail on the same server, I later moved the root folders by running the following as root (I'd probably have been stupid enough to run the same command if not as root, I'd just have put sudo in front of it):

              > cd /home/<username>/public_html/public_html
              > mv /* ../

          I meant to run mv ./* ../ to move files from the current directory into the one below, because they'd been copied across into the wrong folder. Needless to say, moving root folders such as /etc and especially /lib and /bin is a BAD idea. It is fixable, though, but that's another story.

        • grey-area 13 years ago

          Don't you have a dev db somewhere that you can replicate the live db to? Time spent setting that up will be more than repaid by the time and stress saved when you have to do a quick fix - you can simply run your changes, check it all works on your replicated site, and then make the changes on your live db (preferably with some sort of migration tool which applies the same sql and backs up first). If you have a regular backup process you could tie into that to populate the dev database.

          Even if you can't replicate the entire live db, if you can automate backup, deployment of changes and test first elsewhere it makes the entire process far less fraught.

        • furyg3 13 years ago

          Maybe I'm old school, but shouldn't this be done in a dev or acceptance environment?

          I hack on the "live" DB every day, and by live I mean I sync this DB to another environment, try it out, then run it on prod.

        • gav 13 years ago

          One of the things I prefer to do is to only write UPDATE statements that update a single row. For example instead of:

              UPDATE line_items SET quantity = 1 WHERE quantity < 1;

          I'd script the following updates:

              UPDATE line_items SET quantity = 1 WHERE quantity < 1 AND id = 123;

          For each of the individual rows that needed to be changed. That gives me a check that I'm really updating just the rows I expect; this is especially important to me where the UPDATE involves joins, as I find those the trickiest to get right.

  • philsnow 13 years ago

    s/anime/animus/ ? or is this a new usage of "anime"?

    • lifeisstillgood 13 years ago

      s/anime/anima - as in soul, vitality

      (Not so much Jung's inner woman)

      I think the sentence does read better if it is complaining there are not enough cyberpunk Japanese comics on my site though :-)

      • lalc 13 years ago

        Here I was thinking you were referring to the Japanese meme "No ___, No Life!"

dasil003 13 years ago

I'm not sure whether it's terrifying or relieving to realize that if all I dream of comes to pass and I achieve something akin to the legendary status of pg in the hacker community that I will still be susceptible to the inevitable facepalm moments that come with direct database access.

In any case I am thankful for the detailed explanation.

  • larrys 13 years ago

    Some of the most spectacular airplane crashes are by the most experienced pilots.

    If you've ever tried something new as a hobby you tend to be very careful. Once you gain confidence you take more chances and don't do what even a beginner might do.

    • stcredzero 13 years ago

      There must be a rare personality type that never experiences this kind of overconfidence. Perhaps a less glamorous cousin to the Buddhist beginner's mind?

      • larrys 13 years ago

        I find that if I am doing something "dangerous" (an example might be using power tools) I have to say to myself "be careful, this is dangerous" to avoid being on autopilot and making casual errors. Maybe a better example is the way you train yourself, after you've picked up a box the wrong way and pulled something, to try to remember each and every time to watch your specific movements.

        • dasil003 13 years ago

          I have to remind myself this every time I start up Sequel Pro now since in the last release they switched the command keys for Run Selected... and Run All...

        • kamjam 13 years ago

          But at some point you will become complacent, and it will take a mistake to remind yourself again.

          We've all done it. I shut down an NT4 production server because I was connected via remote desktop and clicked shutdown rather than log off. This was back in the day when there was no pop-up asking for reason you want to shut down and confirmation.

          Luckily it was just our internal intranet server!

          • yuhong 13 years ago

            You were running NT4 Terminal Server Edition?

            • kamjam 13 years ago

              Yes, I think so. It was so long ago and it was my first programming role!

              • yuhong 13 years ago

                AFAIK that edition had logoff instead of shutdown on the Start menu for that reason. You can still access shutdown by hitting Ctrl-Alt-Del on the console or clicking Windows NT Security.

                • kamjam 13 years ago

                  It was so long ago I have no idea. I defo vividly remember that I shut it down via the start menu, just one of many moments that stick out :)

    • dasil003 13 years ago

      Too bad we rarely get a postmortem on batshit insane production hackery that actually goes off without a hitch.

      • shoden 13 years ago

        I would like to turn this into a poster or a t-shirt.

    • ansible 13 years ago

      I agree. I've been told that with motorcycle riding, the first 10K miles are the most dangerous. This is when you've gotten out of the newbie stage, but don't yet understand your own limits nor the bike's limits.

  • hnriot 13 years ago

    he's not legendary for his IT skills.

    • mark-r 13 years ago

      It was his IT skills that got him the big sale to Yahoo that got him the bucks to start YC. Not sure where this comment came from.

      • papsosouid 13 years ago

        You really think yahoo bought viaweb because of pg's legendary ability to reboot servers and type "./configure && make && sudo make install"?

  • Joeri 13 years ago

    The amusing part is that no matter how legendary you become, restarting the server is always a good idea to solve problems. Software is rarely designed to run forever. Last week I had a moment of madness because a line of code remained buggy even after I debugged it. Turned out it was PHP's opcode cache, which just needed a reset to get its wits back.

  • Eliezer 13 years ago

    I thought it was a pretty cool error. I mean, you've got to not screw up on a lot of boring things before you can screw up this interestingly. Most failures are much more boring.

sehugg 13 years ago

Great postmortem and good lessons to learn here:

* Don't manually modify database without a well-tested procedure and another pair of eyes

* Don't leave persistent problems (e.g. memory problems) uninvestigated so that you miss new problems with similar symptoms

* Don't push new code to production while operational problem is ongoing (unless it addresses the operational problem)

I'm pretty sure I've repeated this exact same sequence before with similar results...

  • scotty79 13 years ago

    * while you are displaying a tree keep track of the items you already displayed so you can detect a cycle
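    That guard is a few lines in any language. A sketch in Python (the thread structure is made up):

```python
def render(item_id, kids, depth=0, seen=None):
    """Render a comment subtree, refusing to revisit an id (breaks cycles)."""
    seen = set() if seen is None else seen
    if item_id in seen:
        return [f"{'  ' * depth}[cycle detected at {item_id}]"]
    seen.add(item_id)
    lines = [f"{'  ' * depth}item {item_id}"]
    for kid in kids.get(item_id, []):
        lines.extend(render(kid, kids, depth + 1, seen))
    return lines

# Hypothetical thread where item 3's child list points back at item 2.
kids = {1: [2], 2: [3], 3: [2]}
for line in render(1, kids):
    print(line)
```

    The worst case degrades to a single marker line instead of an infinitely long page.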

    • sukuriant 13 years ago

      I think the assumption there was that it was safe, since the code disallowed this from happening, naturally.

      • scotty79 13 years ago

        Assumption is the mother of all screw ups.

        Even if you think that the code that creates and modifies your data will never put it in some undesired state, the code that uses this data should assume the data may be in any undesired state you can dream up, and should do its best not to do something seriously bad when that happens (like landing in an endless loop/recursion, or executing possibly user-provided strings).

  • JonM 13 years ago

    * Don't push new code to production while operational problem is ongoing (unless it addresses the operational problem) ^^ absolutely!

  • zzzeek 13 years ago

    use CHECK constraints to prevent invalid data patterns when possible.

    • erichocean 13 years ago

      Sadly, even that isn't enough.

      In our production database, I used CHECK constraints religiously. Worked great.

      Then one day, I was no longer able to commit ANY transactions to a particular table, even completely innocuous ones.

      The problem? The database itself had violated its own CHECK constraint on a previous commit, but was enforcing it on all subsequent commits, causing them to fail. Brilliant.

      Moral: not even CHECK constraints will save you.

      ----

      P.s. This was a proprietary database, and when I reported the problem to the vendor (eventually, I figured out how to reproduce it), the vendor actually refunded our (expensive) support contract rather than fix the bug -- they couldn't figure out how to fix it despite having a small bug report that reproduced the problem.

      In the end, I actually had to remove the CHECK constraint altogether. :(

      • zzzeek 13 years ago

        > The problem? The database itself had violated its own CHECK constraint on a previous commit, but was enforcing it on all subsequent commits, causing them to fail. Brilliant.

        I've never heard of that, and unless you're using a really buggy, broken database, it should not be possible.

        > P.s. This was a proprietary database, and when I reported the problem to the vendor

        well there you go. I think in practice, a simple CHECK constraint like the one we'd do here (literally, comment_id > parent_comment_id) is pretty easy to put one's faith into.
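        A toy version of that constraint, sketched in SQLite from Python (table and column names are illustrative, not HN's -- HN doesn't even use a database). Since a comment is always created after its parent, ids only increase, so requiring parent_id < id rules out any cycle:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE comment (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES comment(id),
        -- a parent must have a smaller id than its child
        CHECK (parent_id IS NULL OR parent_id < id)
    )
""")
db.execute("INSERT INTO comment VALUES (1, NULL)")
db.execute("INSERT INTO comment VALUES (2, 1)")   # normal reply: allowed

try:
    # pg's accident: re-parenting a comment under its own descendant
    db.execute("UPDATE comment SET parent_id = 2 WHERE id = 1")
except sqlite3.IntegrityError:
    print("loop rejected by CHECK constraint")
```

        The CHECK fires on both INSERT and UPDATE, so even repl-style surgery done through SQL would have been refused.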

    • papsosouid 13 years ago

      That's hard to do when you are afraid of databases and just store everything in files.

neurotech1 13 years ago

This should serve as an example template for how to accurately and transparently explain to users what went wrong. No deflecting blame, no useless platitudes.

Credit to PG, RTM and the rest of the team for keeping the site's uptime as high as it is.

  • jgrahamc 13 years ago

    "No deflecting blame"

    Who were they going to blame?

    • wpietri 13 years ago

      He could have blamed the new server. Or whatever distracted him. Or the user, for being dumb. I've seen people do all of those.

      Or he could have just dodged the blame entirely.

      • dguaraglia 13 years ago

        Some people are just freaking difficult to work with. I've worked with people that wouldn't accept responsibility even after every other possible cause was ruled out. I've even gotten this reply: "well, you must have been unlucky to get the faulty e-mail, because it seems to work most of the time". Yeah, because that's how programming works: cowboy coding and hoping for the best.

        This guy actually called bugfixes "optimizations": "hey, X the feedback widget on the front page isn't working", "oh, yeah, I haven't worked on that because that's code that needs to be 'optimized', so it's low on my issues list". Ugh.

        I've learned my lesson now. In fact, that's an incredible lesson for a startup founder: never, ever hire someone who dodges a question in an interview. And the first time they avoid taking responsibility for something that was clearly their fault, fire them. The last thing you want is someone who'll blame everyone and anything else for their issues. It's a great way to kill morale and create rifts in a small team.

        • artursapek 13 years ago

          pg has an essay where he says the smartest people he knows are always willing to take blame or admit they don't know the answer to a question.

        • paganel 13 years ago

          > And the first time they avoid taking responsibility for something that was clearly their fault, fire them.

          I guess it's still difficult for a lot of people to acknowledge their own mistakes, maybe because they're afraid of getting fired for that (acknowledging the mistake), which in the case of startups/small companies happens very rarely.

          From my own experience working at startups for my entire professional career as a programmer (7 and a half years), I can tell you that the first step when you notice you f.cked something up is to take immediate responsibility, and then to ask yourself "how can I/we fix this?" (you might need the help of other people to fix your mistake). After you've fixed the issue the question should be "how can we make sure this doesn't happen again?". Once that's solved, I'd say nobody cares anymore whose fault it was to begin with; there's always other more important stuff to do.

          I agree that maybe at larger companies this kind of thing might happen exactly the opposite way, i.e. you can get fired for making a mistake and nobody really cares to fix other people's stuff, because their next paycheck/financial well-being does not depend on that (or so they think).

          • dguaraglia 13 years ago

            Exactly! When everyone is in the same boat the priority is fixing the stuff, then wondering whether attributing responsibility is important (most of the time it isn't. Who cares who fucked up the e-mail template, as long as it's fixed the next time it runs?)

            I think in this guy's case the causes for his reluctance (or rather, incapability) to accept his own mistakes had much deeper roots. He was literally the most self-centered person I've ever met, to the point that he wouldn't accept anyone's opinion on anything. He went out of his way to find doctors that'd go with his suggestions and run all kinds of tests on him to determine why he had a blood pressure problem, when he was clearly way overweight and had the unhealthiest diet I've ever seen. He'd dress up in shorts and t-shirts during the worst days of winter, and then take a niacin pill to force a capillary rush so his hands wouldn't feel cold (?)

            Basically, he just thought the world had to bend to his will. Why use common sense, when you can just say "fuck it" and find a workaround that fits your mindset. Of course you can't expect someone like that to 'accept' his own shortcomings.

            The scariest part of all this is he tried, for a while, to become a cop. Yep, imagine that: a 240lbs armed prick, completely unable to reason, forcing his way on everyone. I shudder at the thought.

            • lotyrin 13 years ago

              > imagine that: a 240lbs armed prick, completely unable to reason, forcing his way on everyone. I shudder at the thought.

              Where are you from where that's something you have to imagine, because I want to move there.

    • reddit_clone 13 years ago

      The user who replied incorrectly?

    • vertis 13 years ago

      Sequoia Capital or Andreessen Horowitz

  • smackfu 13 years ago

    I don't know, it's a lot easier to be transparent when the stakes are so low. Most service providers have a real incentive to not put out quotes that can later be used against them, which tends to make explanations very technical or deflecting.

larrys 13 years ago

"But then I decided to just fix it for him by doing some surgery in the repl."

I've always found it's a good idea not to deviate. Whether it be running, parking, or anything else, once you deviate from some regular behavior you run into potential problems that you hadn't anticipated.

"For some reason I didn't check the comments after the surgery to see if they were in the right place. "

More or less my point. If this wasn't a deviation from normal behavior you would have "checked the comments after the surgery", because it would either have become habit, or the sheer number of times you tried a fix resulting in an error would have made that more likely to occur.

  • irahul 13 years ago

    > I've always found it's a good idea to not deviate.

    Aren't you assuming "surgery in repl" is a deviation? What if it's normal course of action for him?

    > More or less my point. If this wasn't a deviation from normal behavior you would have "checked the comments after the surgery" because it would have either become habit or the shear number of times you tried a fix resulting in an error would have made that more likely to occur.

    How about the opposite scenario? He has done it so many times with desired results, that he didn't bother checking?

    • badgar 13 years ago

      > Aren't you assuming "surgery in repl" is a deviation? What if it's normal course of action for him?

      This is a big difference between engineering and hacking. An engineer would never regularly do something so dangerous.

      But I suspect pg isn't an engineer when he works on HN, I suspect he is a hacker, and just does whatever he wants to, whenever he wants. Which is his prerogative.

  • prawks 13 years ago

    > I've always found it's a good idea to not deviate.

    > you run into potential problems that you hadn't anticipated.

    The second statement is no reason to live by the first. In fact, I think you'd be doing yourself a disservice by staying so comfortable. Being comfortable with the unanticipated, however, is a powerful quality to have.

    • rwallace 13 years ago

      That's fine, but try to become comfortable with the unanticipated on a test server, not the production server.

  • scotty79 13 years ago

    If you don't deviate from what you usually do you don't learn.

    Obviously don't deviate from routine (or rather prescribed procedure) when you are running a nuclear power plant or doing airplane maintenance. But when tinkering with a site that gives you no money and won't cost any lives, you can loosen up a bit.

tolmasky 13 years ago

Why do "self posts" like this show up in the same light gray as posts with negative vote counts? My eyes aren't great and I find it hard to read.

  • hayksaakian 13 years ago

    Maybe post color is based on some get_text_post_color method that applies to self posts and comments, where the color depends on comment karma. Given that self posts like the OP are votable as if they were a normal link post, their comment karma value is probably 0.

  • emillon 13 years ago

    The rationale for this is that if you need to post a long text post, it should be in the form of a blog post instead. I agree with you that it's not really suited to a meta post.

  • vl 13 years ago

    I'm using "Hacker News Enhancement Suite" Chrome extension - it fixes multiple problems, including this one.

irahul 13 years ago

Disclaimer: Hindsight is 20/20, and stuff.

If reverting code didn't fix it, and reverting the server didn't fix it, incorrect data is the most likely culprit (I am not claiming this should have outright occurred to you; just thinking out loud). I take it you introduced non-terminating recursion by making a thread its own parent, and you made the change on disk.

But this analysis is the last thing that comes to mind when you have already introduced 2 new variables the same day - new code, new server. And an old, recurring variable (GCing too much) is in play as well.

benatkin 13 years ago

So what do you do to avoid this in the future? Do you stop doing surgery in the repl, or do you do the surgery with functions that check for cycles from now on?

  • badgar 13 years ago

    > So what do you do to avoid this in the future?

    It's HN... there's no SLA, there's no postmortems, there's no doing things better in the future. pg just runs this site out of the good of his heart, we should be lucky the volunteers run it for us at all.

    • larrys 13 years ago

      "pg just runs this site out of the good of his heart"

      Don't think it's a "good of his heart" situation. HN provides a benefit to YC and YC companies and attracts people to YC. As another example, Fred Wilson has a very popular blog, AVC, and has said many times that he considers it "his secret weapon" (or something like that) because of the value it provides over his competition.

    • benatkin 13 years ago

      > It's HN... there's no SLA, there's no postmortems

      I didn't mean to imply that there were. I was just curious.

      > there's no doing things better in the future. pg just runs this site out of the good of his heart, we should be lucky the volunteers run it for us at all.

      Since there are multiple volunteers, I think that the site always feels important to at least one of them. I imagine that some of them have gone more than a week without giving a shit about HN, but not all of them at once. So I think there is doing things better in the future. In fact, HN keeps getting improvements behind the scenes, to keep it running, keep it interesting, and keep it from getting overrun with trolls.

    • martinced 13 years ago

      I don't think the parent meant that he had a "right" to expect some level of quality or anything.

      It's just that we, as programmers, tend to take measures so that silly bugs do not happen anymore or that, at least, we leave big clues as to what went wrong.

      In a project I had a similar issue: I was wrapping lists inside immutable lists but, due to a silly bug, I kept wrapping immutable lists inside immutable lists at every save made. So saved files would grow bigger and bigger.

      And I did fix the bug, and also added a big fat warning log in case too many nested lists were detected.

      pg might just as well have now added something preventing infinite recursion inside the comment tree, or some WARN logging telling when a generated page is getting too big, etc.

      I'd still find it very interesting to know what pg did, if anything, to dodge / minimize / make it easier to determine if such an issue happens in the future.

    • jlgreco 13 years ago

      > there's no postmortems

      1. Press [Home] key.

      2. Read postmortem.

      3. ???

    • oh_sigh 13 years ago

      > pg just runs this site out of the good of his heart

      Hilarious. I would have believed you if you appended "and his wallet"

      • badgar 13 years ago

        One core. One HD. Bandwidth is trivial with no images. How much do you think this site costs to run?

        • jlgreco 13 years ago

          He means that PG runs the site because it makes business sense, not out of charity.

        • benatkin 13 years ago

          How much do you think the time of YC partners is worth?

        • gknoy 13 years ago

          His time. Maybe the core and HD are Very Nice ones, too, of course.

        • oh_sigh 13 years ago

          You didn't understand what I said. I was implying that this site is a money-maker for pg, not that it costs him money.

  • ww520 13 years ago

    This reminds me of the countless conversations I had with people after a crisis. What can we do to prevent this from happening again? What process can we put in place? What restrictions need to be tightened up?

    And that's how processes are born.

    • tantalor 13 years ago

      > And that's how processes are born.

      Not necessarily. Processes are implemented by people, so they can break at any time.

      The correct solution is more code, or less bad code.

Legion 13 years ago

"We'll do it live!"

  • jbuzbee 13 years ago

    "Hey, Hold my beer and watch this!"

gruseom 13 years ago

This is a particularly endearing piece of "hacker news". It's so easy to relate to.

lucb1e 13 years ago

Are you saying you manually modify the database? Like, shifting around things by id instead of just making admin buttons next to posts?

  • RKoutnik 13 years ago

    Sometimes that's easier (albeit more dangerous, as we just saw).

  • sgt 13 years ago

    I think I get what you're hinting at.

    Ok, so this is Hacker News, it's in the name, and most of us are aware that HN is also a research/hobby project. It's not made to be a rock-stable enterprise system doing bank transactions or what not, so I think what pg did was perfectly excusable. People make mistakes. Nobody will die without HN for a day or two, and it won't affect the site's popularity one bit.

    • lucb1e 13 years ago

      No, that's right, but I worry about apache2 being down for potentially one or two users or bots that visit/crawl my website during a one-minute reboot. Meanwhile the big boys are down for 16 hours because they do things that any other person would have gotten a decent scolding for. Just look at the points per hour this thread is getting; if I had posted this about my website on my website, people would have said I was stupid.

      You are right though, making mistakes is human, as they say, and nobody dies because of this. In fact, less popularity might be good for the site's content quality. I'm just surprised by how much they care about thousands of hourly users; that's what I would dream of having.

      • RyanZAG 13 years ago

        I'd say this is a pretty important lesson: you don't need flawless technology and zero downtime to be popular and/or profitable. You need content worth viewing. People are more than willing to put up with technical errors if it's something they want/need.

    Focus on providing people what they want/need, and don't worry so much about having flawless technology until you can employ a horde of PhDs.

        • artursapek 13 years ago

          I would say that's an observation you just made there.

      • chc 13 years ago

        Patrick McKenzie had a great horror story on his blog a couple of years back. He runs a service that provides appointment reminders to businesses' clients (e.g. "Don't forget, you have an appointment to get your hair colored at Best Little Hair House tomorrow at 3"). Long story short, an attempt to manually correct a hangup in the live system resulted in his product spamming his customers' clients (that's right — not just his customers, but their customers) with up to 40 phone calls back-to-back.

        So, how many customers do you think he lost because of this? The answer is two, and one of them signed back up because they were impressed by the great job he did in handling the fiasco.

        Moral of the story: As long as you really are making your best effort, you might be surprised how willing people are to deal with human error. Yes, they might be mad, but a mistake is (usually) not the end of the world.

      • alex_c 13 years ago

        >Just look at the points per hour this thread is getting, if I had posted this about my website on my website people would have said I was stupid.

        For what it's worth, I upvoted this thread specifically because we've all done something this stupid (or worse) :)

  • irahul 13 years ago

    HN runs on plain files. He wasn't modifying a database, but calling functions (I believe) in the repl to change the parent id of the thread.

    But that apart, even if there were an admin button to change the parent id of a thread, he would still have made the same mistake.

    Unless the code in question was checking for loops. In that case, repl would have worked the same.

    • lucb1e 13 years ago

      I sort of meant that you shouldn't modify things like that directly. Be it a filesystem, a database, or any other place where messing things up can bring a rather strong server down.

      • triplepoint217 13 years ago

        Yes, I was trying to make posts editable on the HN instance I run, so I got clever and started messing with the files in emacs. Then I learned that the HN code does not like files in the story directory with ~ on the end of their name (emacs backup files), oops ;).

      • irahul 13 years ago

        I see where you are coming from. But I am saying this didn't happen because he did things live. This happened because he entered an incorrect id, making a thread its own parent (or grandparent; doesn't matter).

        This is the kind of mistake one would make even if writing proper migrations. That he was doing things live isn't the issue; neither is an incorrect id. The issue is that the code doesn't check for loops.

        • chernevik 13 years ago

          I am but an egg; I have two questions.

          One, if the data were held in a database, should a change like this be captured in the database logs? I am seeing more and more situations where I want these, I notice that they are by default turned off for mysql and wonder if this reflects a de facto judgment that logging slows performance more than is usually worthwhile.

          Two, if the data were kept in a database, wouldn't something like this be prevented by a constraint preventing a comment from making itself an ancestor? But I suppose there is a slight performance hit in checking such constraints, and the case arises so rarely that this hit isn't generally worthwhile.

          • lmm 13 years ago

            Databases, at least the SQL kind, really aren't good at dealing with hierarchical data, and I don't know how you'd even begin to express that kind of constraint. I don't think a traditional database is the answer here. (If it were me, once I'd done it more than twice I'd write a "move thread" admin tool in the UI, and after I screwed it up like this I'd have a place to add such a check to).

            • mr_luc 13 years ago

              If you were using some kind of representation for Nested Sets -- left-to-right depth-first numbering, or a human-readable id.id.id chain -- then it's really easy to write a constraint for that: parent.left < my.left AND parent.right > my.right, or dotted_id.split('.').filter{|first, rest| return false if rest.contains first} (yeah, yeah, that second pseudocode would be an unrealistic PITA for some DBs).

              More generally:

              I'm not a big SQL wonk anymore, but I find a lot of people have the intuition that relational databases are ill-suited for trees.

              An intuition that is much closer to the truth is that almost all databases can handle trees pretty well, because there's still an unambiguous concept of ordering and containment, and you can usually arrange things so as to do range/ancestor/inclusion queries efficiently.

              It's graphs with loops/without unambiguous concept of ordering/containment that are really hard.
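              A toy version of the nested-set containment check in Python (the field names and numbering below are illustrative): with depth-first left/right numbering, a child's interval must sit strictly inside its parent's, so a loop is arithmetically impossible to encode.

```python
def nested_inside(parent, child):
    """True if child's (left, right) interval lies strictly inside parent's."""
    return parent["left"] < child["left"] and child["right"] < parent["right"]

story = {"left": 1, "right": 6}   # the root item
reply = {"left": 2, "right": 5}   # a genuine child: interval nests inside

assert nested_inside(story, reply)      # a real parent/child pair
assert not nested_inside(reply, story)  # the "own grandchild" case can't nest
```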

              • EEGuy 13 years ago

                Found this: The excellent Postgres documentation includes an SQL graph search with two different ways of graph cycle checking, here: http://www.postgresql.org/docs/9.0/static/queries-with.html

                One way involves accumulating an array of nodes already visited as the tree gets walked, checking each node as-visited for membership in the array-to-date.

                The other method, a bit more of a hack, is just adding a LIMIT clause.

                I think the 'WITH' clause is a great addition to the SQL standard, very much worth learning the weirdness of its syntax and its optional 'RECURSIVE' term (which, as the Postgres documentation points out, isn't really recursion, it's iteration).
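                The visited-path variant translates to SQLite too (which also supports WITH RECURSIVE), so it's easy to try from Python. The schema and ids below are made up; the data deliberately mimics the own-grandchild accident:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, parent_id INTEGER)")
# A deliberate loop in the parent chain: 1 -> 3 -> 2 -> 1
db.executemany("INSERT INTO post VALUES (?, ?)", [(1, 3), (2, 1), (3, 2)])

# Walk upward from post 1, carrying the chain of visited ids as a
# delimited string; flag the step that revisits one and stop there.
rows = db.execute("""
    WITH RECURSIVE walk(id, parent_id, path, cycle) AS (
        SELECT id, parent_id, ',' || id || ',', 0
        FROM post WHERE id = 1
        UNION ALL
        SELECT p.id, p.parent_id,
               w.path || p.id || ',',
               instr(w.path, ',' || p.id || ',') > 0
        FROM post p JOIN walk w ON p.id = w.parent_id
        WHERE NOT w.cycle
    )
    SELECT id, cycle FROM walk
""").fetchall()
```

                The final row revisits id 1 with its cycle flag set, and the WHERE clause stops the iteration there instead of looping forever.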

            • irahul 13 years ago

              I think if you want the family tree, you can write a self-referential (assuming the post table is self-referential, as it should be) recursive query.

              But in this case, writing a before insert/update trigger which ensures some_post.created_at < parent.created_at before setting parent.parent_id = some_post will do the trick.

          • irahul 13 years ago

            > I notice that they are by default turned off for mysql and wonder if this reflects a de facto judgment that logging slows performance more than is usually worthwhile.

            I think it's more like your application is doing the logging already (probably; most frameworks do). If you really need it, turn it on yourself.

            > Two, if the data were kept in a database, wouldn't something like this be prevented by a constraint preventing a comment from making itself an ancestor?

            Copy pasting the table from another comment.

                create table post (id int primary key, parent_id int references post(id), child_id int references post(id), created_at timestamp)
            

            There isn't a simple check constraint you can place to ensure a parent's, or a grand-parent's, or a grand-grand-parent's parent_id isn't child.id. You will have to write a trigger.

            This isn't really a big problem to solve. pg simply overlooked this problem. Had he not, he would have checked child.created_at > parent.created_at in his mutator method. So, when you do post.parent = some_post (assuming the mutator is parent=; replace it with post.setParent or (send post set-parent some-post) or whatever), it checks if post.created_at > some_post.created_at, and then assigns post.parent_id = some_post.id

    • gummadi 13 years ago

      Glad that you brought this up. If you have more knowledge regarding this, can you please explain how exactly the posts & nested comments are stored directly using flat files? How are concurrency issues handled?

  • martinced 13 years ago

    "Are you saying you manually modify the database?"

    Oh manually modifying production database on the fly ain't unheard of.

    However it's still not "very Chuck Norris" on a scale of Chuck Norrisness compared to the modification of a running app directly in the REPL. I mean: it doesn't matter if you manually modify the DB itself or not when you directly modify the app from the REPL itself (the app being anyway "in charge" of the DB).

    Sure, manually modifying the production DB might be an issue to some. But I can guarantee you that it's the last of your worries when you're actually modifying production code directly from the REPL ; )

birken 13 years ago

Do you have munin monitoring on the production HN server?

That would really make situations like this easier to debug. First, it can pinpoint exactly when something started happening, which in this situation might have helped you realize the problem was caused by your change. Secondly, in this specific situation it probably would have been easier to differentiate a situation where you are running low on memory vs this completely different situation.

As somebody who spent a lot of time professionally debugging large software systems when they were misbehaving (as a Google SRE), I can tell you that looking at graphs of many key metrics (disk IO, CPU, memory, then application specific things) was always the place to start when debugging a situation, because you can learn so many things right away. When did it start? Was it a slow buildup or an immediate thing? What is the general problem (Memory?, Disk IO?, CPU?, none of the above?)? Has a similar pattern happened in the past?

Then you can start to get fancy and plot things like "messages/minute" or something and then it becomes easy to see when issues are affecting the site performance and when they aren't.

  • stcredzero 13 years ago

    That and something like the Smalltalk Change Log would have made this a no-brainer debug. (Yes, every REPL action in Smalltalk got logged by the same mechanism that logged every code change.) Such mechanisms aren't trivial, but they're not rocket science either, and they have tremendous ROI.

znowi 13 years ago

I wonder what exactly distracted you :) When I do surgery on a production server, I triple-check making sure everything works properly.

I have two assumptions: 1. HN has a low priority in the overall scheme of things, 2. Self-confidence overflow :)

nowarninglabel 13 years ago

Happens to a lot of us. Great reason to always write tested cleanup scripts for this stuff instead of editing directly on the server. The only time I brought down my product last year was from a similar screwup, I was removing users by hand and somehow managed to end up with a 0 in my list of user ids, thus deleting the anonymous user, and causing havoc to my server, which took a long time to track down.

dap 13 years ago

Thanks for the detailed explanation.

It sounds like everything was done to fix the problem except try to figure out what the problem actually was. Why not use tools to see what the program is doing, form a hypothesis, gather data to confirm or reject the hypothesis, repeat until cause found, and then take corrective action that by this point you have high confidence will work?

I realize HN is more of a side project than a production service, but the goal is the same in both cases: to restore service quickly so you can move on to other things. It feels like a more rigorous approach would allow restoring service much faster than randomly guessing about what could be wrong and applying (costly) corrective action to see if it helps.

Besides that, in many cases (including this one), you cannot randomly guess the appropriate corrective action without finding the root cause.

luser001 13 years ago

I use assertions to protect against things like this.

I liberally sprinkle my code with assertions (CS theory calls them pre-conditions and post-conditions, iirc) to crash early if the system is in an invalid state.

One of my pet peeves is that few programmers seem to love assertions like I do. Would love to see comments on this.

  • swah 13 years ago

    The kind of assertion he needed though, could only be ensured by the database, not application code (my impression).

    • luser001 13 years ago

      Agreed, infinite loops are a little hard to protect using asserts.

      When I hit the first infinite loop bug on a code path, I frequently add code to assert that the number of calls is less than $A_LARGE_NUMBER to catch future occurrences of the same root cause.
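      A Python sketch of that pattern (the names here are invented): bound the walk so a cycle trips an assertion instead of generating an infinitely long page.

```python
A_LARGE_NUMBER = 10_000  # far deeper than any legitimate comment chain

def ancestors(parents, post_id):
    """Collect ancestor ids, asserting the chain stays plausibly finite.

    `parents` maps id -> parent id (or None for a root).
    """
    chain = []
    current = parents[post_id]
    while current is not None:
        # If the chain is absurdly long, something is wrong -- almost
        # certainly a cycle. Crash loudly rather than loop forever.
        assert len(chain) < A_LARGE_NUMBER, "ancestor chain too long: cycle?"
        chain.append(current)
        current = parents[current]
    return chain
```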

      • badgar 13 years ago

        > Agreed, infinite loops are a little hard to protect using asserts.

            assert(is_tree(comment_graph))
        

        Typically, a composite entity (like an "item" on HN which has many "comments") will define invariants to ensure data integrity. In this case, the invariant is that an "item"'s comments form a tree.

        The database layer often contains this logic, but it depends on how you're building your application; NoSQL backends for example typically must put validation in the application layer. Since HN just uses files, a well-developed application layer should be riddled with invariants like this.
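        One possible shape for such an is_tree check, assuming the comment graph is kept as an id -> children mapping (a guess; the real HN data structures are Arc objects in flat files): a traversal from the root must never see the same id twice.

```python
def is_tree(children, root):
    """True if the graph reachable from `root` is a tree.

    `children` maps id -> list of child ids. Returns False on a cycle,
    or when a node is reachable by two paths (two parents).
    """
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in seen:
            return False  # revisited a node: not a tree
        seen.add(node)
        stack.extend(children.get(node, []))
    return True
```

        Running the assert on every write (rather than every read) would have caught the own-grandchild loop at surgery time, when it was cheap to notice.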

      • swah 13 years ago

        This is similar to the "while with timeout" that is common in embedded code (of course, watchdogs are better...)

    • irahul 13 years ago

      The kind of assertion he needed could not be ensured by the database. The kind of assertion he needed was there are no cycles in the graph. How would you ensure that in a database?

      Also, HN uses flat files, not a database.

      • swah 13 years ago

        I was thinking something like (supposing the comments were stored as "closure tables" like Karwin suggests):

          CREATE TABLE comment_tree (
           ancestor_id INT NOT NULL REFERENCES comments(id),
           descendant_id INT NOT NULL REFERENCES comments(id),
           CHECK ( ancestor_id <> descendant_id )
          )
        

        but I'm probably overlooking something. (I'm aware that HN uses flat files, I was just making a counter-point to the "simple assert" solution...)

        • irahul 13 years ago

          That will prevent a child being its own parent. It won't work for more than one level, i.e. a post being its own grandchild. Assume (post_id, parent_id) sequence: (1, 3) -> (2, 1) -> (3, 2).

          • zzzeek 13 years ago

            you can assert that "post_id > parent_id", assuming comments are always created subsequent to the creation of their parents (as is the case here) and that integer identifiers are always increasing (otherwise use timestamps). (1, 3) above would indicate an invalid case (not necessarily a cycle, but a precondition for one).
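
            As a sketch (hypothetical names, Python for illustration), that ordering invariant is a one-line assert at write time:

```python
# Hedged sketch: assumes ids are assigned in strictly increasing order,
# so a parent always has a smaller id than any of its children.
def set_parent(post_id, parent_id):
    # Rules out links like (1, 3) - a precondition for any cycle.
    assert parent_id < post_id, "parent must predate child"
    return (post_id, parent_id)
```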

          • swah 13 years ago

            Please note that the Closure Table solution "involves storing all paths through the tree, not just those with a direct parent-child relationship."

            • irahul 13 years ago

              My bad. I was speed reading, and didn't read the "Closure Table" part.

      • Someone 13 years ago

        A constraint on the parent-child link table "Child creation timestamp > Parent creation timestamp" would do it.

        Might not be a bad idea, if the site were to have the two requirements "maintenance must be done on the live site from a repl" and "5 nines availability".

        • irahul 13 years ago

          How are you modelling your data? I think this should be a self-reference:

              create table post (id int primary key, parent_id int references post(id), child_id int references post(id), created_at timestamp)
          

          How will you place the check constraint? You only have parent_id and child_id, not parent and child entities. You will have to write a trigger.

          I am not saying this can't or shouldn't be done. I am saying a db won't directly solve it.

          However, your example will work perfectly for enforcing the constraint in code via a mutator, which can compare child and parent timestamps, provided pg was doing it via a mutator and not directly changing the ids.

  • timothya 13 years ago

    What assertion would you have used in this case? For every comment you'd have to iterate through all its parents to check if there is a cycle, which seems pretty inefficient to do for something that should never happen. (There are other ways you could check for this problem as you go, but the only ones I can think of require holding extra state just to perform the assertion.)

    I'm for assertions when they are simple and don't cost much (especially during development), but it's not feasible to check every condition that should not happen.

    • petercooper 13 years ago

      You could assert a limit on depth, perhaps. Then the cycle would still exist but after X number of comments, the rendering ends.
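
      A sketch of that cap (the constant and names are made up; Python for illustration):

```python
MAX_RENDER_DEPTH = 50  # hypothetical cap; real threads rarely go this deep

def render(comment, depth=0):
    if depth >= MAX_RENDER_DEPTH:
        return []  # truncate: turns an infinite page into a finite one
    lines = [("  " * depth) + comment["text"]]
    for child in comment["children"]:
        lines.extend(render(child, depth + 1))
    return lines
```

Even with a cycle in the data, the page is at most `MAX_RENDER_DEPTH` comments long instead of infinite.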

      • timothya 13 years ago

        This is a reasonable solution. While it will (almost) never provide the correct result (it might print out a cycle of comments until X is reached, or it might cut off a very long but legitimate comment thread), it would provide a reasonable guarantee on this sort of problem not generating infinite pages.

        • petercooper 13 years ago

          At the risk of being accused of flame-baiting, I'd say it's the engineering solution rather than the mathematical one.. ;-)

          For some reason I tend to be a fan of the "stick it in a secure box" rather than "get it right in the first place" approach..

    • badgar 13 years ago

      > iterate through all its parents to check if there is a cycle, which seems pretty inefficient to do for something that should never happen

      The number of parents is almost always under 3 or 4 and never over 100. Writes occur a few times a second at peak. You are prematurely optimizing.

    • zzzeek 13 years ago

      Typically, if you're operating upon a particular comment, you've gotten there by traversing to it from the parent. Ensuring that traversals don't encounter cycles is easy: keep a hashtable (or a set) of comment ids as you traverse. As the traversal encounters a comment, its id is added to the set, and as you complete traversal of each comment, the id is removed. If you encounter an id that's already in the set, the assertion has failed - or better yet, log the condition and then cease the traversal. That way everything keeps running and the error is visible in the logs.

      If the code is organized (as it should be) such that all functions which require traversal of hierarchical comments pull this from a single function, then the hash check only need be applied in that one place in the code, where it need not be visible anywhere else.
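
      In Python, that single traversal entry point might look like this (a hedged sketch; names are hypothetical):

```python
import logging

def walk(comment, visiting=None):
    """The one function all hierarchical-comment traversals go through."""
    if visiting is None:
        visiting = set()
    if comment["id"] in visiting:
        # Log and stop rather than assert, so pages keep rendering.
        logging.error("cycle detected at comment %s", comment["id"])
        return
    visiting.add(comment["id"])
    yield comment
    for child in comment["children"]:
        yield from walk(child, visiting)
    visiting.discard(comment["id"])  # done with this subtree, per the scheme above
```

Any rendering or counting code iterates over `walk(root)` and never sees a comment twice.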

d0m 13 years ago

Hacking code in the repl without testing the new behavior - we've all done that. Don't lie. Once I wanted to quick-fix a "gmail.ca" to "gmail.com", which I did... but for all the users instead of just the mistaken one. Fortunately I realized my mistake really fast ;-)

Uchikoma 13 years ago

Appreciating the details.

"Hacker News was down all last night."

With the internet there is no "last night" ;-) Europe - and more so Asia I assume - had to live for many working hours without HN.

  • pramodliv1 13 years ago

    Yeah, I was more productive yesterday. But I did read Google's cached versions of HN.

scotthtaylor 13 years ago

PG, quick question: Did this impact the server hosting the YC Summer 2013 applications?

When I tried to edit mine, it simply said "Thanks, scotthtaylor"

fnordfnordfnord 13 years ago

>>I caused an item to be its own grandchild.

Please forgive me. I know you folks tend to hate jokes on here. Don't waste your time if you're immune to corny humor: "I'm My Own Grandpa" - Ray Stevens (with family tree diagram): http://www.youtube.com/watch?v=eYlJH81dSiw

robomartin 13 years ago

Great story! Yup, this kind of thing happens. For some reason it reminded me of something that happened to me as a newbie engineer. It was really funny a week later.

I was troubleshooting an intermittent problem in a piece of equipment. It had several boards full of mostly LS TTL logic chips (yes, them chips). It was the kind of problem that only happened once every other day or two. Nobody knew. So, I had all kinds of instruments attached to this thing and was watching it like a hawk waiting for a failure. It had probes attached to every point in the circuit where I suspected I could see something and learn about the source of the problem. I also tested for thermal issues with heat guns and freeze sprays, familiar troubleshooting techniques to anyone who's done this kind of thing.

Anyhow, every so often the thing would go nuts. The three scopes I had connected to it showed things I simply didn't understand. I'd analyze but couldn't make any sense out of it. Still, again, every so many days it would happen again. Changed power supplies and the usual suspects. No difference.

Well, finally, two weeks later, the other engineers in the office took pity on me and told me what was going on: they had connected a VARIAC to the power strip I was using to power the UUT (unit under test). The scopes and other test instruments remained on clean power. Every so often they'd reach into the drawer where the VARIAC was hidden and lower my power strip's voltage just enough for the power supplies to fall out of regulation and for everything to start beeping and sputtering. Those friggin' SOBs. They had me going for days! I was pissed beyond recognition. Of course, after a while I was laughing my ass off alongside them. Good joke. Cruel, but good.

My revenge: A CO2 fire extinguisher rigged to go off into his crotch when my buddy sat down to work.

Fun place to work. We did this kind of stuff all the time. Today I'd be afraid of getting sued. People have really thin skins these days.

DanI-S 13 years ago

n.b. that this is why time travel is a terrible idea.

  • cmaggard 13 years ago

    The grandfather paradox is the best solution to the halting problem.

  • w_t_payne 13 years ago

    Because the computers that run the Matrix will get overloaded?

louischatriot 13 years ago

Funny to see that this happens to everyone. A week ago, while testing some stuff to locate a low-importance bug, I erased the whole user database. Fortunately we have a good restore so the problem was solved in a few minutes, but still, cold sweat here ...

neilxdsouza 13 years ago

Isn't it curious that the comet incident over Russia happened so close to the pass of DA 14? In the intro to the book:

http://ruby.bastardsbook.com/about/#why

is a note about surgical instruments left inside patients. It seems like just a coincidence that this happened so close to the switch to the new server, but I wonder if it's something deeper in the subconscious mind; the change to the new server is quite a big change (I know I feel that way when I've purchased a new computer - it feels different, even if it's running the same Linux as before) and could have upset the normal checks one has in place when tweaking things.

cool-RR 13 years ago

Great debugging story!

I guess the lesson is to have code that alerts you about comment loops without going into an infinite loop.

Also another lesson would be to figure out a way to have better clarity into which requests are causing a timeout on the server.

raheemm 13 years ago

I'm curious how RTM was able to notice that the problem seemed related to a specific item id. It would be great if he wrote a short blurb similar to yours. Which also makes me wonder: why doesn't RTM write much?

Posibyte 13 years ago

I absolutely love post-mortems like this. It clearly identifies that there was a problem, what the author tried in order to fix it, and whether each attempt was successful. Even if it ends with the author not knowing too much about the solution that was used, it's still interesting to see the workflow and be able to derive something from it.

It's also why I like to read pg's articles so much. They're so in-depth and detailed that you aren't left thinking something was held back for the sake of being hidden.

rnadna 13 years ago

I fall into a similar misdirected-focus trap, but mine is simpler: I waste an embarrassing amount of time editing the wrong damned file. After a sequence of small tweaks that yield no change in the results, I make a huge change, see nothing, and then realize that I've done it yet again. I need to write a vim macro that blanks the screen every few minutes, displaying the message "are you SURE this is the right file?"

mikedmiked 13 years ago

> created a loop in the comment tree; I caused an item to be its own grandchild.

Ah, the online forum equivalent of going back in time to kill your grandfather.

RKoutnik 13 years ago

It's nice to know that even the mightiest of us can still make mistakes. Thanks for being willing to admit mistakes so the rest of us can learn.

johnobrien1010 13 years ago

Thanks for fixing it.

Have you considered avoiding dipping into the repl to do these kind of fixes? You don't owe any of us any sort of uptime guarantee, and you're a much better programmer than I, but it strikes me as odd that you would hack against the live server instead of create some tool that would make it so you couldn't take down the whole site when making this kind of fix...

sideproject 13 years ago

Thank goodness it's back. I lost my meaning of existence for the entire day. I don't know where my yesterday went. I'm ok now. :)

corwinstephen 13 years ago

It's never what you think it is. One time, I had a memory leak in a Rails app that took me TWO WEEKS to find. In the end, it came down to me putting a line of config code in the wrong section of the config file, which for some reason created a recursive loop and caused my servers to crash about once every 30 minutes. #weak

xentronium 13 years ago

Whoa, what an unfortunate coincidence. This whole bug would be so much easier to find, if it weren't for the new server.

  • lmm 13 years ago

    The bugs that actually hit production are always like this - a confluence of three or so factors - because if it were simpler you'd have caught it earlier.

    (Though I have to say, upgrading the code at the same time as you're restarting to fix a problem is really a rookie mistake. It's incredibly tempting because it saves so much time, but if you do it you will get it wrong sooner or later. One of the hardest skills in programming is acquiring that zen that you need to wait in a state of readiness for the effects of your first change to make themselves apparent, rather than changing something else)

GnarfGnarf 13 years ago

That's funny -- I work in genealogy software, and loops ("being your own grandpa") happen all the time, due to data entry errors. To avoid infinite recursion, we always keep track of what records we've processed already, check whether "I've been there before", and bail out if the answer is affirmative.

cranklin 13 years ago

You are honest and I respect that. I'm sure many companies try to play off their downtime as something far more sophisticated when in fact, it was something too embarrassing to admit. I've certainly had my fair share of embarrassingly stupid mistakes that resulted in downtime.

richforrester 13 years ago

Cheers for that pg - now I have to explain to my boss why I was actually productive yesterday.

aaronh 13 years ago

My pet peeve: You made an arbitrary change while debugging a problem. NOW YOU HAVE N^2 PROBLEMS!

rjempson 13 years ago

That is why some organizations don't allow ad hoc data fixes to be run in production. Best practice is to back up the database, run the fix against the backup, test the fix against the backup, and, all being well, run the fix against production.

T-zex 13 years ago

Thank you for the honest explanation. This is not so easy especially for a famous person.

bramcohen 13 years ago

You should probably make your code robust to this sort of data corruption in the future.

infoseckid 13 years ago

"I don't know if Nick succeeded, because in the middle of this I gave up and went to bed." - Not a good example for your holding companies :) What would happen if they all went to bed when something goes wrong? :) Just kidding.

harrisreynolds 13 years ago

Just about anyone that has programmed for any length of time has done something like this. It is one of those "fixes" that after it's actually fixed you try to never think of it again. Good to know PG is mortal. :-)

ricardobeat 13 years ago

Related question: what is the timing for the 'Reply' link to show up? I might be fantasizing but sometimes it takes 5, sometimes 10 minutes to appear, leading people to reply as a sibling instead.

  • DanBC 13 years ago

    Deeply nested comments tend to be hot, and so the reply link takes a while to show up to try to give people some time to think about what they're going to say.

    • ricardobeat 13 years ago

      Interesting. So it slows down discussion as it progresses, until it either stops, is forgotten or becomes a series of long essays.

  • beagle3 13 years ago

    The deeper it is, the longer it takes to appear (I think at depth 7 or 8 it starts counting in hours, and at some point it just won't appear at all).

    Just like too many nested if(x) { if (y) { if (z) ... }} constructs, too deep a discussion nesting is also unreadable.

ibudiallo 13 years ago

When Hacker News was down, I finally lifted my head and realized that there is life beyond the screen of my phone.

Now that it's back, I've realized it's finally time to create an account :)

nournia 13 years ago

It seems that on your new server, and also with the latest pushed code, I can't `like` anything. Honestly, it's not a new bug and I've gotten used to it; don't worry about it.

btilly 13 years ago

That explains something weird I saw.

If I went to Google's cached copy, I could see threads and click through to them. But the front page itself was down, even though individual threads loaded.

Very confusing.

dennisgorelik 13 years ago

Did you add code that detects very deep nesting (e.g. depth greater than 100) and throws a meaningful exception to help developers diagnose the problem?

hnriot 13 years ago

It's a good job it's your site; this type of thing is often what gets someone fired at a company. Modifying (meddling with!) the production system directly.

  • lucb1e 13 years ago

    Upvoted. This is what I wanted to reply, but then thought better of it and moderated my response.

    • irahul 13 years ago

      I think you people already know the answer. The amount of freedom and the stake/reward system for pg are different from yours.

      I can't speak for pg, but personally, I am not going to write a migration to re-parent a single thread if the site in question is my side project, doesn't bring in revenue, and has some intangible benefits, but not so many that they warrant putting much labor into it.

      Either it would be `thread.parent = new_parent_id`; or, if it occurred to me that it might introduce a loop, changing `parent=` to take loops into account, followed by `thread.parent = new_parent_id`. What were you expecting? A bug tracker discussion, code commit, review, change request and deployment?

      • hnriot 13 years ago

        No, you just leave it alone; it would have sorted itself out if nothing had been done at all.

        Failing that, you put a cap in the code that generates the page. Simple stuff.

      • dap 13 years ago

        The problem is weighing the cost of implementing the feature against that of doing manual surgery on the production database, without realizing that if you choose the latter, you're also choosing the risk of spending hours debugging the system when the surgery goes wrong. It's a tradeoff, to be sure, but it's not clear that the latter is cheaper in expectation.

  • JohnBooty 13 years ago

    I disagree. Very few companies would think negatively of an engineer if they made such a mistake on a non-essential, non-revenue-generating fun/research project.

    How many dollars did YC lose because of the outage? None. (Maybe they saved a few on bandwidth!)

    I also predict that exactly zero startups will say, "Man... I'm not going to take seed money from those guys! They had discussion forum downtime."

    • hnriot 13 years ago

      They could save even more if they shut it down! That's a ridiculous thing to say. We could all save money that way.

      You've obviously never worked in a for-profit corporation. In such companies there are policies and practices put in place to prevent just this kind of newbie mistake. You never modify the live database directly. Never ever. Whether it's a bottom-line property or not.

      I didn't say it would negatively impact YC's business. It might make them look incompetent, but these things happen; people don't approach YC for their website savvy, they go there for the money and the connections. Most of the VC firms I've EIR'd at have much worse IT than HN. Their sites are barely usable. It seems to just go with the territory.

      Let's not be so defensive. pg can do what he likes with his site, including taking it down whenever he feels like saving bandwidth. But in the real world these kinds of things get real people on the fast track to their exit interview.

      • JohnBooty 13 years ago

        "You've obviously never worked in a for profit corporation"

        Wow, really? I don't think that attitude is warranted at all.

        At any company (for-profit or otherwise) there is a finite amount of time and money -- and surely we can agree that solid development/deployment practices carry an upfront time/money cost, can't we?

        In an ideal world, all projects would have continuous build processes, automated tests, and management tools extensive enough to render live database surgery unnecessary.

        Perhaps you've worked at companies so flush with cash that every single line of code, research project or otherwise, has gone through rigorous development/testing/deployment practices. If so, I'm jealous. I've always worked at companies that had to be choosy about how they spend their resources.

      • st0p 13 years ago

        I've worked at a lot of for-profit businesses (not banks, though; I can understand that in those kinds of businesses the requirements are different) and made live database updates at all of them. In 99% of the cases all goes well, and in the remaining 1% you need to revert to your backup (always make backups before doing anything!) and have a couple of minutes of downtime.

        It really depends on what kind of business you're in whether this is acceptable or not.

  • wpietri 13 years ago

    Dumb companies, maybe.

    The goal was reasonable. The action was reasonable. What are you firing somebody for? Making mistakes? Good luck making that a hiring criterion. "Ok, tell us about a time you made a mistake and what you learned from it. What's that? You never have made one? Great, you're hired!"

    The solution from a retrospective should never be, "Let's make people more scared to do the right thing." Or "Let's fire people with bad luck." Firing people of PG's caliber isn't a solution, it's just another problem.

patrickwiseman 13 years ago

Don't worry I just figured out the totally bone-headed programming mistake I made at noon today. Time is a good mediator between skill and stress.

sgt 13 years ago

Much appreciated, pg. I knew that the "10 minutes of downtime" would not occur (fair enough, this was not related to the server upgrade).

calinet6 13 years ago

Ok, I'll just say it: that's just plain dumb. It's a rare case, but a simple condition would have checked and prevented this. :)

mempko 13 years ago

Did you hear about the tortoise and the hare?

DrJosiah 13 years ago

Everyone fat-fingers a database at some point... Then you build interfaces so that you can't make the same mistake.

Jplenbrook 13 years ago

Why does PG maintain the website himself? I would think he would have many better things to do with his time.

pilas2000 13 years ago

That's funny because one of the top posts in progit yesterday was about the Hare and Tortoise algorithm
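
For reference, Floyd's tortoise-and-hare applied to a chain of parent links might look like this sketch (hypothetical `parents` mapping from comment id to parent id, `None` at the root):

```python
# Hedged sketch of Floyd's cycle detection on an ancestor chain.
def has_cycle(parents, start):
    slow = fast = start
    while fast is not None and parents.get(fast) is not None:
        slow = parents.get(slow)               # tortoise: one link
        fast = parents.get(parents.get(fast))  # hare: two links
        if slow is not None and slow == fast:
            return True   # the pointers met inside a loop
    return False  # the hare walked off the root: no cycle
```

Unlike the visited-set approach, this needs no extra memory proportional to the chain length.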

dylangs1030 13 years ago

Thanks for the explanation pg. As you said in the original thread, "you know how these things go..."

campnic 13 years ago

The nice thing about surgery with a computer program on a server is that death is not permanent.

orangethirty 13 years ago

It makes me feel good knowing better programmers than me go through the same issues I face. :)

thedaveoflife 13 years ago

I think this demonstrates how many people browse the /threads?id=pg page (myself included)

meshko 13 years ago

TIL there are still large web sites out there that do not have staging environment.

andreasklinger 13 years ago

I appreciate (if not love) the fact that you bugfix and server-change yourself.

True hacker spirit.

blantonl 13 years ago

> Sorry about that.

No worries.

So, are we back on the new server? Or was this too much for one transition? :)

Nux 13 years ago

I was almost sure it was Anonymous! ... Are you in Anonymous, pg? :D

afshinmeh 13 years ago

Same problem in Iran; I couldn't access HN all last day.

wpeterson 13 years ago

I guess it's time for NewRelic to add an Arc agent.

w_t_payne 13 years ago

That sort of thing is fine for a startup in its first year or two of life, but HN has been around for a while now... surely you must have some sort of process by now?

bestest 13 years ago

So, uh, still fixing stuff in production?

cincinnatus 13 years ago

The cobbler's children have no shoes :-)

dahumpty 13 years ago

pg,

Just wondering why HN isn't hosted in the cloud (e.g. on AWS, Rackspace, etc.). How do you back up all the data?

  • wtracy 13 years ago

    Because that would cost more?

    I don't really know what the benefit of cloud hosting would be in this case.

nigo 13 years ago

I appreciate pg's frankness here.

keikun17 13 years ago

I hope that user wasn't me. I was editing a typo in a comment right when it happened.

DocG 13 years ago

I think we have a new king!

Awesome explanation.

eluos 13 years ago

"I am my own grandpa"

youngerdryas 13 years ago

>On a comment thread, a new user had posted some replies as siblings instead of children. I posted a comment explaining how HN worked. But then I decided to just fix it for him by doing some surgery in the repl.

No good deed goes unpunished!

People sometimes reply as a sibling because they're too impatient to wait out the built-in delay on child comments.

Thanks for keeping the experiment going.

naturalethic 13 years ago

If the problem existed before the code update, why would you assume it was the code update that caused the problem?

bobsoap 13 years ago

After breaking many things myself due to similar, seemingly miniscule edits, I have implemented an ABC routine: Always Be Checking. Even if it was "just" something like moving a piece of code or something equally tiny, I always check after the fix.

So far, it has been working great.

jack57 13 years ago

Are you sure that commenter's name wasn't Philip J. Fry?

  • jack57 13 years ago

    Apologies for the trivial comment