Apologies and postmortem for Silicon Florist downtime this morning

Rick Turoczy

11 years ago

You know how those awesome Web services always do some sort of statement or postmortem after their sites are down? It’s awesome, right? And insightful.

Well, this won’t be that. Nor is it a site that’s considered “mission critical.”

But you have to understand—as silly as it may be—I feel as if you and I have an unwritten and unspoken SLA. That my bad prose and crappy headlines will be here whenever you need them. And when they’re not? I feel as if I’ve let you down.

Long story short, I’m sorry that this happened.

So let me dissect this morning’s downtime. If only for cathartic reasons. It wasn’t a happy time.

That said, I’m not going to throw my host under the bus. This stuff happens. Screws fall out all of the time. The world is an imperfect place.

Precursor: On Friday evenings, I send out the Silicon Florist email newsletter—via Mailchimp (<3)—which recaps stories from the previous week and directs users to the site for additional details. Apparently, a good number of folks read this email on Saturday morning. As some form of torture or something.

So here we go…

6:55AM PDT: Darren Stowell kindly alerts me that none of the links in the Silicon Florist newsletter are working. I immediately try the site and am unable to connect to anything.

That’s odd, I think. I haven’t been mucking with WordPress. Or breaking it like I usually do. And, honestly, this error appears to be happening even before the site is hit. What happened?

My first thought? My original host was acquired a few years back. I suddenly wonder if maybe some domain name server stuff might be out of whack.

But it’s early in the morning. So I fall back on refreshing the page. In the hopes that some sort of Internet reverse entropy will take hold. And fix the whole thing. So I press that little circling arrow. Like every 30 seconds. From my phone.

7:18AM PDT: After a series of refreshes and increasing nervousness and inbound emails, I submit a ticket to my host. With a sort of “WTF?”

8:39AM PDT: No response from host. But they’re a small local shop. And it’s a Saturday morning. So I press on with my assumption: “It’s looking like it might be a DNS issue…? Are you seeing anything on your end?”

8:54AM PDT: I send a tweet to my host asking about response times.

@CanvasDreams What's the expected wait time on high priority tickets on the weekend?

— Rick Turoczy (@turoczy) April 6, 2013

9:00AM PDT: Still no response from my host. So I run a traceroute to test my assumptions. Sure enough. The traceroute to siliconflorist.com shows the connection stalling out before it hits my host. I upload my traceroute to the support ticket to help with diagnosing the problem.
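For the curious, here is roughly what "stalling out before it hits my host" looks like when you read a traceroute programmatically. This is only a sketch: the sample output below is made up for illustration (documentation-range addresses, a placeholder hostname), not my actual traceroute, and the helper name `last_responding_hop` is hypothetical. It assumes the common Unix-style format where a timed-out hop prints only asterisks.

```python
import re

# A hop line starts with the hop number, e.g. " 3  * * *"
HOP = re.compile(r"^\s*(\d+)\s+(.*)$")

def last_responding_hop(lines):
    """Return (hop_number, first_field) of the last hop that answered, or None."""
    last = None
    for line in lines:
        m = HOP.match(line)
        if not m:
            continue  # header or blank line
        hop, rest = int(m.group(1)), m.group(2)
        # A hop that timed out shows nothing but asterisks
        if rest.replace("*", "").strip():
            last = (hop, rest.split()[0])
    return last

# Invented sample output, for illustration only
sample = [
    "traceroute to example.invalid (203.0.113.10), 30 hops max",
    " 1  192.0.2.1  1.2 ms  1.1 ms  1.0 ms",
    " 2  198.51.100.7  9.8 ms  9.9 ms  10.1 ms",
    " 3  * * *",
    " 4  * * *",
]

print(last_responding_hop(sample))
```

If the last responding hop sits well short of the host's network, the problem is probably somewhere upstream of the server itself. Which is about all I could conclude at 9AM on a Saturday.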

9:10AM PDT: I solicit help on Twitter. Asking if any other West Coasters are experiencing issues.

Anyone else on the West Coast having weird site issues this morning? Here's my traceroute http://t.co/MMV14MbaUH

— Rick Turoczy (@turoczy) April 6, 2013

Andrew Hyde lends a little moral support. Aaron Hockley confirms that he is not experiencing any issues.

9:15AM PDT: I jump back into email to discover that Adam Boettiger has been kind enough to run his own traceroute and provide some DNS insights. Basically confirming my suspicions.

DNS checks

Delegation

Superfluous name server listed at parent: ns1.taproothosting.com

A name server listed at the parent, but not at the child, was found. This is most likely an administrative error. You should update the parent to match the name servers at the child as soon as possible.

Superfluous name server listed at parent: ns2.taproothosting.com

A name server listed at the parent, but not at the child, was found. This is most likely an administrative error. You should update the parent to match the name servers at the child as soon as possible.

Total parent/child glue mismatch.

The parent lists name servers that the child doesnt know about, see details in advanced. This configuration could actually work but breaks very easily if one of these zones change slightly.

Nameserver

granite.canvasdreams.com.

Everything is fine.

All tests successful in this part, no errors or warnings.

granite2.canvasdreams.com.

Everything is fine.

All tests successful in this part, no errors or warnings.

Consistency

Everything is fine.
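The heart of that report is a set comparison: the name servers the parent zone lists versus the ones the child zone lists. A minimal sketch of that comparison follows. It assumes the child's set is the two granite servers shown in the report (the report doesn't spell that out explicitly), it does no live DNS lookups, and the function name `delegation_mismatch` is made up for this example.

```python
def delegation_mismatch(parent_ns, child_ns):
    """Compare the NS names listed at the parent with those listed at the child."""
    def norm(names):
        # Ignore case and the trailing dot of fully qualified names
        return {n.strip().rstrip(".").lower() for n in names}

    parent, child = norm(parent_ns), norm(child_ns)
    return {
        "superfluous_at_parent": sorted(parent - child),
        "missing_at_parent": sorted(child - parent),
        # "Total mismatch": the two sets share no name server at all
        "total_mismatch": bool(parent or child) and parent.isdisjoint(child),
    }

# Names taken from the report above; the child set is an assumption
parent = ["ns1.taproothosting.com", "ns2.taproothosting.com"]
child = ["granite.canvasdreams.com.", "granite2.canvasdreams.com."]

report = delegation_mismatch(parent, child)
print(report)
```

With those inputs, both taproothosting.com servers come back as superfluous at the parent and the sets don't overlap at all, which matches the warnings Adam sent over.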

9:16AM PDT: I submit a snarky “*crickets* Hello?” message in an attempt to get a response from my host’s tech support.

9:19AM PDT: My host’s tech support finally confirms that they’re getting my messages and looking into the problem.

9:24AM PDT: I respond to the ticket. Realizing that their support for these time periods is in the UK, I apologize for mucking with a Saturday evening.

9:44AM PDT: With no additional communication, I upload Adam’s traceroute and DNS assessment. I postulate that my initial Occam’s Razor assumption—that something from the acquisition days has failed, mainly the DNS records—is still the best assumption and ask if I should change the DNS records.

9:49AM PDT: My host’s support crew sends another “looking into it” message.

10:14AM PDT: Out of frustration, I change the nameservers to those highlighted by Adam and advise the support team that I am doing so. But encourage them to look into the issue, in case other people are still on the same old DNS.

10:20AM PDT: Frustration levels with my technical ineptitude are running high.

That moment when frustration over a technical issue transcends from anger to abject depression

— Rick Turoczy (@turoczy) April 6, 2013

10:31AM PDT: Early propagation points are correctly directing folks to my site. I confirm with Twitter that other folks are seeing the same thing.

Think @siliconflorist is fixed. For the sake of my sanity, please let me know if you can see it? Thanks in advance http://t.co/7MbSuHMYqq

— Rick Turoczy (@turoczy) April 6, 2013

Oh. Um. And I'd like to apologize to all of those folks who may have accidentally read the content on @siliconflorist.

— Rick Turoczy (@turoczy) April 6, 2013

10:50AM PDT: Enough confirmations have rolled in that I’m feeling a little better that the problem may be fixed.

Currently: My ticket remains open. Still “waiting on tech.” Awaiting final confirmation from the host that my workaround was appropriate.

Again, I apologize for the downtime

I realize that Silicon Florist isn’t critical to your business. But I feel like we’ve got a good thing going here. You know, with me writing. And with you overlooking my shitty writing to extract some value out of the content.

It’s symbiotic.

So I thought I owed you an explanation.

Sorry for the downtime. My sincere apologies. Seriously. But I think we’ve got it fixed.

We now return you to your regularly scheduled barely intelligible gibberish.