The Importance of SQL Server Backups
About the Author: Web Monkey

With security clearance and well over a decade of experience with SQL Server, Web Monkey is the resident DBA at Norb Technologies and a real database guru.

A regular contributor to SQL Server Club, his background includes software development, hardware, a smattering of essential networking knowledge and an unhealthy interest in web security. He has an obsession with automating repetitive tasks, but has a tendency to forget where he left his last open bag of fruit and nut...

24 Hours: In a Server Room

We tend to put backups to the back of our minds, but they really are important. No, make that REALLY, job-losing important.

When I first started as a junior DBA, a wise old guy told me to remember three things about data, in order of importance:

  1. Resilience
  2. Availability
  3. Performance

The first is the most important. It means always have a backup. Preferably off-server, better still, off-site. Even if your end users can't get to their data right now, it's better being able to get to it eventually than not at all.

Not at all is bad.
Not at all means boss time.
Not at all means angry people.
Not at all can easily mean new job time.

Sunday, 12:00

Those words of wisdom have never been drummed home to me better than by an incident last week here at Norb Towers. We'd had our annual power outage for routine maintenance work over the weekend. Nothing unusual, just a checkup.

Sunday, 15:00

The power came back on and our servers booted with no problems, except for an old Ethernet router which had finally died. It wasn't labelled, but I knew our off-site backups ran through it.

This was no cause for alarm, though: the off-site backups only ran on week nights, so nothing would break as long as we replaced the router before they ran on Monday night. No problem, we'd sort it out during the day on Monday so that we could go home after a hard Sunday's work.

Sunday, 19:00

The only thing was, on Sunday night the air-conditioning failed in the server room. No problem: the room was monitored, and someone would get a call.

Wrong...

It was monitored. Remember that old router? Well, when it failed it took out not just the link to our off-site backup centre, but also the link to our monitoring service. Normally that would have been flagged as a problem in its own right, because the servers in our server room would stop responding to pings from the monitoring centre. But our monitoring service knew we had a scheduled power outage and was simply awaiting our call to tell them to start monitoring again.

That call was never made...

Sunday, 22:00

Our monitoring service assumed the outage had taken longer than expected. In fairness to them, they did phone our infrastructure manager later that evening, and he told them about the router. So another assumption was made, this time at our end: we assumed that the server room would be OK until 8am the next day without being monitored.

Wrong again...

Monday, 07:00

What happened? The air-conditioning had gone down shortly after 7pm on Sunday, the servers carried on serving, and the air temperature went up. Way, way up: 64°C in just 12 hours.

When our support guys came in on Monday morning, the walls of the server room were dripping where the heat had sucked the last of the moisture out of the plaster.

The bricks behind them had soaked up so much heat they were creating an oven effect. Some of the servers had already shut themselves down, but most were still running and pumping yet more heat out.

We took each server down, opened the doors and waited for the air-con engineer to arrive. By 11am the air-con was fixed but, amazingly, heat was still coming out of the walls. We tentatively brought each server back up, bracing ourselves as the important ones booted.

Monday, 09:00

Sure enough, the one box that wouldn't come back up was the most important: the one hosting Team Foundation Server, which held our entire source code repository, the development team's complete output. We checked the hardware and it came down to a disk failure; the whole disk array was refusing to report, even after multiple reboots.

Fortunately, we had a job scheduled every weekday night to back these servers up to an off-site location. On-site backups wouldn't have been much use on this occasion: for security they would most likely have been kept in the same server room, and so would have been exposed to the same risks as the servers they were protecting.
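The job itself isn't shown here, but a minimal sketch of the kind of statement a nightly off-site backup job might run looks like the following (the database name and share path are made up for illustration):

  -- Full backup written straight to a UNC share at another site,
  -- so the backup file never lives on the server it protects.
  -- Database name and path are hypothetical examples.
  BACKUP DATABASE TfsSourceControl
  TO DISK = N'\\offsite\sqlbackups\TfsSourceControl_full.bak'
  WITH INIT,        -- overwrite any previous backup set in this file
       CHECKSUM,    -- catch page corruption at backup time
       STATS = 10;  -- report progress every 10%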

Monday, 12:00

We replaced the router and the faulty drives in the disk array, re-established the link to our off-site location and kicked off a copy of the database backups. Fortunately every one restored without a problem. Suffice it to say, we ordered a replacement and copied everything across to it the next day, just in case the original failed again.
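Restoring from the copied files is the standard RESTORE DATABASE routine. A rough sketch, again with made-up names and paths, would be something like this:

  -- Restore the copied backup onto the rebuilt disk array.
  -- Logical file names and paths are hypothetical; check the real ones
  -- first with RESTORE FILELISTONLY against the backup file.
  RESTORE DATABASE TfsSourceControl
  FROM DISK = N'D:\Restores\TfsSourceControl_full.bak'
  WITH MOVE 'TfsSourceControl' TO N'D:\Data\TfsSourceControl.mdf',
       MOVE 'TfsSourceControl_log' TO N'L:\Logs\TfsSourceControl_log.ldf',
       CHECKSUM,    -- re-verify page checksums while restoring
       RECOVERY,    -- bring the database online once the restore completes
       STATS = 10;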

In conclusion, we were very, very lucky this time.

If we'd not had the backups available to us when we lost the development drives, we'd have had real problems recovering our business. Fortunately, we did have them. But as an added precaution, our backup frequency has now been increased to twice per week for critical servers such as our development server, and everything is taken off site. In addition, we've added bi-monthly checks at our disaster recovery (DR) site to make sure we can restore randomly selected backups of critical servers onto a new box, should the worst happen and a backup turn out to be corrupted or unrestorable for some reason.
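Those DR checks are full test restores onto a spare box, which is the only real proof a backup is good. Even so, a quick verification pass over each backup file catches a lot and is cheap to run as part of the backup job; something along these lines (the path is made up):

  -- VERIFYONLY checks that the backup set is complete and readable
  -- without actually restoring it. It does not replace test restores.
  -- Path is a hypothetical example.
  RESTORE VERIFYONLY
  FROM DISK = N'\\dr-site\sqlbackups\TfsSourceControl_full.bak'
  WITH CHECKSUM;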

We calculated that we can afford to lose a couple of days of coding, but beyond that the cumulative cost of developers re-writing work they'd already completed after a single incident like this exceeds the annual cost of the extra backups and storage space. So we can easily justify the extra cost, even if something like this happens only once a year.

Remember, you can't stop human error unless humans do nothing. You can only prepare yourself for the worst possible scenario. And in a high-pressure world like IT, where everything's done to tight deadlines and nothing ever works quite how you expect it to, preparing for the worst is the only option if you want to keep your job as a DBA. If you do recover from it, put it down to planning rather than luck, because at some point the latter will run out.
