Something about restoring your server
Wednesday, 07 December 2011
1. What restoring are you talking about?
A computer system is not fault-tolerant just because you make backup copies. A fault-tolerant system is one that can be restored after a crash. No backup procedure makes sense unless you are able to restore data from the backup. The truth is that most systems administrators have never tried to restore their servers from a backup, and this holds regardless of company size or IT head count. The task is usually postponed to a later, better time, passed on to other team members (the so-called “football game”), or performed only partially (“let’s try to restore the database only and consider everything to be OK if it is restored”).
But failures do not warn you in advance. They are not going to call you to say “get ready, this will happen in a week”, nor will they send you an e-mail message. So what problems do administrators with no experience in restoring servers face when a critical failure occurs? (Frankly, this applies not only to servers, but to mission-critical workstations as well.)
The backup copy does not contain all the necessary data.
This is a very typical situation in my experience. In one company, backups of user documents and the warehouse database were running successfully. However, after the primary hard disk failed, it was discovered that the backup contained no e-mail files. When the general manager asked “Why?”, the systems administrator answered: “Why should I copy it? That was just e-mail.” The technical support staff considered their own e-mail unimportant and extended this judgment to all e-mail in the company. Unfortunately, the general manager’s e-mail contained very important customer and supplier communication history, as well as contacts, price lists and so on.
Had the administrator mentioned above tried to restore the data before the failure occurred, he might have found out that the backup procedure was not well planned.
Make copies of all data processed within the enterprise. If something is considered not worth backing up, most probably you don’t need this data at all. Just delete it and free up the disk space. Oh, you can’t, because it’s important data? Then make copies.
There was another case in a company whose primary business was selling tires. They had a modern tire mounting machine controlled by proprietary software installed on a PC-compatible computer running Windows XP. No one from IT was responsible for that computer: management considered that it did not need support, and the IT staff were not about to take on the responsibility themselves. Eventually, the device was hit by a car and the computer was damaged. There was no backup copy of this computer, and the installation files for the business software were not available either. Because the device manufacturer would not send the installation files separately, the company was forced to order a whole new computer from Germany. Considering that the company was unable to run its business for a week, the incident turned out to be quite expensive.
Make copies of all mission-critical computers. The whole company may depend on a single workstation that runs some special business software. It’s up to you to set up a backup workstation and prepare spare parts before the failure occurs. If you consider a computer unimportant, try turning it off and see whether it really is.
The backup procedure is not documented and tested.
In theory, the restore procedure may look easy and sound. In practice, you will definitely face issues that are uncommon and are not discussed even in books and forums. Under pressure, we try to address those issues in a hurry, by any means, including desperate measures.
I once faced a serious incident that made both IT and management staff really nervous. A single failed power supply burned out the server from the inside: all the hard drives, the motherboard and even the processors were gone. There was no backup server in place, so one of the systems administrators gave us his own home computer, which was powerful enough to run the required server software. Unfortunately, the Windows Server 2003 operating system was unable to boot after being restored on that computer. The computer kept resetting in a loop during kernel initialization.
The basic reason was on the surface and quite understandable: a kernel architecture incompatibility (Uniprocessor PC, Multiprocessor PC, ACPI PC and so on). But what should we do about it? No one was able to answer this question. Is the backup copy in place, is it available, does it contain all the data? Yes. Are the appropriate hardware and installation files there? Yes. Well, but what next?
We got it working after two days and two nights spent trying to restore the system again and again. Still, until the very last moment no one knew whether we were going to win. The alternative, reinstalling everything from scratch, sounded terrible.
Test your recovery plan and document it carefully. List all possible failure scenarios and the appropriate solutions. Pay more attention to complex, rare situations and record them in the knowledge base. Make sure the responsible IT personnel understand your instructions and are trained to execute them step by step.
The software or hardware spares necessary for the restore are missing.
Sometimes we cannot perform a restore because of very simple things, something silly like an IDE drive jumper or a network interface card, something that is available in any store, but not when you’re in a hurry. And it gets even worse when such trifles add up to an overwhelming wave.
I was asked to assist a company where the responsible personnel got stuck restoring a server. Everything went wrong from the beginning: the failure occurred at a bad time, and its cause was uncertain. When they decided to replace the whole server, it turned out that the new server did not contain any hard disks. When they acquired SATA disks and decided to install them instead of the required SCSI disks, it turned out that the spare server did not have a SATA controller on its motherboard. Where do you get a controller card on a Friday night?
When they got the required SATA card on Monday, things did not get any better. Surprise, surprise! It turned out that they did not have an installation CD. In fact, it had never existed, but no one cared. To make things even funnier, the responsible staff started searching for a CD key only when the installer asked for it. Of course, it was too hard to prepare everything in good time.
It took them a week to restore system functionality.
Another case involved a video surveillance system. It was doing its job pretty well: management used it to examine resource usage in the company, and from time to time the police asked to see the faces of individuals who had paid with stolen credit cards, and so on. But no one ever thought about how valuable the video was until water flooded the building and the video computer failed to boot.
The disk containing the video data looked undamaged, so we decided to reinstall only the operating system. Guess where we got stuck? There was no driver available for the video capture card, nor the appropriate video capture software. The hardware manufacturer had dropped support for the legacy capture card some time before and offered to sell us a new capture card and software bundle.
Solving this issue took a month. During that time, the company’s business processes did not stop, but no one was really happy about the absence of video recording.
A backup itself is not the only thing you need to restore the system. Make sure you also have all the installation files, serial numbers, manufacturer’s documentation, cables, tools and even a screwdriver. Test your recovery plan before you find out that something from this list is missing.
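One way to keep such a list honest is to check it automatically. The following is a minimal sketch, assuming the restore kit is kept as files and folders under a single directory; the function name missing_items and the item names in the usage note are hypothetical.

```python
from pathlib import Path


def missing_items(kit_root, required):
    """Return the entries from `required` that do not exist under kit_root."""
    root = Path(kit_root)
    # Keep the original order so the report matches the checklist.
    return [item for item in required if not (root / item).exists()]
```

Called as missing_items("D:/restore-kit", ["install.iso", "serials.txt", "drivers"]), it returns whatever is absent, so an empty list means the kit is complete. Physical items such as cables and tools still need a manual check.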
2. So, how do we do the server backup the right way?
The typical server backup includes the following three data types:
Documents and databases that can be backed up by simple copying;
Databases that require special backup procedures (MS SQL, MS Exchange and others);
System State, installed software and server OS Registry.
Usually, the first two entries of this list are not a problem for a systems administrator, so let’s leave them and move on to the last one, which administrators very commonly misunderstand. The Windows backup and restore topic includes the concept of System State. What is it? For a typical Windows Server, it includes:
Boot files, system files;
COM+ Class Registration database;
Internet Information Services (IIS) metabase, if the service is installed;
Certificate Authority database, if installed;
Cluster Service information, if installed;
Active Directory domain database, if installed;
Windows system and Program Files folders (Windows Server 2008 and later).
When planning disaster recovery procedures for a particular Windows Server, systems administrators value System State because it contains a copy of the Registry and Active Directory. But the issue is that this type of backup does not include all the installed software. Nor does it include user profiles.
There was another case in my practice when a Terminal Server running Microsoft Office failed. I decided to make a clean install and restore the previously prepared System State. The procedure ran smoothly and finished successfully. But when I tried to install the Microsoft Office suite, it started making fun of me. The installer was quite sure that MS Office was already installed and refused to continue. Well, this information was taken from the restored Registry.
All right, let’s remove Office from the Control Panel and install it once again? No way. The Windows Installer service protocol file was not present on the filesystem. The issue was that the Windows Installer folder was not included in the System State, the only system backup I had available.
I started from scratch. I formatted the disk and reinstalled the server, then reinstalled MS Office and restored the backup over it again. Guess what? It did not get any better. Later, trying to apply an Office Service Pack, I was fighting the same dragon again, this time with different error messages. And still, the worst problem was the lost user profiles with all their signatures, favorites and other per-user settings.
Back up the whole system drive, not only the System State. Yes, Microsoft has revised the System State contents: since Windows 2008/Vista, System State also includes the Windows and Program Files folders. Still, it’s not the right time to relax. Make sure you back up user profiles (if not roaming) and any software files stored outside the system folders.
Do not make copies of a Domain Controller using standard disk cloning software. If this is the only controller in the domain, multiple workstations may lose their Secure Channel to the domain after the clone is restored. Moreover, if the image was taken more than 30 days ago, all the workstations lose domain connectivity. The reason is that domain members change their machine account passwords every 30 days by default.
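The 30-day rule is simple date arithmetic, and it is easy to check before anyone restores an old image. Here is a small sketch, assuming the default password age quoted above; the constant and the function name image_is_risky are hypothetical.

```python
from datetime import date, timedelta

# Default machine account password age on domain members, as stated in the text.
# Assumption: the domain uses the default; adjust if your policy differs.
MACHINE_PASSWORD_MAX_AGE_DAYS = 30


def image_is_risky(image_taken, today, max_age_days=MACHINE_PASSWORD_MAX_AGE_DAYS):
    """Return True when the image is older than the password change interval."""
    return (today - image_taken) > timedelta(days=max_age_days)
```

An image taken on 1 October and restored on 7 December is 67 days old, so the check reports it as risky. One taken on 1 December is only 6 days old and passes.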
If the domain is served by multiple controllers, the consequences of restoring a controller from a disk clone are hard to recover from. For more information, see the Microsoft articles “How to detect and recover from a USN rollback in Windows Server 2003” (http://support.microsoft.com/kb/875495) and “Fixing Replication Lingering Object Problems” (http://technet.microsoft.com/en-us/library/cc738018(WS.10).aspx).
Automate backup creation by means of Task Scheduler. Any single copy made at some point in time may become useless, because almost no network remains static and unchanged. Everything changes: OS and business software updates get installed, user profiles live their own lives, and old data eventually becomes obsolete. In contrast, a robot performing scheduled backup jobs does not put its work off until later, tomorrow, next week and so on.
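As an illustration, a job that Task Scheduler runs every night could be as small as the sketch below. It only covers data that can be backed up by simple copying, and it assumes a zip archive is an acceptable format. The function name make_backup and the archive naming scheme are hypothetical.

```python
import shutil
from datetime import datetime
from pathlib import Path


def make_backup(source_dir, dest_dir, now=None):
    """Archive source_dir into a timestamped zip file inside dest_dir."""
    now = now or datetime.now()
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # The timestamp in the name keeps every run as a separate copy.
    base_name = dest / f"backup-{now:%Y%m%d-%H%M%S}"
    archive = shutil.make_archive(str(base_name), "zip", root_dir=str(source_dir))
    return Path(archive)
```

The destination should be outside the source folder, otherwise the archive may end up containing itself. The timestamped names give one file per run, so old copies need to be rotated separately.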
Make multiple copies. Keep at least three: a local one for quick access; a network one that stays available if the whole server burns out; and an offsite one (for example, kept at home) so that the business can continue even after a local disaster.
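Copying one archive to several places is easy to script. This sketch copies a file to each destination folder and compares SHA-256 checksums to confirm that every copy arrived intact; the function names sha256_of and distribute are hypothetical, and the destinations are whatever local, network and offsite paths you use.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def distribute(archive, destinations):
    """Copy archive into every destination folder; return {destination: ok}."""
    archive = Path(archive)
    expected = sha256_of(archive)
    results = {}
    for dest in destinations:
        try:
            Path(dest).mkdir(parents=True, exist_ok=True)
            copied = shutil.copy2(archive, dest)
            # A copy counts only if its checksum matches the original.
            results[str(dest)] = sha256_of(copied) == expected
        except OSError:
            # Unreachable share, full disk and similar errors mark the copy as failed.
            results[str(dest)] = False
    return results
```

Any False in the result means that copy cannot be trusted, so with three destinations you immediately see whether you really have three copies.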
Check the quality of backups. It may seem that the copying is going quite well, but issues like a disk running out of space or a loss of connectivity during network transfers may not be that obvious. The backup file is there and it seems to be big enough, but what’s inside? Let the backup script read the event log and send you an e-mail message.
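For archives the script creates itself, “what’s inside” can be checked directly. The sketch below assumes the backup is a zip file and looks at the archive only, not the event log; the names check_archive and build_report and the e-mail addresses are hypothetical. The report is built as a standard e-mail message, which Python’s smtplib can then send through your mail server.

```python
import zipfile
from email.message import EmailMessage
from pathlib import Path


def check_archive(path, min_bytes=1):
    """Return a list of problems found; an empty list means the archive looks sane."""
    path = Path(path)
    if not path.is_file():
        return [f"{path} is missing"]
    problems = []
    if path.stat().st_size < min_bytes:
        problems.append(f"{path} is smaller than {min_bytes} bytes")
    if not zipfile.is_zipfile(path):
        problems.append(f"{path} is not a readable zip archive")
        return problems
    with zipfile.ZipFile(path) as archive:
        if not archive.namelist():
            problems.append(f"{path} contains no files")
        bad_member = archive.testzip()  # reads every member and checks its CRC
        if bad_member is not None:
            problems.append(f"{path}: member {bad_member} is corrupted")
    return problems


def build_report(path, problems, sender, recipient):
    """Build the e-mail message that reports the result of the check."""
    msg = EmailMessage()
    status = "FAILED" if problems else "OK"
    msg["Subject"] = f"Backup check {status}: {Path(path).name}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(problems) if problems else "Archive opened and all checksums matched.")
    return msg
```

A check like this catches truncated and empty archives, but it does not replace a real test restore.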
3. Some nuances about server recovery.
It’s not a question of “if”, it’s a question of “when”. Sooner or later the system becomes unbootable, or someone accidentally deletes the wrong Organizational Unit in Active Directory, and the server needs to be restored from a backup: one that was created before the failure occurred, has been verified, and contains all the necessary data.
Restoring a server on new hardware (Windows 2000/2003).
It may happen that after moving the system disk, or after restoring its copy to new hardware, you get an unbootable server. This may look like automatic resetting during kernel initialization or a blue screen with the INACCESSIBLE_BOOT_DEVICE message.
If the server was restored using the NTBackup program, it is possible to fix this with Device Manager. After the restore is complete, do not reboot the system; stay in Safe Mode/Directory Services Restore Mode. Open the Device Manager console and update the ATA controller device driver to the Standard one and/or update the Computer driver to Standard PC.
If the server still does not start, boot the computer from the Windows installation CD and run the Repair procedure. Make sure the CD is the same Windows version as the installed one and has the same Service Pack integrated.
When Repair is complete, you may need to reinstall all the post-Service Pack updates, check device driver versions and verify the system user profiles (Default User, All Users).
Restoring Active Directory with multiple Domain Controllers.
Occasionally, some of the information stored in Active Directory may be deleted or tampered with; for instance, an important user account or a whole Organizational Unit may be lost. Such changes are replicated to all controllers in the domain very quickly.
A straightforward attempt to restore System State on a single controller will fail. Let’s look at how two controllers communicate. Say we have two DCs, Alpha and Bravo. On Tuesday, the Managers OU with everything it contained was accidentally deleted on Alpha. All the managers’ user accounts, groups and other subordinate objects were removed, and the replication has already taken place. Well, it’s not going to be that bad, we have Monday’s System State backup! Let’s reboot Alpha in Directory Services Restore Mode, restore System State and go back to normal mode.
Monday’s Alpha wakes up on Tuesday, yawns and checks its schedule. Oh, that was such a long sleep, and now it’s time to replicate! Let me contact my replication partner Bravo and ask him for news.
Alpha: Hello Mr. Bravo. The last time we were in contact was Monday. What’s new since then?
Bravo: Hello Mr. Alpha. I’ve got some news for you. While you were away, we decided to delete the Managers OU on Tuesday. Be so kind as to update your domain database, too.
Alpha: Roger that, performing deletion!
Oopsie… So how do we restore a unit that gets automatically deleted after the very first replication session? For such a restore to succeed, we perform a so-called Authoritative Restore. When the restore is done, we close the backup program but do not reboot into normal mode. Instead, we launch the ntdsutil command:
Start -> Run -> ntdsutil
C:\WINDOWS\system32\ntdsutil.exe: authoritative restore
authoritative restore: restore subtree OU=Managers,DC=MyCompany,DC=com
Note that you may need to perform additional actions to repopulate the group memberships of the restored accounts. You can find more information about ntdsutil command parameters and restoring deleted user accounts in the Microsoft Support Knowledge Base article “How to restore deleted user accounts and their group memberships in Active Directory” (http://support.microsoft.com/kb/840001).
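The conversation between Alpha and Bravo, and the reason an authoritative restore changes its outcome, can be modeled with a toy sketch. This is a deliberate simplification and not how Active Directory is implemented: real replication tracks USNs and per-attribute metadata, while here each object carries a single version number and the higher version wins. The Record class, the two functions and the bump value are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Record:
    version: int           # bumped on every change to the object
    deleted: bool = False  # a deletion is just another change


def replicate(local, partner):
    """Pull changes from partner: for each object the higher version wins."""
    for name, theirs in partner.items():
        mine = local.get(name)
        if mine is None or theirs.version > mine.version:
            local[name] = Record(theirs.version, theirs.deleted)


def authoritative_restore(db, name, bump=100_000):
    """Raise the version of a restored object so it outranks the partner's copy."""
    # Assumption: the bump size is arbitrary here; it only has to exceed the partner's version.
    db[name].version += bump


# Tuesday: Bravo already holds the deletion (version 2).
bravo = {"OU=Managers": Record(2, deleted=True)}

# Plain restore of Monday's backup on Alpha (version 1), then replication.
alpha = {"OU=Managers": Record(1)}
replicate(alpha, bravo)
plain_restore_lost_ou = alpha["OU=Managers"].deleted

# Authoritative restore: the restored object now wins in both directions.
alpha = {"OU=Managers": Record(1)}
authoritative_restore(alpha, "OU=Managers")
replicate(alpha, bravo)
replicate(bravo, alpha)
authoritative_kept_ou = not alpha["OU=Managers"].deleted and not bravo["OU=Managers"].deleted
```

In this model the plain restore loses the OU again on the first replication, while the authoritative restore keeps it on Alpha and pushes it back to Bravo.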
When performing computer network audits, I always ask systems administrators to show me the recovery plan and the backup itself, and to demonstrate their data and system restoration skills. Unfortunately, in most cases it is rather unlikely that they get a high mark on all items. Frankly, I am not ideal at managing backups myself, not even close, but it’s much better to set a goal and work towards it consistently than to leave things to chance. May the number of failures in your network equal the number of successful restores!