Linux Recovery: Whitepaper on recovering a Linux system


Recovering a Linux System

A dreadful time has come: a Linux system no longer boots properly, and we can no longer access the valuable data and services that our hitherto always-available Linux system used to provide. Whether this was caused by our own curiosity, a hardware failure, or a friend or co-worker "helping out" is in the past. This article will help you bring back the precious data or functionality of a Linux machine. Our intended audience is fledgling administrators, but we might have a trick or two for the old-school guru as well.

The first task is simple: determine what you want to bring back. Generally this means restoring the ability to get to a shell prompt, with all file systems mounted and network connectivity restored. This functionality is usually provided by runlevel 3. For more information on runlevels, read the man page for init, section 8 (shorthand is man:init(8)). The more specific the recovery goal, the simpler our planning. We'll walk through diagnosing the problem and some specific fixes, and refer to documentation for additional information.

In many instances, restoring a system to full and proper functioning is not required. If the jewel to be recovered is data in a directory or on a partition, booting another operating system (from floppy disk or hard drive) and copying the data may be all that is required for minimal system recovery. System recovery may mean recovering data, normal booting, network access, a specific application or user data. If your user data (/home) is on a separate partition, reinstalling Linux and losing the system configuration may be the fastest method of restoring user data. Since Linux offers more advanced recovery techniques than reinstalling the operating system, we'll walk through common disaster recovery techniques. Following are some of the more common disasters and how we recovered from them.

Determining the Problem and Possible Fixes

Now that we've decided what to recover, in this case booting into multi-user mode, we need to determine what the problem is. Our initial troubleshooting should ascertain where the problem lies: hardware or software, configuration or libraries, etc.

Boot Stages

Narrowing down the source of the problem is done in stages, as each stage of booting Linux is completed. Actual fixes depend entirely on your specific environment, distribution and problem. Please read the noted documentation and understand the techniques before applying them; do not apply them blindly.

Stage 1: LILO (Linux Loader)

Stage 2: Loading the kernel

Stage 3: Mounting the Disks

Stage 4: Startup Scripts

Stage 5: Runlevel Scripts

Stage 6: Providing a Login Prompt

LILO

After the BIOS screen, LILO should run. If LILO runs, we can safely assume that most of the hardware is functioning and that the Master Boot Record (MBR) is still loading. Each letter of "LILO" is printed as a different part of the loader completes. The most common problem is seeing "LI" and then nothing else. A common reason is another operating system, such as NT, having been installed on top of Linux. To resolve this problem, replace the broken MBR boot code (using the DOS/Windows command "fdisk /mbr" or "dd if=/dev/zero of=/dev/hda bs=446 count=1") and reinstall LILO (/sbin/lilo) into the MBR. Note that the dd command uses bs=446 rather than bs=512: the partition table lives in bytes 446-509 of the first sector, so zeroing all 512 bytes would erase it along with the boot code. A current boot disk is needed to run /sbin/lilo or dd.
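Before touching the MBR it is prudent to save a copy of it. The following is a minimal sketch, not a definitive procedure: the function names backup_mbr, clear_boot_code and restore_mbr are made up for illustration, and the device is whatever you pass in (for example /dev/hda). It relies only on the standard PC layout of 446 bytes of boot code, a 64-byte partition table and a 2-byte signature.

```shell
#!/bin/sh
# Hypothetical helper names; pass the real device (e.g. /dev/hda) yourself.

# Save the full 512-byte MBR (boot code + partition table + signature).
backup_mbr() {
    dev=$1; out=$2
    dd if="$dev" of="$out" bs=512 count=1 2>/dev/null
}

# Zero only the 446 bytes of boot code. The partition table (bytes 446-509)
# and the signature (bytes 510-511) are left untouched.
clear_boot_code() {
    dev=$1
    dd if=/dev/zero of="$dev" bs=446 count=1 conv=notrunc 2>/dev/null
}

# Write a previously saved MBR back to the device.
restore_mbr() {
    img=$1; dev=$2
    dd if="$img" of="$dev" bs=512 count=1 conv=notrunc 2>/dev/null
}
```

A typical sequence from a boot disk would be backup_mbr /dev/hda /mnt/floppy/mbr.bak, then clear_boot_code /dev/hda, then /sbin/lilo. Keeping the backup on removable media means restore_mbr is still available if the reinstall goes wrong.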

Other common problems for LILO include errors in /etc/lilo.conf, BIOS limitations in addressing IDE drives, installing Linux to a FAT partition and then running defrag within DOS, etc. Most problems can be fixed by verifying the options and mappings inside /etc/lilo.conf, running /sbin/lilo and rebooting. I always create a /boot directory containing the kernel, the bootstrap (boot.b) and System.map near the beginning of my disk to avoid cylinder (1024cyl) and size (>2GB) issues with the BIOS. When the kernel is located on a SCSI drive, put the keyword linear in lilo.conf. Basic documentation for LILO is provided in man:lilo(8), man:liloconfig(8) and man:lilo.conf(5), and extensive documentation is in /usr/share/doc/lilo/manual.gz.

Loading the Kernel

After LILO has run, it hands off control to the kernel image listed in lilo.conf. If the kernel image is corrupted, the symptoms vary from silence to core dumps. Anyone who recently upgraded the kernel should have made a boot disk with a functional kernel and kept the old kernel bootable under a different alias. Use the boot disk or boot the older kernel, and work on creating a new functioning kernel. If the original kernel is corrupted, you may find a copy of your kernel on the distribution CD or inside "/usr/src/linux".
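Keeping the old kernel bootable under its own alias amounts to a second image section in lilo.conf. The sketch below prints an illustrative configuration; the function name sample_lilo_conf, the device names (/dev/hda, /dev/hda1) and the kernel paths (/boot/vmlinuz, /boot/vmlinuz.old) are assumptions to be replaced with your own.

```shell
#!/bin/sh
# Print an illustrative lilo.conf. Devices and kernel paths are assumptions.
sample_lilo_conf() {
    cat <<'EOF'
boot=/dev/hda
map=/boot/map
install=/boot/boot.b
prompt
timeout=50
default=linux

# newly built kernel
image=/boot/vmlinuz
    label=linux
    root=/dev/hda1
    read-only

# previous, known-good kernel kept bootable under its own label
image=/boot/vmlinuz.old
    label=old
    root=/dev/hda1
    read-only
EOF
}
```

After adapting the output and installing it as /etc/lilo.conf, run /sbin/lilo so the map is rewritten. Typing "old" at the LILO prompt then boots the previous kernel if the new one fails.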

Problems loading hardware might also be listed during this time; the log is located in /var/log/dmesg. Additional or alternate kernel log locations are boot.log or kern.log depending on distribution. HOWTO documents and distribution specific kernel guides are your best references for creating kernels. Problems with a particular section of the kernel (advanced power management for example) are most directly addressed by the kernel mailing list. www.kernel.org is the Linux kernel homepage, with additional references and documentation available through web site links.

Mounting the Disks

The kernel loads and mounts the root filesystem; the remaining partitions listed in /etc/fstab are then mounted during startup. man:fstab(5) provides file format and directive information. All of the mount points must be accessible at boot time. If a mount fails, the system will prompt for the root password (configurable, and varies across distributions) and then boot to single user mode (runlevel 1). If a disk volume is not accessible during boot, it's time to use single user mode and comment out its mount line in /etc/fstab as a temporary fix.
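Commenting out an fstab line can be done in any editor, but when no editor is available a small helper does the job. This is a sketch under assumptions: the function name disable_fstab_entry is made up here, and it simply prefixes "#" to every uncommented line whose second field matches the given mount point, keeping a .bak copy next to the file.

```shell
#!/bin/sh
# Hypothetical helper: comment out the fstab entry for one mount point.
# Usage: disable_fstab_entry /data [path-to-fstab]
disable_fstab_entry() {
    mnt=$1
    fstab=${2:-/etc/fstab}
    # Keep a backup so the change is easy to undo.
    cp "$fstab" "$fstab.bak" || return 1
    awk -v mnt="$mnt" '
        $1 !~ /^#/ && $2 == mnt { print "#" $0; next }  # matching entry
        { print }                                       # everything else
    ' "$fstab.bak" > "$fstab"
}
```

In single user mode the root filesystem is often mounted read-only, so "mount -o remount,rw /" may be needed before the file can be changed. Restoring the entry later is a matter of copying the .bak file back.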

If a disk was not cleanly unmounted, fsck will check it for errors using automatic settings. These settings resolve simple problems, but more serious errors need to be fixed by running fsck manually in single user mode. fsck should only be run on unmounted filesystems. Read man:fsck(8) and man:e2fsck(8) for more information.

If the superblock has been corrupted, fsck can restore it from one of the backup superblocks. The location of the backup superblock depends on the filesystem's blocksize. For ext2 filesystems with 1k blocksizes, a backup superblock can be found at block 8193; for filesystems with 2k blocksizes, at block 16384; and for 4k blocksizes, at block 32768. To use one of these superblocks, run fsck -b [blocknumber] /dev/[partition]. For example, "fsck -b 8193 /dev/hda1" for the first partition on the first IDE drive with 1k blocksizes. Alternatively, fsck -B [blocksize] doesn't require any math, only knowing the block size used.
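The three block numbers above follow one rule: a block group covers 8 x blocksize blocks (one bitmap block, one bit per block), and the first backup superblock sits at the start of the second group. With 1k blocks the first group starts at block 1 rather than block 0, hence the extra 1. A small sketch, assuming the filesystem was created with the default blocks-per-group; the function name first_backup_superblock is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print the first backup superblock for an ext2
# filesystem, assuming the default of 8 * blocksize blocks per group.
first_backup_superblock() {
    bs=$1
    case $bs in
        1024)      echo $((8 * bs + 1)) ;;  # group 0 starts at block 1
        2048|4096) echo $((8 * bs)) ;;      # group 0 starts at block 0
        *) echo "unsupported block size: $bs" >&2; return 1 ;;
    esac
}
```

For example, fsck -b "$(first_backup_superblock 4096)" /dev/hda1 would try the first backup on a 4k filesystem. If the filesystem was created with non-default parameters the numbers will differ; where the primary superblock is still readable, dumpe2fs lists the actual backup locations.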

When partitioning utilities from several different operating systems are used on the same disk, an inconsistency may develop regarding the boundaries of the partitions. The starting and ending blocks will be listed twice when fdisk displays the partition table. This is rare, but does happen on multiboot systems. My recommendation is either to always use a third-party utility that consistently understands all the filesystems in use, including every version of each (NT with SP3, SP4 and SP5 and Windows 2000 all have different versions of NTFS, for example), or to use utilities like fdisk, cfdisk, sfdisk, Disk Druid, etc. under one operating system only. Spoon-feeding an operating system a blank partition of a certain type, or a formatted filesystem, generally works, although some OSes require blank space. The last time I had this issue, I used a popular third-party partitioning program and it resolved the problem.

Startup Scripts

Startup scripts are very distribution specific. Most distributions place the scripts in /etc/ or /etc/rc.d/, with the startup scripts in either rc.boot or rcS.d. In addition to the actual mounting, hardware detection occurs, networking is configured, the hostname is set, clocks are started, the portmapper is launched and console settings are declared. This is happy stuff that we need for the next section. If one of these scripts fails, find out what hardware it depends on (isapnp failed for the sound card? Then maybe you have a sound card hardware issue or a kernel module issue). Complaints and error messages may be displayed, but few problems will stop the boot sequence. Research the script in use and look for its man pages. Try man -k "scriptname" if you can't find documentation.

Runlevel Scripts

After executing the initial startup scripts, either runlevel 3 (multiuser shell) or 5 (X Window) will be invoked, depending on the configuration in inittab. Read man:inittab(5) for the differences between runlevels and initial configurations. PCMCIA is started along with the user, system and network daemons. Any number of these can fail, depending on configuration. Both Apache and Sendmail, for example, will hang without a proper hostname. The failing daemon should be reported on the screen and in the normal messages log. Check the software documentation or turn off the daemon in question. chkconfig --list will display runlevels and daemons under Red Hat and Mandrake. Linuxconf has a module for controlling the behavior of different services. Debian startup scripts should be managed using update-rc.d in accordance with the Debian policy manual.
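To see which runlevel the system is configured to enter, look for the initdefault line in inittab (its fields are id:runlevels:action:process). A minimal sketch, with the made-up function name default_runlevel and the file path as an optional argument:

```shell
#!/bin/sh
# Hypothetical helper: print the default runlevel from an inittab file.
# Usage: default_runlevel [path-to-inittab]
default_runlevel() {
    inittab=${1:-/etc/inittab}
    # Fields are id:runlevels:action:process; skip commented lines.
    awk -F: '$1 !~ /^#/ && $3 == "initdefault" { print $2; exit }' "$inittab"
}
```

If a machine set to runlevel 5 hangs while starting X, temporarily changing that line to runlevel 3 is a common way to get back to a text login while the problem is investigated.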

Login Prompt

The login prompt relies on a subsystem and a set of programs, in addition to the username and password validation. If there is a problem with logging in, common causes include caps lock, forgotten passwords (recover the root password from single user mode), trojan login scripts and rootkits. Most trojans and rootkits try to convince users that the system was not compromised by providing a moderately consistent user experience. If "strange" behavior is noticed, validate packages using rpm -V, or validate files using a file verification program like tripwire. www.securityfocus.com lists extensive security-related resources and provides information about rootkits, trojans and utilities like tripwire.

Recovery Tools of the Trade

Single User Mode

Single User Mode is extremely useful for working on a sick system. Minimal system configuration is loaded, most file systems aren't mounted, and the Bourne shell, sh, is started without any profile and with only a basic environment. Single user mode can be accessed in several ways. When selecting an image in LILO, single user mode is one of the options that can be specified, e.g. "Linux single". Additionally, single user mode can be entered from multiuser mode with "telinit 1", and opportunities to access single user mode are offered when critical services fail.

Boot Disk

Making a boot disk can be done with utilities or by copying the boot.img file from the Red Hat CD. Red Hat and Mandrake offer mkbootdisk for creating up-to-date bootable floppies. Create a boot disk with the current kernel using the command "mkbootdisk `uname -r`". Debian offers mkrboot, which creates a kernel and root image bootable from a single floppy. Most distributions also offer bootable floppy images that need to be copied from the distribution CD to a floppy disk. The dd command "dd if=[path]/boot.img of=/dev/fd0" will copy the boot image to floppy, bit by bit. This disk will be bootable, but will not contain a current kernel or drive map.
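Floppies fail quietly, so it is worth reading the image back after writing it. The sketch below wraps the dd command in a made-up function, write_boot_image, that copies the image and then compares the written bytes against the source; the image path and device are whatever you pass in.

```shell
#!/bin/sh
# Hypothetical helper: write a boot image to a device and verify the copy.
# Usage: write_boot_image [path]/boot.img /dev/fd0
write_boot_image() {
    img=$1; dev=$2
    dd if="$img" of="$dev" bs=512 conv=notrunc 2>/dev/null || return 1
    sync
    # Read back exactly as many bytes as the image holds and compare.
    size=$(wc -c < "$img" | tr -d ' ')
    head -c "$size" "$dev" | cmp - "$img"
}
```

The function returns non-zero if either the write or the comparison fails, which is the cue to try another disk before relying on this one in an emergency.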

Rescue Disk

With versions of Red Hat prior to 6.2, a rescue image was also provided, containing basic file utilities to restore the system. Its text editors and disk utilities provide a basic level of support for restoring the system to a bootable state.

Utilities

Some Linux partition utilities are gpart and parted which offer partition tools and recovery. I have used the commercial product, Partition Magic, successfully in the past to correct many filesystem and partition problems.

When all else fails...

After the utilities and recovery modes have failed, or when the problem is too complex to resolve through simple means (like the time I deleted over half of the software packages on my system, including all the text editors), reloading the operating system will provide the most direct means of recovering the system. User data may be preserved, but system configurations will be lost. Setup varies according to distribution and configuration. Red Hat's "Server" installation wipes all partitions, while the "Workstation" installation preserves some partitions and installs into blank space. "Custom" grants the most control and should be used for reinstalls. Debian installation is user controlled and very flexible, and partitions are mounted and assigned manually.

Linux Recovery

Linux is a very flexible operating system, with granular control. This control is demonstrated through the recovery methods available. Another great feature of Linux is the support available from the community and the documentation available. Contact Linuxcertified.com for comments, suggestions or training classes in becoming a more effective system administrator.

About the Author: Ian Smith is a trainer at Linuxcertified.com as well as a consultant in Silicon Valley. Ian also worked on creating a backup appliance with user control, versioning, network backups and easy file restores.