A dreadful time has come: a Linux system no longer
boots up properly. Whether this was caused by our own curiosity,
hardware failure, or a friend or co-worker "helping out" is in the past.
We can no longer access the valuable data and services
that our hitherto always-available Linux system used to provide.
This article will help you bring back the precious data or functionality
of a Linux machine. Our intended audience is fledgling administrators, but we might
have a trick or two for the old-school guru as well.
The first task is simple: determine what you want
to bring back. Generally this means restoring the ability to get
to a shell prompt with all file systems mounted and network connectivity
restored. This functionality is usually provided by runlevel
3. For more information on runlevels, read the man page for init,
section 8 (shorthand is man:init(8)). The more specific the
recovery goal, the simpler our planning. We'll walk through diagnosing
the problem, some specific fixes, and refer to documentation for
Recovering any system to full and proper functioning
is not required in many instances. If the jewel to be recovered
is data in a directory or on a partition, booting to another operating
system (on floppy disk or hard drive) and copying the data may be
all that is required for minimal system recovery. System recovery
may mean recovering data, normal booting, network access, a specific
application or user data. If your user data (/home) is on
a separate partition, reinstalling Linux and losing the system configuration
may be the fastest method of restoring user data. Since Linux has
more advanced techniques for system recovery beyond reinstalling
the operating system, we'll walk through common disaster recovery
techniques. Following are some of the more common disasters and
how we recovered from them.
Determining the Problem and Possible Fixes
Now that we've decided what to recover, in this
case booting into multi-user mode, we need to determine what the
problem is. Our initial troubleshooting practices should ascertain
where the problem is: hardware or software, configuration or libraries,
Narrowing down the source of the problem is done
in stages, as each stage of booting Linux is completed. Actual fixes
depend entirely on your specific environment, distribution and problem.
Please read the noted documentation and understand the techniques
before applying them blindly.
Stage 1: LILO (Linux Loader)
Stage 2: Loading the Kernel
Stage 3: Mounting the Disks
Stage 4: Startup Scripts
Stage 5: Runlevel Scripts
Stage 6: Providing a Login Prompt
After the BIOS screen, LILO should run. If LILO
runs, we can safely assume that most of the hardware is functioning
and that the Master Boot Record, or MBR, is still loading. Each of
the letters in LILO signifies that a different part of LILO has loaded.
The most common problem is seeing "LI" and then nothing else. A
common reason is another operating system like NT loaded on top
of Linux. To resolve this problem you need to wipe out the broken
boot code in the MBR (using Windows "fdisk /mbr" or "dd if=/dev/zero of=/dev/hda
bs=446 count=1"; a block size of 512 would also erase the partition
table stored at the end of the same sector) and reinstall LILO (/sbin/lilo) back
into the MBR. A current boot disk is needed to run /sbin/lilo.
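
The MBR step above can be sketched as a small shell function. This is only an illustration under stated assumptions: wipe_boot_code is a hypothetical helper name, and it relies on the standard PC layout in which the first 446 bytes of the sector hold boot code and the partition table follows. It saves the whole sector first so the change can be undone.

```shell
#!/bin/sh
# Sketch: back up the first sector of a disk, then zero only the 446 bytes
# of boot code so the partition table (bytes 446-509) and the boot
# signature (bytes 510-511) are left alone.
# wipe_boot_code is an illustrative name, not part of LILO.
wipe_boot_code() {
    disk=$1      # e.g. /dev/hda, or a disk image file when practising
    backup=$2    # where to keep a copy of the original sector

    # Keep a copy of the whole 512-byte sector first.
    dd if="$disk" of="$backup" bs=512 count=1 2>/dev/null || return 1

    # Zero the boot code only. conv=notrunc stops dd from truncating
    # the target when it is a regular file such as a disk image.
    dd if=/dev/zero of="$disk" bs=446 count=1 conv=notrunc 2>/dev/null
}
```

Try it on a disk image before touching a real drive. If the result is worse than before, "dd if=[backup] of=/dev/hda bs=512 count=1" puts the saved sector back.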
Other common problems for LILO include: errors
in /etc/lilo.conf, limitations in the BIOS pointing to IDE
drives, installing Linux to a FAT partition and running defrag within
DOS, etc. Most problems can be fixed by verifying the options and
mappings inside /etc/lilo.conf, running /sbin/lilo
and rebooting. I always create a /boot directory with the
kernel, bootstrap (boot.b) and System.map near the
beginning of my disk to avoid cylinder (1024cyl) and size (>2GB)
issues with the BIOS. When the kernel is located on a SCSI drive,
put the keyword linear in lilo.conf. Basic documentation
for LILO is provided in man:lilo(8), man:liloconfig(8) and
man:lilo.conf(5) and extensive documentation for LILO is
Loading the Kernel
After LILO has run, it hands off control to
the kernel image listed in the lilo.conf. If the kernel image
is corrupted, messages vary from silence to core dumps. If someone
recently upgraded the kernel, they should have made a boot disk with
a functional kernel and kept the old kernel bootable under a different
alias. Use the boot disk or boot to the older kernel and work on creating
a new functioning kernel. If the original kernel is corrupted, you
may find a copy of your kernel on the distribution CD or inside "/usr/src/linux".
Problems loading hardware might also be listed
during this time; the log is located in /var/log/dmesg. Additional
or alternate kernel log locations are boot.log or kern.log
depending on distribution. HOWTO documents and distribution specific
kernel guides are your best references for creating kernels. Problems
with a particular section of the kernel (advanced power management
for example) are most directly addressed by the kernel mailing list.
www.kernel.org is the Linux
kernel homepage, with additional references and documentation available
through web site links.
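
When hunting for hardware problems in the kernel log mentioned above, a quick filter can save some scrolling. The following is a minimal sketch; scan_boot_log is a hypothetical helper, and the keyword list is an assumption that will need adjusting for your hardware and kernel version.

```shell
#!/bin/sh
# Sketch: print the lines of a saved kernel log that look like failures,
# prefixed with their line numbers.
# The keywords below are a guess at common failure wording, not a complete list.
scan_boot_log() {
    log=${1:-/var/log/dmesg}
    grep -n -i -E 'error|fail|panic|timeout' "$log"
}
```

Called without arguments it reads /var/log/dmesg; pass another path, such as a kern.log, to scan that file instead.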
Mounting the Disks
The kernel loads, then the partitions listed
in /etc/fstab are mounted. man:fstab(5) provides file format and directive
information. All of the mount points must be accessible during boot
time. If a mount fails, the system will prompt for the root password
(configurable and varies across distributions) and then boot to
single user mode (runlevel 1). If a disk volume is not accessible
during boot, it's time to use single user mode and comment out the
mount line in /etc/fstab as a temporary fix.
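
Commenting out the mount line can be done in any editor, but if no editor is available a short shell function will do. This is a sketch only; disable_fstab_entry is a hypothetical name, and it assumes the usual fstab layout in which the second whitespace-separated field is the mount point.

```shell
#!/bin/sh
# Sketch: comment out the fstab entry for one mount point, keeping a backup
# copy next to the original file.
# disable_fstab_entry is an illustrative helper, not a standard tool.
disable_fstab_entry() {
    mountpoint=$1
    fstab=${2:-/etc/fstab}

    cp "$fstab" "$fstab.bak" || return 1

    # Field 2 of an fstab line is the mount point. Lines that are already
    # comments are passed through unchanged.
    awk -v mp="$mountpoint" '
        $1 !~ /^#/ && $2 == mp { print "#" $0; next }
        { print }
    ' "$fstab.bak" > "$fstab"
}
```

For example, "disable_fstab_entry /data" would disable a hypothetical /data volume. In single user mode the root filesystem may be mounted read-only; "mount -o remount,rw /" makes it writable first.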
If a disk was not cleanly unmounted, fsck
will check for errors with automatic settings. These settings will
resolve simple problems but actual errors need to be fixed by running
fsck manually in Single User mode. Fsck should only be run
on unmounted filesystems. Read man:fsck(8) and man:e2fsck(8)
for more information.
If the superblock has been corrupted, fsck
can restore one of the backup superblocks. The location of the backup
superblock is dependent on the filesystem's blocksize. For ext2
filesystems with 1k blocksizes, a backup superblock can be found
at block 8193; for filesystems with 2k blocksizes, at block 16384;
and for 4k blocksizes, at block 32768. To use one of these superblocks,
run fsck -b [blocknumber] /dev/[partition]. For example,
"fsck -b 8193 /dev/hda1" for the first partition of the first IDE drive with 1k blocksizes.
fsck -B [blocksize] doesn't require any math, only knowing the
block size used.
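
The block numbers quoted above follow from the ext2 layout, and the arithmetic can be sketched in shell. The function name first_backup_superblock is hypothetical, and the calculation assumes the filesystem was created with the default of 8 x blocksize blocks per group; a filesystem made with other settings will keep its backups elsewhere.

```shell
#!/bin/sh
# Sketch: compute the first backup superblock of an ext2 filesystem,
# assuming the default of 8 * blocksize blocks per block group.
# first_backup_superblock is an illustrative name.
first_backup_superblock() {
    blocksize=$1                          # in bytes: 1024, 2048 or 4096
    blocks_per_group=$((blocksize * 8))

    # With 1k blocks the primary superblock occupies block 1 rather than
    # block 0, so every group (and its backup) starts one block later.
    if [ "$blocksize" -eq 1024 ]; then
        echo $((blocks_per_group + 1))
    else
        echo "$blocks_per_group"
    fi
}
```

A possible use, with /dev/hda1 standing in for the damaged partition: fsck -b "$(first_backup_superblock 1024)" /dev/hda1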
When using partitioning utilities from several
different operating systems, an inconsistency may develop regarding
the boundaries of the partitions. The starting and ending blocks
will be listed twice when fdisk displays the partition table. This
is rare, but does happen on multiboot systems. My recommendation
is to always use a third-party utility that consistently understands
all the filesystems used, down to the specific versions of those
filesystems (NT with SP3, SP4 and SP5 and Windows 2000 all have
different versions of NTFS, for example), or to use utilities like fdisk,
cfdisk, sfdisk, Disk Druid, etc. under one operating system only. Spoon-feeding
an operating system a blank partition of a certain type,
or a formatted filesystem, generally works; some OSes require blank
space. The last time I had this issue, I used a popular third-party
partitioning program and it resolved the issue.
Startup scripts are very distribution specific.
Most distributions place the scripts in /etc/ or /etc/rc.d/
with the startup scripts in either rc.boot or rcS.d.
In addition to the actual mounting, hardware detection occurs, networking
is configured, hostname specified, clocks started, portmaps rendered
and console settings declared. This is happy stuff that we need for
the next section. If one of these scripts fails, find out what hardware
it depends on (isapnp for the sound card failed? Then maybe you have a
sound card hardware issue or a kernel module issue). Complaints
or error messages may be displayed, but few problems will stop the boot sequence.
Research the script in use and look for man pages for these scripts.
Try man -k "scriptname" if you can't find documentation.
After executing the initial startup scripts,
either runlevel 3 (multiuser shell) or 5 (X Window) will be invoked,
depending on the configuration in inittab. Read man:inittab(5)
for differences between run levels and initial configurations. PCMCIA
is started along with the user, system and network daemons. Any number
of these can fail, depending on configuration. Both Apache and Sendmail,
for example, will hang without a proper hostname. The failing daemon should
be reported on the screen and in the normal messages log. Check the
software documentation or turn off the daemon in question. chkconfig
--list will display run levels and daemons under Redhat and Mandrake.
Linuxconf has a module for controlling the behavior of different services.
Debian startup scripts should be managed using update-rc.d
in accordance with the Debian Policy Manual.
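
On a system where none of those tools will run, the runlevel directory itself shows what is started. The sketch below assumes the conventional SysV layout, in which start links are named S, a two-digit priority, then the service name; list_runlevel_services is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: print the services a SysV runlevel directory starts, in the order
# the shell sorts the links (normally the start order).
# Start links look like S80sendmail; K links (kill scripts) are ignored.
list_runlevel_services() {
    rcdir=${1:-/etc/rc.d/rc3.d}    # Debian keeps these in /etc/rc3.d instead
    for link in "$rcdir"/S[0-9][0-9]*; do
        # Skip the literal pattern when nothing matched.
        [ -e "$link" ] || [ -L "$link" ] || continue
        basename "$link" | sed 's/^S[0-9][0-9]//'
    done
}
```

The last name printed before the boot hangs is a reasonable first suspect.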
The login prompt relies on a subsystem and a set
of programs, in addition to the username and password validation.
Common login problems include caps
lock, forgotten passwords (recover the root password from single user
mode), trojan login scripts and rootkits. Most trojans and rootkits
try to convince users that the system was not compromised by providing
a moderately consistent user experience. Validate packages using
rpm -V, or files using a file verification program like tripwire,
if "strange" behavior is noticed. www.securityfocus.com
lists extensive security related resources, and provides information
about rootkits, trojans and utilities like tripwire.
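
The idea behind file verification can be sketched with plain checksums. To be clear, this is not tripwire and not rpm -V; it is a much simpler stand-in built on the GNU sha256sum utility, and make_baseline and check_baseline are hypothetical names. A baseline is recorded while the system is trusted and compared later.

```shell
#!/bin/sh
# Sketch: record checksums of every file under a directory, then verify them.
# This illustrates the principle only; it is not a replacement for tripwire.
make_baseline() {
    dir=$1          # directory to fingerprint, e.g. /bin
    baseline=$2     # output file; keep it outside the directory being scanned
    find "$dir" -type f -exec sha256sum {} + > "$baseline"
}

check_baseline() {
    baseline=$1
    # --quiet suppresses the "OK" lines, so only mismatches are printed.
    # The exit status is non-zero if any file fails verification.
    sha256sum --quiet -c "$baseline"
}
```

A baseline stored on the suspect machine can be altered along with the files it describes, so keep it, and ideally the checksum program too, on read-only or removable media.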
Recovery Tools of the Trade
Single User Mode
Single User Mode is extremely useful for working
on a sick system. Minimal system configuration is loaded, most file
systems aren't mounted, and the Bourne shell, sh, is loaded
without any profile and with only a basic environment. Single user mode can
be accessed several ways. While selecting an image in LILO, single
user mode is one of the options that can be specified, e.g. "Linux
single". Additionally, single user mode can be entered from multiuser
mode with "telinit 1", and opportunities to access single user
mode are offered when critical services fail.
Making a boot disk can be done with utilities
or by copying the boot.img file from the Redhat CD. Redhat
and Mandrake offer mkbootdisk for creating up-to-date bootable
floppies. Create a boot disk with the current kernel with the command
"mkbootdisk `uname -r`". Debian offers mkrboot, which
creates a kernel and root image bootable from a single floppy. Most
distributions also offer bootable floppy images that need to be
copied from the distribution CD to a floppy disk. The dd command
"dd if=[path]/boot.img of=/dev/fd0" will copy the boot image
to floppy, bit by bit. This disk will be bootable, but will not
contain a current kernel or drive map.
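
Floppies are unreliable media, so it is worth reading the image back after writing it. The following sketch wraps the dd command above with a comparison step; write_boot_image is a hypothetical helper, and it assumes a cmp that supports the -n byte limit, as GNU cmp does.

```shell
#!/bin/sh
# Sketch: copy a boot image to a floppy (or any target) and read it back
# to confirm that the copy matches the image.
# write_boot_image is an illustrative name.
write_boot_image() {
    image=$1
    target=${2:-/dev/fd0}

    dd if="$image" of="$target" bs=1024 2>/dev/null || return 1
    sync    # flush buffered writes before reading back

    # Compare only as many bytes as the image holds, since the floppy
    # device may be larger than the image file.
    size=$(wc -c < "$image" | tr -d ' ')
    cmp -n "$size" "$image" "$target"
}
```

A non-zero exit status means either the write failed or the read-back differed, and the disk should not be trusted in an emergency.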
With versions of Redhat prior to 6.2, a rescue
image was also provided, containing basic file utilities to restore
the system. Text editors and disk utilities provide a basic level
of support to restore the system to a bootable state.
Some Linux partition utilities are gpart
and parted which offer partition tools and recovery. I have
used the commercial product, Partition Magic, successfully in the
past to correct many filesystem and partition problems.
When all else fails...
After the utilities and recovery modes have failed,
or the problem is too complex to resolve through simple means (like
the time I deleted over half of the software packages on my system
including all the text editors) then reloading the operating system
will provide the most direct means of recovering the system. User
data may be preserved, but system configurations will be lost. Setup
varies according to distribution and configuration. Redhat's "Server"
installation wipes all partitions; the "Workstation" installation preserves
some partitions and installs into blank space. "Custom" grants the
most control and should be used for reinstalls. Debian installation
is user controlled and very flexible, and partitions are mounted
and assigned manually.
Linux is a very flexible operating system, with
granular control. This control is demonstrated through the recovery
methods available. Another great feature of Linux is the support
available from the community and the documentation available. Contact
Linuxcertified.com for comments, suggestions or training classes
in becoming a more effective system administrator.
About the Author: Ian Smith is a trainer
at Linuxcertified.com as well as a consultant in Silicon Valley.
Ian also worked on creating a backup appliance with user control,
versioning, network backups and easy file restores.