[Gb-ccb] Another idea for restoring onto a live root filesystem

Wed Nov 30 21:31:18 EST 2005

Regarding backing up, restoring and keeping all of the CCB
microdrives in sync.

It looks as though there may be a safe way of restoring a backup of a
root filesystem without having to attach a CD-ROM drive or boot from
the network.

When Linux boots, a filesystem image called the initrd (initial RAM
disk) is loaded into a temporary RAM disk. This image contains a
minimal root filesystem, which includes an initialization program.
That program uses the utilities in the minimal root filesystem until
it is ready to mount the real root filesystem. It then tells the
kernel to switch to using the real root filesystem.

What I am thinking of doing is creating a custom initrd image that
would do all of these steps except the last one, so that Linux would
continue to run from the RAM-disk version of the root filesystem. I
would try to include enough utilities in the RAM-disk root filesystem
to allow ssh connections, and to allow dump and restore to be run.
This would then enable safe backups and restores of the hard-disk
root filesystem over the network.
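
As a rough sketch of what the network side of this might look like,
the following Python builds the ssh command lines for pulling a dump
from, and pushing a restore to, a machine that is running from the
RAM disk. It assumes that an ssh server, dump and restore are all
present in the RAM-disk root filesystem. The host name "ccb1", the
device "/dev/hda1" and the mount point "/mnt/root" are made up for
illustration, as are the function names.

```python
import shlex
import subprocess


def dump_command(host, device, level=0):
    """Build the argv that dumps `device` on `host` to ssh's stdout."""
    # "-f -" tells dump to write the archive to standard output.
    remote = f"dump -{level} -f - {shlex.quote(device)}"
    return ["ssh", host, remote]


def restore_command(host, mount_point):
    """Build the argv that restores an archive read from ssh's stdin."""
    # restore -r rebuilds a filesystem in the current directory, so the
    # hard-disk root must already be mounted at `mount_point` (assumed).
    remote = f"cd {shlex.quote(mount_point)} && restore -r -f -"
    return ["ssh", host, remote]


def pull_backup(host, device, archive_path):
    """Run the dump remotely and save the archive to a local file."""
    with open(archive_path, "wb") as archive:
        subprocess.run(dump_command(host, device), stdout=archive, check=True)


def push_restore(host, mount_point, archive_path):
    """Feed a local archive to restore running on the remote machine."""
    with open(archive_path, "rb") as archive:
        subprocess.run(restore_command(host, mount_point), stdin=archive, check=True)
```

For example, pull_backup("ccb1", "/dev/hda1", "ccb1-root.dump") would
save a level 0 dump locally, and push_restore("ccb1", "/mnt/root",
"ccb1-root.dump") would write it back once the hard-disk partition
had been mounted from the RAM disk.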

I have found an example of creating a custom initrd image in the
"Preparing boot files" section of the following web page. In the case
of this web page, the custom initrd image is designed to be loaded
from the network. But I don't see any reason why it couldn't be loaded
from the hard disk instead.

  http://howtos.linux.com/howtos/Clone-HOWTO/index.shtml

FWIW, the introduction of the above web page includes the following
paragraph:

   "2.2. Why boot from a network

    Booting from hard disk would limit the possibilities of copying
    images. It wouldn't be possible, for instance, to safely copy to and
    from a partition mounted by the booted operating system."

Note that by "copying", the above paragraph is referring to copying
images of hard disk partitions.

My personal worries about restoring a backup onto a live root
filesystem include the following:

1. First of all the kernel has its own cache of what it thinks is in
    the root-filesystem, along with an ext3 journaling cache. However
    the /sbin/restore program bypasses the kernel, and writes directly
    to the underlying ext2 filesystem, including updating its
    metadata.  Thus, after running /sbin/restore, the kernel's view of
    the filesystem's contents and metadata won't match what is actually
    on the disk. This may not matter if the kernel doesn't attempt to
    write anything to the disk between the start of running
    /sbin/restore and the system being rebooted. However, the root
    filesystem has to be mounted read/write while the restore program
    is running, and even if we immediately remounted it read-only
    afterwards, to prevent the shutdown process from writing to it,
    the act of switching it back to read-only might have the
    side-effect of syncing cached data to the disk.

    If data were written to the disk, then it might well be written to
    the wrong place on the disk, and potentially trash either the
    contents of a file or directory, or the filesystem metadata. The
    system would thereafter either be completely unbootable, or worse,
    contain an unknown corrupt file that could cause occasional
    unexplained crashes or weird behavior.

    Thus, although we could try restoring a backup to a live
    filesystem, as suggested at the telecon, and not notice any
    problems, that wouldn't guarantee that some important file
    somewhere on the disk didn't get corrupted.

2. Less worrying, but potentially problematic, is the fact that a
    restore could overwrite a file that is read during shutdown, and
    thus cause the shutdown to hang. I don't believe that the
    /sbin/restore program preserves block assignments when it replaces
    a file (for example, people sometimes dump and then immediately
    restore a filesystem to defragment it, and this depends on the
    block reordering that restore performs). So the blocks previously
    assigned to a rewritten file might end up containing part of a
    completely different file, and if the kernel's cache of the
    filesystem hierarchy still pointed to the original start block of
    the file, then strange things could happen.

    This might be less problematic than the first issue above, since we
    could then power-cycle to recover the system. However, the
    power-cycle recovery would itself fail if the ext3 journal didn't
    match the disk, as per issue 1.
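
To make the first worry concrete, here is a toy model in Python. It
is not how the kernel or /sbin/restore actually work. It simply takes
the premise above as given (that the restore writes to the disk
without the kernel's cache knowing about it) and shows what a later
flush of a stale cached block would do. All of the class names, block
numbers and contents are invented for illustration.

```python
class ToyDisk:
    """A disk reduced to a list of block contents."""

    def __init__(self, nblocks):
        self.blocks = ["" for _ in range(nblocks)]


class ToyKernelCache:
    """A write-back cache: writes stay in memory until sync()."""

    def __init__(self, disk):
        self.disk = disk
        self.dirty = {}

    def write(self, blockno, data):
        # The write is only recorded in memory for now.
        self.dirty[blockno] = data

    def sync(self):
        # Flush every pending write to the block it was cached for.
        for blockno, data in self.dirty.items():
            self.disk.blocks[blockno] = data
        self.dirty.clear()


def demo():
    disk = ToyDisk(4)
    disk.blocks[2] = "old log data"
    cache = ToyKernelCache(disk)

    # The live system has a pending write to block 2 (say, a log file).
    cache.write(2, "new log data")

    # The restore, bypassing the cache, gives block 2 to another file.
    disk.blocks[2] = "restored /sbin/init"
    before_sync = disk.blocks[2]

    # Remounting read-only, or shutting down, flushes the stale block.
    cache.sync()
    after_sync = disk.blocks[2]
    return before_sync, after_sync
```

In this model demo() returns the restored contents before the sync
and the stale log data after it, which is the silent overwrite of a
restored file described in issue 1.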

I don't know whether these worries are paranoid or not, but our
sysadmins here certainly don't think that it would be advisable to
attempt to restore a backup onto a live root-filesystem.

Martin