This is part 1 of a three-part post:  Part 2, Part 3.

Recently I was talking with Mike Brumlow, one of R1Soft’s Linux device driver programmers, about file systems.  We talked about how, if we… meaning both the Windows and Linux communities… continue to use file systems that don’t guarantee consistency, then we need a way to detect problems while the system is online.

We all know that no one wants to run checkdisk or fsck periodically on their server, no matter how good an idea it is.  This is especially true as disk sizes continue to grow exponentially.  Yet we hear over and over that the conventional wisdom is to fsck or checkdisk your important volumes once every 3 or 6 months.

Who can deal with taking a server down for a 4- or 12-hour fsck?  Not me.

We also know that you cannot safely interrupt fsck once it starts… and if you have ever rebooted while an fsck was running… you know what a mistake that was!  Ouch, you may have toasted your data!

In the course of our discussion Mike and I came up with an easy way to run checkdisk or fsck to repair the file system inside a backup image using CDP 3.0 (more on that later).  Since we got onto file system integrity, I thought I would post some of the dirty details about what popular file systems actually do for data integrity.

File System Journaling

We often hear that file systems like NTFS and ext3, for example, have this thing called a file system journal.  In the case of the ext3 journal you can actually choose what level of data integrity you want (a very cool feature).

I know that this file system journal thing, in theory, keeps my data from being corrupted if there is a crash or other failure… assuming the hardware works flawlessly.  And with a file system journal we should, in theory, never have to run fsck on the file system at all.

With a file system journal, if there is a crash we just get an automatic journal replay and off we go… consistent in a flash… or at least within a couple of minutes.  The promise of never doing an fsck sounds great.
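
To make the replay idea concrete, here is a toy write-ahead sketch in C.  It is not ext3’s on-disk journal format… just the protocol: record the intent, apply the change, mark it committed, and let a restart replay or discard anything left half-done.  The file names are made up for the example.

    /* Toy write-ahead "journal" -- not ext3's real format, just the idea:
     * write the intent first, apply the change second, mark it done last. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void journal_append(const char *entry)
    {
        int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        write(fd, entry, strlen(entry));
        fsync(fd);                 /* the intent record must reach disk first */
        close(fd);
    }

    int main(void)
    {
        /* 1. Record what we are about to do. */
        journal_append("BEGIN: set balance=100\n");

        /* 2. Do the real update. */
        int fd = open("balance.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        write(fd, "100\n", 4);
        fsync(fd);
        close(fd);

        /* 3. Mark the transaction complete.  Crash before this line and a
         *    replay at the next mount re-applies (or throws away) the change,
         *    so the structure never ends up half-updated. */
        journal_append("COMMIT\n");

        printf("update journaled and committed\n");
        return 0;
    }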

Does it really work like that?  What does the file system journal really accomplish for us?

Linux ext3 Journal – The Fine Print

While a file system journal certainly sounds like the answer to our server’s data integrity needs in the event of a crash or power loss, let’s read the fine print on ext3 first (I will examine Windows NTFS in my next post).  The ext3 journal has several different modes and serves as an outstanding example.  It’s also one of the few file systems that even offers an option for protecting the integrity of actual file contents (more on that later).

I just Googled “ext3 journaling options” and here is the first link I found: http://www.linuxtopia.org/HowToGuides/ext3JournalingFilesystem.html

It starts out telling us why the ext3 journal is so great.

Wait… then it tells us there are three journaling modes for ext3 (you choose one with the data= mount option; see the sketch after the list):

  1. writeback – greater speed at the price of limited data integrity.  Only metadata is journaled; old data can show up in files after a crash, and the kernel’s standard writeback is relied on to flush data buffers.
  2. ordered – guarantees that the data is consistent with the file system; recently written files will never show up with garbage contents after a crash, at the cost of some speed.
  3. journal – journals all data, requiring more journal space and reducing performance.  The most secure data retention policy.
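
How do you pick a mode?  It is just a mount option.  Below is a minimal sketch using the mount(2) system call; the device and mount point are hypothetical and the program needs root.  The same data= string goes in /etc/fstab or on the mount command line.

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* Hypothetical device and mount point -- substitute your own. */
        const char *dev = "/dev/sdb1";
        const char *dir = "/mnt/data";

        /* The last argument carries file-system specific options; for ext3
         * this is where data=writeback, data=ordered or data=journal goes. */
        if (mount(dev, dir, "ext3", 0, "data=journal") != 0) {
            fprintf(stderr, "mount failed: %s\n", strerror(errno));
            return 1;
        }
        printf("%s mounted on %s with full data journaling\n", dev, dir);
        return 0;
    }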

Here is what that means.  Basically, unless you force the “journal” mode you still have a good chance of losing… and in writeback mode even corrupting… data in a crash.  Put another way: out of the box, ext3 will likely lose data that was being written at the moment of a crash or power loss.

What can happen is that you end up with blocks belonging to a file that are bogus or don’t contain the data you think they do!  Remember, the OS caches file writes in memory (at several layers) and, through many tunable and complex algorithms, flushes them to the physical disk more or less at its own whim (and there is a hardware write cache on the drive to consider as well).
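
Here is that caching in action, as a small C sketch (the file name is made up).  A “successful” write() only means the data reached the kernel’s page cache; it is fsync(), or the journal’s own commits, that pushes it toward the platter.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char msg[] = "critical customer record\n";

        int fd = open("important.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* write() returning success only means the data is in the kernel's
         * page cache.  Lose power right here and the bytes may never have
         * touched the disk at all. */
        if (write(fd, msg, sizeof(msg) - 1) < 0) { perror("write"); return 1; }

        /* fsync() blocks until the kernel has pushed the file's data and
         * metadata toward the disk -- and even then the drive's own write
         * cache can hold it a little longer unless barriers are in use. */
        if (fsync(fd) < 0) { perror("fsync"); return 1; }

        close(fd);
        return 0;
    }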

The default ext3 journal option is distribution and kernel specific.  In RedHat/CentOS land, Red Hat’s own ext3 whitepaper documents data=ordered as the default, with data=writeback offered as the faster, lower-integrity alternative… and metadata-only, writeback-style journaling is what most other journaling file systems do:
http://www.redhat.com/support/wpapers/redhat/ext3/

One mode, data=writeback, limits the data integrity guarantees, allowing old data to show up in files after a crash, for a potential increase in speed under some circumstances.

This mode, which is the default journaling mode for most journaling file systems, essentially provides the more limited data integrity guarantees of the ext2 file system and merely avoids the long file system check at boot time.

With writeback mode the file system structure itself will be consistent, or easily made consistent with a journal replay.  As for the contents of those files… roll the dice!  Ordered mode is better… you will not see stale garbage in recently written files… but it still journals only metadata, so file data that had not yet been flushed when the crash hit is simply gone.
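
If you are not sure what your own volumes are doing, you can peek at how each one is actually mounted.  Here is a small sketch that prints the ext3/ext4 lines from /proc/mounts; note that some kernels only show the data= option there when it was set explicitly, and dmesg also logs the data mode at mount time.

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *f = fopen("/proc/mounts", "r");
        char line[1024];

        if (!f) { perror("/proc/mounts"); return 1; }

        while (fgets(line, sizeof(line), f)) {
            /* Print only ext3/ext4 mounts; look for data=ordered,
             * data=writeback or data=journal in the option list. */
            if (strstr(line, " ext3 ") || strstr(line, " ext4 "))
                fputs(line, stdout);
        }
        fclose(f);
        return 0;
    }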

And by the way, from everything I can tell ext4 offers the same three data journaling modes as ext3, with essentially the same trade-offs.

Why doesn’t the world just default to the full journal mode and get real data integrity?  Has the world gone mad?  Well, no.  The reason is that data integrity is VERY expensive.  And it’s expensive in the most precious of precious system resources.

That resource is disk write performance.  It degrades disk write performance to have real data integrity.  Disks are extremely slow compared to CPU and memory, and they are usually slowest when they are forced to actually write something.
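
Here is a rough, unscientific sketch of just how expensive forced writes are: the same 4 KB writes, once left to the page cache and once with an fsync() after every write.  On a real disk the synced loop is typically orders of magnitude slower… that gap is the price full data journaling keeps asking you to pay.

    #include <fcntl.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    /* Time 1000 small writes, optionally forcing each one to disk. */
    static double run(const char *path, int force)
    {
        char block[4096] = {0};
        struct timespec t0, t1;
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0) { perror(path); return -1.0; }

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < 1000; i++) {
            write(fd, block, sizeof(block));
            if (force)
                fsync(fd);          /* push it out of the cache every time */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        close(fd);

        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        printf("buffered writes: %.3f s\n", run("buffered.dat", 0));
        printf("fsync'd writes:  %.3f s\n", run("synced.dat", 1));
        return 0;
    }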

Another good resource about ext3 journaling is:
http://www.ibm.com/developerworks/library/l-fs8.html

In the next post I’ll talk about the fine print of data integrity with NTFS, and maybe some other file systems out there like ReiserFS or XFS.