DO NOT USE REISERFS IF YOU VALUE DATA INTEGRITY.

Here's a piece of mail I wrote summarizing why not:

I have an 80K file of why Reiserfs is evil and should not be used.
Really, I should stick it on the web.  Hans told me personally
(online, so I have it quoted) that RSFS is optimized for speed,
-not- correctness.

(This started when I discovered that having the machine reset or lose
power without syncing the FS would put random bits of open files in
each other, so, e.g., my wtmp had four lines of the XFree86-config
file I was editing at the time, and X wouldn't start 'cause the config
file had a piece of yet a -third- file smashed into the middle of it.
Reiser said that this is -correct- behavior---who cares if the -data-
is wrong if the filesystem -structure- is correct?)

-I-, on the other hand, say that a filesystem that corrupts data and
then -claims to be 100% intact- (e.g., fsck says it's fine) is no file
system at all, since there's no way to even -tell- things are trashed
without checking every file with your backups.  I threw RSFS out the
window, am using ext3fs, and never looked back.

...and here's the documentation I refer to above. Note that the email addresses have been sanitized as a spam-prevention measure, and some of the messages have been omitted altogether if I wasn't sure they were public.

LATE BREAKING NEWS: See also for even more goodies on why you shouldn't ever be trusting your data to ReiserFS...

- - - Begin forwarded messages - - -

Date: Sat, 22 Sep 2001 06:00:43 -0400 (EDT)
From: foner-reiserfs@med
To: linux-kernel@vge
Subject: ReiserFS data corruption in very simple configuration

[Please CC me on any replies; I'm not on linux-kernel.]

The ReiserFS that comes with both Mandrake 7.2 and 8.0 has
demonstrated a serious data corruption problem, and I'd like
to know (a) if anyone else has seen this, (b) how to avoid it,
and (c) how to determine how badly I've been bitten.

My configuration in each case has been an AMD CPU running ReiserFS
exactly as configured "out of the box" by running the Mandrake 7.2 or
8.0 installation CD and opting to run ReiserFS instead of the default.
This is a uniprocessor machine with one IDE 80GB Maxtor disk---no RAID
or anything fancy like that.  The hardware itself is rock solid and
has never demonstrated any faults at all.  (MDK 8.0 appears to use
RSFS 3.6.25; I'm no longer running MDK 7.2, so I can't check that.)
The machine had barely been used before each corruption problem; I'm
not running some strange root-priv stuff, and each time, the FS hadn't
had more than a few minutes to a few hours of use since being created.

In each case, I've gotten in trouble by editing my XF86Config-4 file,
guessing wrong on a modeline, hanging X (blank gray screen & no
response to anything), and being forced to hit the reset button
because nothing else worked.  Under 7.2, I discovered that my
XF86Config-4 file suddenly had a block of nulls in it.  That time, I
thought I must have been hallucinating, but I ran a background job to
sync the filesystem every second while continuing to debug the X
problems, and didn't see the corruption again.
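
A once-a-second sync job of the kind described above might look like the following minimal sketch. The original message does not say how the job was written, so the function name `sync_periodically` and its parameters are assumptions made for illustration; `os.sync()` is the standard-library call (Unix, Python 3.3+) that asks the kernel to flush dirty buffers.

```python
import os
import time


def sync_periodically(interval_seconds=1.0, iterations=None):
    """Call os.sync() repeatedly, sleeping interval_seconds between calls.

    Runs forever when iterations is None; otherwise stops after that many
    syncs and returns the number performed. (Hypothetical helper.)
    """
    count = 0
    while iterations is None or count < iterations:
        os.sync()  # ask the kernel to write dirty buffers out to disk
        count += 1
        time.sleep(interval_seconds)
    return count
```

Calling `sync_periodically()` with no arguments loops until interrupted. This only narrows the window in which unsynced data can be lost; it does not change what the filesystem guarantees after a crash.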

Now, I was just bitten by the -same- behavior under MDK 8.0.  After
accidentally hanging X, I waited a few seconds just in case a sync was
pending, hit reset, and had all sorts of lossage:
 (1) Parts of the XF86Config-4 file had lines garbled, e.g.,
     sections of the file had apparently been rearranged.
 (2) /var/log/XFree86.0.log was truncated, and maybe garbled.
 (3) Logging in as root was fine, but then logging in as myself
     I got "Last login: <4-5 lines of my XFree86.0.log file (!)>"
     instead of a date!  Logging in again gave me the proper
     last-login time, but clearly wtmp or something else had
     gotten stepped on in some weird way. 
Obviously, the behavior I saw once under MDK 7.2 was no hallucination
or accidental yank in Emacs.

I thought the whole point of a journalling file system was to
-prevent- corruption due to an unexpected failure!  This seems to be
-far- worse than a normal filesystem---ext2fs would at least choke and
force fsck to be run, which might actually fix the problem, but this
is ridiculous---it just silently trashes random files.

So I now have possibly-undetected filesystem damage.  My -guess- is
that only files written within a few minutes of the reset are likely
to be affected, but I really don't know, and don't know of a good way
to find out.  Must I reinstall the OS -again-, starting from a blank
partition, to be sure?  Maybe I should just give up on ReiserFS completely.

[If there is a more-appropriate place for me to send this---such as 
a particular Mandrake list, or a particular ReiserFS list---please let
me know, particularly if I can get a quick answer -without- going
through the overhead of subscribing to the list, being flooded, and
unsubscribing---that's what archives are for.  Some websearching
for "ReiserFS corruption" yielded -thousands- of hits---not a good
sign---and a very large proportion of them were on this list, so I
figure this is as good a place to ask as any.  Thanks again.]

- - - Separator between forwarded messages - - -

Date: Sat, 22 Sep 2001 16:47:31 +0400
From: Nikita Danilov <Nikita@Nam>
To: foner-reiserfs@med
Subject: Re: ReiserFS data corruption in very simple configuration
Cc: linux-kernel@vge,
       Reiserfs mail-list <Reiserfs-List@Nam>
References: <200109221000.GAA11263@out-of-band.media.mit.edu>

foner-reiserfs@med writes:
> [Please CC me on any replies; I'm not on linux-kernel.]
> 
> The ReiserFS that comes with both Mandrake 7.2 and 8.0 has
> demonstrated a serious data corruption problem, and I'd like
> to know (a) if anyone else has seen this, (b) how to avoid it,
> and (c) how to determine how badly I've been bitten.
> 
> My configuration in each case has been an AMD CPU running ReiserFS
> exactly as configured "out of the box" by running the Mandrake 7.2 or
> 8.0 installation CD and opting to run ReiserFS instead of the default.
> This is a uniprocessor machine with one IDE 80GB Maxtor disk---no RAID
> or anything fancy like that.  The hardware itself is rock solid and
> has never demonstrated any faults at all.  (MDK 8.0 appears to use
> RSFS 3.6.25; I'm no longer running MDK 7.2, so I can't check that.)
> The machine had barely been used before each corruption problem; I'm
> not running some strange root-priv stuff, and each time, the FS hadn't
> had more than a few minutes to a few hours of use since being created.
> 
> In each case, I've gotten in trouble by editing my XF86Config-4 file,
> guessing wrong on a modeline, hanging X (blank gray screen & no
> response to anything), and being forced to hit the reset button
> because nothing else worked.  Under 7.2, I discovered that my
> XF86Config-4 file suddenly had a block of nulls in it.  That time, I
> thought I must have been hallucinating, but I ran a background job to
> sync the filesystem every second while continuing to debug the X
> problems, and didn't see the corruption again.
> 
> Now, I was just bitten by the -same- behavior under MDK 8.0.  After
> accidentally hanging X, I waited a few seconds just in case a sync was
> pending, hit reset, and had all sorts of lossage:
>   (1) Parts of the XF86Config-4 file had lines garbled, e.g.,
>       sections of the file had apparently been rearranged.
>   (2) /var/log/XFree86.0.log was truncated, and maybe garbled.
>   (3) Logging in as root was fine, but then logging in as myself
>       I got "Last login: <4-5 lines of my XFree86.0.log file (!)>"
>       instead of a date!  Logging in again gave me the proper
>       last-login time, but clearly wtmp or something else had
>       gotten stepped on in some weird way.
> Obviously, the behavior I saw once under MDK 7.2 was no hallucination
> or accidental yank in Emacs.
> 
> I thought the whole point of a journalling file system was to
> -prevent- corruption due to an unexpected failure!  This seems to be
> -far- worse than a normal filesystem---ext2fs would at least choke and
> force fsck to be run, which might actually fix the problem, but this
> is ridiculous---it just silently trashes random files.

Stock reiserfs only provides meta-data journalling. It guarantees that
structure of your file-system will be correct after journal replay, not
content of files. It will never "trash" file that wasn't accessed at
the moment of crash, though. Full data-journaling comes at cost. There
is patch by Chris Mason <Mason@Sus> to support data journaling in
reiserfs. Ext3 supports it also.

> 
> So I now have possibly-undetected filesystem damage.  My -guess- is
> that only files written within a few minutes of the reset are likely
> to be affected, but I really don't know, and don't know of a good way
> to find out.  Must I reinstall the OS -again-, starting from a blank
> partition, to be sure?  Maybe I should just give up on ReiserFS completely.
> 
> [If there is a more-appropriate place for me to send this---such as
> a particular Mandrake list, or a particular ReiserFS list---please let
> me know, particularly if I can get a quick answer -without- going

Reiserfs mail-list <Reiserfs-List@Nam>,
archive at http://marc.theaimsgroup.com/?l=reiserfs&r=1&w=2

> through the overhead of subscribing to the list, being flooded, and
> unsubscribing---that's what archives are for.  Some websearching
> for "ReiserFS corruption" yielded -thousands- of hits---not a good
> sign---and a very large proportion of them were on this list, so I
> figure this is as good a place to ask as any.  Thanks again.]

Nikita.

- - - Separator between forwarded messages - - -

Date: Sat, 22 Sep 2001 16:44:21 -0400 (EDT)
From: foner-reiserfs@med
To: Nikita@Nam
Subject: ReiserFS data corruption in very simple configuration
CC: linux-kernel@vge, Reiserfs-List@Nam

   Date: Sat, 22 Sep 2001 16:47:31 +0400
   From: Nikita Danilov <Nikita@Nam>

   Stock reiserfs only provides meta-data journalling. It guarantees that
   structure of your file-system will be correct after journal replay, not
   content of files. It will never "trash" file that wasn't accessed at
   the moment of crash, though.

Thanks for clarifying this.  However, I should point out that the
failure mode is quite serious---whereas ext2fs would simply fail
to record data written to a file before a sync, reiserfs seems to
have instead -swapped random pieces of one file with another-,
which is -much- harder to detect and fix.  I can live with uncommitted
changes, but it's hard to justify the behavior I saw---it means that
any even slightly-busy machine that experiences a crash could have
dozens or hundreds of files with each others' contents all mixed
together---remember, parts of my XF86Config file wound up in wtmp!
And both XF86Config and wtmp had been written at least 20 seconds
before I had to push the reset button, and perhaps > 30 seconds; I
don't recall how often the FS is syncing by default, but it's
disturbing behavior.  After all, at the time I pushed reset, I had
-no- files actually being written by any process under my direct
control; I'd merely written one file out from Emacs under a minute
earlier.  I'd hate to think of what would happen if this sort of thing
hit a CVS repository.

This seems to outweigh the convenience of a rapid start after failure
(one of the reasons I decided to try reiserfs in the first place),
because a failure means possibly having to recover an entire file
server from backups (hence losing -lots more- data) because you don't
know which files might have been trashed if the machine loses power or
the kernel panics.  There's no simple test ("did my edits make it into
the file?"), and no way to really know if the machine might later
misbehave because critical files have been overwritten with parts of
others.  (This inability to easily figure out what might have been
affected also means that the damage will rapidly propagate to backups,
hence making the backups useless.)  About the only way around it would
seem to be to checksum every file in the FS at regular intervals, and
rechecksum after a crash---at which point, what's the point of not
having to run fsck?
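
The checksum-everything workaround mentioned above can be sketched as follows. This is a minimal illustration, not a tool from the thread: the helper names `file_digest`, `build_manifest`, and `changed_files` are hypothetical, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib
import os


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each regular file under root (by relative path) to its digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.isfile(path) and not os.path.islink(path):
                manifest[os.path.relpath(path, root)] = file_digest(path)
    return manifest


def changed_files(before, after):
    """List paths whose digest differs, or that exist in only one manifest."""
    return sorted(
        p for p in before.keys() | after.keys() if before.get(p) != after.get(p)
    )
```

One would build a manifest at regular intervals, build another after a crash, and inspect whatever `changed_files` reports. A differing digest only says that a file changed, not whether the change was a legitimate write or corruption, and walking the whole tree costs at least as much I/O as the fsck it was meant to avoid.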

Is this -really- how reiserfs is supposed to behave in a crash?
It seems like this should be prominently documented in the description
of the file system---I know that I'm rather nervous about using it if
this is true, since it turns a few minutes of fsck'ing (for ext2fs)
into a restore-the-whole-file-system exercise instead.  Surely that's
not right.  If this is really supposed to be how reiserfs behaves any
time it doesn't get to sync before a machine dies on it, I can't see
how it can be justified for any production use, and I'll probably have
to reinstall my OS using ext2fs instead.

                            Full data-journaling comes at cost. There
   is patch by Chris Mason <Mason@Sus> to support data journaling in
   reiserfs. Ext3 supports it also.

Do you have a URL for this?  A search for reiserfs and mason yields
12,000 hits.  (I'm particularly interested in one for reiserfs 3.6.25
and Mandrake 8.0, but I assume there may be several variants in the
same repository.)

    > So I now have possibly-undetected filesystem damage.  My -guess- is
    > that only files written within a few minutes of the reset are likely
    > to be affected, but I really don't know, and don't know of a good way
    > to find out.  Must I reinstall the OS -again-, starting from a blank
    > partition, to be sure?  Maybe I should just give up on ReiserFS completely.
    > 
    > [If there is a more-appropriate place for me to send this---such as
    > a particular Mandrake list, or a particular ReiserFS list---please let
    > me know, particularly if I can get a quick answer -without- going

   Reiserfs mail-list <Reiserfs-List@Nam>,
   archive at http://marc.theaimsgroup.com/?l=reiserfs&r=1&w=2

Thanks.  I saw that list before, and I'm glad that you've included it
in this discussion.

- - - Separator between forwarded messages - - -

Date: Sat, 22 Sep 2001 14:02:40 -0700
From: Andrew Morton <akpm@zip>
To: foner-reiserfs@med
Subject: Re: ReiserFS data corruption in very simple configuration
Sender: akpm@vas
References: <15276.34915.301069.643178@bet> 
(message from Nikita Danilov on Sat, 22 Sep 2001 16:47:31 +0400)
<200109222044.QAA11638@out-of-band.media.mit.edu>

The *default* journalling mode for ext3 will write data before
committing metadata.  So this will never happen with ext3.  Guaranteed.

- - - Separator between forwarded messages - - -

Date: Sat, 22 Sep 2001 18:07:10 -0400 (EDT) 
From: Lenny Foner <foner@med>
To: akpm@zip
Subject: ReiserFS data corruption in very simple configuration

   Date: Sat, 22 Sep 2001 14:02:40 -0700
   From: Andrew Morton <akpm@zip>

   The *default* journalling mode for ext3 will write data before
   committing metadata.  So this will never happen with ext3.  Guaranteed.

That's good to know.  (How robust is this against the sort of caching
that typically goes on in disk drives, btw?)
I haven't checked---is this the default FS for MDK 8.0 if I haven't
selected reiserfs?  (My use of the term "ext2fs" was because I haven't
really kept up in the FS arena, so I'm probably somewhat out-of-date.)
How mature is ext3fs in general?

I'm seriously considering punting reiserfs if I don't get a good
answer to the questions I put to the lists, especially considering the
number of hits I'm getting to data-corruption queries on search
engines.  They seem to paint a picture of much less product maturity
than the reiserfs authors do themselves.

- - - Separator between forwarded messages - - -

Date: Sat, 22 Sep 2001 15:47:00 -0700 
From: Andrew Morton <akpm@zip>
To: Lenny Foner <foner@med>
Subject: Re: ReiserFS data corruption in very simple configuration
Sender: akpm@vas
References: <3BACFC70.73EDBD39@zip> (message from Andrew Morton 
on Sat, 22 Sep 2001 14:02:40 -0700)   
<200109222207.SAA11674@out-of-band.media.mit.edu>

Lenny Foner wrote:
> 
>     Date: Sat, 22 Sep 2001 14:02:40 -0700
>     From: Andrew Morton <akpm@zip>
> 
>     The *default* journalling mode for ext3 will write data before
>     committing metadata.  So this will never happen with ext3.  Guaranteed.
> 
> That's good to know.  (How robust is this against the sort of caching
> that typically goes on in disk drives, btw?)

Much-discussed point.

It's write-reordering that could bring ext3 unstuck.  The
time window and set of circumstances is really remote though.  For
starters, the drive would have to decide, when presented with
a linear sequence of blocks, to write the higher-numbered ones
first.  Nobody has been able to demonstrate a problem yet,
to my knowledge.

> I haven't checked---is this the default FS for MDK 8.0 if I haven't
> selected reiserfs?  (My use of the term "ext2fs" was because I haven't
> really kept up in the FS arena, so I'm probably somewhat out-of-date.)

Mandrake are shipping ext3.  I don't know if it's the default,
like redhat.

> How mature is ext3fs in general?

Not very mature at all, but it's undergone a heap of testing.
I'm quite confident in it.

> I'm seriously considering punting reiserfs if I don't get a good
> answer to the questions I put to the lists, especially considering the
> number of hits I'm getting to data-corruption queries on search
> engines.  They seem to paint a picture of much less product maturity
> than the reiserfs authors do themselves.

As Hans says, "that is the nature of metadata-only journalling".  Same
goes for JFS and, in theory, XFS.  Although XFS is said to get it
mostly-right as a side-effect of something else (not sure what).  But even
XFS does have data corruption problems across recovery.

All of this is precisely why I started the 2.4 kernel port of
ext3 - I need it for an embedded "appliance" product, and
people don't expect an appliance to shit itself if they simply
turn it off...

- - - Separator between forwarded messages - - -

Date: Sun, 23 Sep 2001 00:06:26 -0400 (EDT)
From: Lenny Foner <foner@med>
To: akpm@zip
Subject: ReiserFS data corruption in very simple configuration

   Date: Sat, 22 Sep 2001 15:47:00 -0700
   From: Andrew Morton <akpm@zip>

   Lenny Foner wrote:
   > 
   >     Date: Sat, 22 Sep 2001 14:02:40 -0700
   >     From: Andrew Morton <akpm@zip>
   > 
   >     The *default* journalling mode for ext3 will write data before
   >     committing metadata.  So this will never happen with ext3.  Guaranteed.
   > 
   > That's good to know.  (How robust is this against the sort of caching
   > that typically goes on in disk drives, btw?)

   Much-discussed point.

I'll bet.

   It's write-reordering that could bring ext3 unstuck.  The
   time window and set of circumstances is really remote though.  For
   starters, the drive would have to decide, when presented with
   a linear sequence of blocks, to write the higher-numbered ones
   first.  Nobody has been able to demonstrate a problem yet,
   to my knowledge.

I can see it deciding to do that if the head happened to be near there
already, but I'm obviously making an assumption about how block
numbers are laid out on the disk.  But you see my point.

   > I haven't checked---is this the default FS for MDK 8.0 if I haven't
   > selected reiserfs?  (My use of the term "ext2fs" was because I haven't
   > really kept up in the FS arena, so I'm probably somewhat out-of-date.)

   Mandrake are shipping ext3.  I don't know if it's the default,
   like redhat.

It's the default for RH?  Interesting.  I'll have to check my MDK.

   > How mature is ext3fs in general?

   Not very mature at all, but it's undergone a heap of testing.
   I'm quite confident in it.

And I suppose RH must be too, if it's their default, and presumably
that's the majority of the testing base?  (I'll do some websearching
for mandrake & ext3fs and see what I find.)

   > I'm seriously considering punting reiserfs if I don't get a good
   > answer to the questions I put to the lists, especially considering the
   > number of hits I'm getting to data-corruption queries on search
   > engines.  They seem to paint a picture of much less product maturity
   > than the reiserfs authors do themselves.

   As Hans says, "that is the nature of metadata-only journalling".  Same
   goes for JFS and, in theory, XFS.  Although XFS is said to get it
   mostly-right as a side-effect of something else (not sure what).  But even
   XFS does have data corruption problems across recovery.

Sounds completely wrong compared to what a FS is supposed to do,
though.  Missing data is fine, but corrupting what's left just
seems like it makes a bad situation worse.  People really put up
with this in production use?  How?

   All of this is precisely why I started the 2.4 kernel port of
   ext3 - I need it for an embedded "appliance" product, and
   people don't expect an appliance to shit itself if they simply
   turn it off...

Aaaahhhh...  I see.  Yeah, the fact that my TiVo doesn't have an off
switch really gives me the screaming fantods; there's just -no- way to
tell it, "I'm about to unplug you now."  But OTOH, the way they do it
is to have pretty much everything but /var and the MFS (the FS they
store the video in) mounted ro, and they've actually got two copies of
/, /var, and /usr (hence six partitions for those, plus extra
partitions for MFS and various other things; something like 14+ in
total), so if one comes up corrupted, it uses the other and then
repairs the corrupted one; the boot prom keeps track of which one is
current, I think.  (Dunno what happens if it can't repair it;
presumably it either copies one to the other or calls for help over
the phone; it's not clear to me without rereading its source code.)

- - - Separator between forwarded messages - - -

Date: Sat, 22 Sep 2001 21:18:59 -0700
From: Andrew Morton <akpm@zip>
To: Lenny Foner <foner@med>
Subject: Re: ReiserFS data corruption in very simple configuration
Sender: akpm@vas
References: <3BAD14E4.CC3544B9@zip> (message from Andrew Morton on 
Sat, 22 Sep 2001 15:47:00 -0700)  
<200109230406.AAA11988@out-of-band.media.mit.edu>

Lenny Foner wrote:
> 
>     Date: Sat, 22 Sep 2001 15:47:00 -0700
>     From: Andrew Morton <akpm@zip>
> 
>     Lenny Foner wrote:
>     >
>     >     Date: Sat, 22 Sep 2001 14:02:40 -0700
>     >     From: Andrew Morton <akpm@zip>
>     >
>     >     The *default* journalling mode for ext3 will write data before
>     >     committing metadata.  So this will never happen with ext3.  Guaranteed.
>     >
>     > That's good to know.  (How robust is this against the sort of caching
>     > that typically goes on in disk drives, btw?)
> 
>     Much-discussed point.
> 
> I'll bet.
> 
>     It's write-reordering that could bring ext3 unstuck.  The
>     time window and set of circumstances is really remote though.  For
>     starters, the drive would have to decide, when presented with
>     a linear sequence of blocks, to write the higher-numbered ones
>     first.  Nobody has been able to demonstrate a problem yet,
>     to my knowledge.
> 
> I can see it deciding to do that if the head happened to be near there
> already, but I'm obviously making an assumption about how block
> numbers are laid out on the disk.  But you see my point.

I do.  The time window is small though - you need to get an out-of-order
write like this AND pull the plug between the two blocks.

>     > I haven't checked---is this the default FS for MDK 8.0 if I haven't
>     > selected reiserfs?  (My use of the term "ext2fs" was because I haven't
>     > really kept up in the FS arena, so I'm probably somewhat out-of-date.)
> 
>     Mandrake are shipping ext3.  I don't know if it's the default,
>     like redhat.
> 
> It's the default for RH?  Interesting.  I'll have to check my MDK.
> 
>     > How mature is ext3fs in general?
> 
>     Not very mature at all, but it's undergone a heap of testing.
>     I'm quite confident in it.
> 
> And I suppose RH must be too, if it's their default, and presumably
> that's the majority of the testing base?  (I'll do some websearching
> for mandrake & ext3fs and see what I find.) 

It went through RedHat QA OK.  Better than reiserfs...

>     > I'm seriously considering punting reiserfs if I don't get a good
>     > answer to the questions I put to the lists, especially considering the
>     > number of hits I'm getting to data-corruption queries on search
>     > engines.  They seem to paint a picture of much less product maturity
>     > than the reiserfs authors do themselves.
> 
>     As Hans says, "that is the nature of metadata-only journalling".  Same
>     goes for JFS and, in theory, XFS.  Although XFS is said to get it
>     mostly-right as a side-effect of something else (not sure what).  But even
>     XFS does have data corruption problems across recovery.
> 
> Sounds completely wrong compared to what a FS is supposed to do,
> though.  Missing data is fine, but corrupting what's left just
> seems like it makes a bad situation worse.  People really put up
> with this in production use?  How?
> 
>     All of this is precisely why I started the 2.4 kernel port of
>     ext3 - I need it for an embedded "appliance" product, and
>     people don't expect an appliance to shit itself if they simply
>     turn it off...
> 
> Aaaahhhh...  I see.  Yeah, the fact that my TiVo doesn't have an off
> switch really gives me the screaming fantods; there's just -no- way to
> tell it, "I'm about to unplug you now."  But OTOH, the way they do it
> is to have pretty much everything but /var and the MFS (the FS they
> store the video in) mounted ro, and they've actually got two copies of
> /, /var, and /usr (hence six partitions for those, plus extra
> partitions for MFS and various other things; something like 14+ in
> total), so if one comes up corrupted, it uses the other and then
> repairs the corrupted one; the boot prom keeps track of which one is
> current, I think.  (Dunno what happens if it can't repair it;
> presumably it either copies one to the other or calls for help over
> the phone; it's not clear to me without rereading its source code.)

hum.  Where does one find the TiVo source?

- - - Separator between forwarded messages - - -

Date: Tue, 25 Sep 2001 14:28:54 +0100
From: "Stephen C. Tweedie" <sct@red>
To: foner-reiserfs@med
Subject: Re: ReiserFS data corruption in very simple configuration
Cc: Nikita@Nam, Stephen Tweedie <sct@red>,
        linux-kernel@vge
References: <15276.34915.301069.643178@bet>  <200109222044.QAA11638@out-of-band.media.mit.edu>

Hi,

On Sat, Sep 22, 2001 at 04:44:21PM -0400, foner-reiserfs@med wrote: 

>     Stock reiserfs only provides meta-data journalling. It guarantees that
>     structure of your file-system will be correct after journal replay, not
>     content of files. It will never "trash" file that wasn't accessed at
>     the moment of crash, though.
> 
> Thanks for clarifying this.  However, I should point out that the
> failure mode is quite serious---whereas ext2fs would simply fail
> to record data written to a file before a sync, reiserfs seems to
> have instead -swapped random pieces of one file with another-,
> which is -much- harder to detect and fix.

Not true.  ext2, ext3 in its "data=writeback" mode, and reiserfs can
all demonstrate this behaviour.  Reiserfs is being no worse than ext2
(the timings may make the race more or less likely in reiserfs, but
ext2 _is_ vulnerable.) 

e2fsck only restores metadata consistency on ext2 after a crash: it
can't possibly guarantee that all the data blocks have been written.

ext3 will let you do full data journaling, but also has a third mode
(the default), which doesn't journal data, but which does make sure
that data is flushed to disk before the transaction which allocated
that data is allowed to commit.  That gives you most of the
performance of ext3's fast-and-loose writeback mode, but with an
absolute guarantee that you never see stale blocks in a file after a
crash.
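
The difference between the writeback and ordered behaviour described above can be illustrated with a toy model. This is only a sketch under loud assumptions: `crash_after`, block number 7, and the two-step operation list are all hypothetical, and none of it is ext3 code. Writeback mode does not force the metadata commit to land first; the model simply picks that permitted ordering to show what can go wrong.

```python
def crash_after(steps, mode):
    """Run the first `steps` operations of a one-block file write, then 'crash'.

    Returns the data a reader would see in the file after recovery, given
    that only committed metadata survives. (Toy model, not real ext3.)
    """
    # Block 7 was freed by some other file earlier and never zeroed.
    disk = {7: b"stale data from another file"}
    file_blocks = []  # committed metadata: blocks the new file owns

    def write_data():
        disk[7] = b"new file contents"

    def commit_metadata():
        file_blocks.append(7)

    if mode == "ordered":
        ops = [write_data, commit_metadata]  # data is flushed before the commit
    elif mode == "writeback":
        ops = [commit_metadata, write_data]  # one permitted ordering: commit first
    else:
        raise ValueError(f"unknown mode: {mode}")

    for op in ops[:steps]:
        op()
    return [disk[block] for block in file_blocks]
```

In this model, crashing after one step in "writeback" mode leaves the file pointing at another file's stale block, while the same crash in "ordered" mode leaves the file without the block at all: data is lost but never replaced by foreign data.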

Cheers,
 Stephen

- - - Separator between forwarded messages - - -

Date: Thu, 27 Sep 2001 09:56:36 +0200
From: Milos Prudek <prudek@nem>
To: foner-reiserfs@med
Subject: Re: [reiserfs-list] ReiserFS data corruption in very simple configuration
References: <200109222044.QAA11638@out-of-band.media.mit.edu>

>     Stock reiserfs only provides meta-data journalling. It guarantees that
>     structure of your file-system will be correct after journal replay, not
>     content of files. It will never "trash" file that wasn't accessed at
>     the moment of crash, though.
>
> Thanks for clarifying this.  However, I should point out that the
> failure mode is quite serious---whereas ext2fs would simply fail

Hey, that was good. Finally someone with enough courage to tell the reiserfs 
developers about it, clearly but not in an insulting way.  

When I was hit by reiserfs data trashing due to lack of data journaling, I 
felt CHEATED. The reiserfs hype led me to believe what you believed.

I'm desperate for Chris Mason's patch. 

--
Milos Prudek

- - - Separator between forwarded messages - - -

Date: Sat, 29 Sep 2001 00:44:59 -0400 (EDT)
From: Lenny Foner <foner-reiserfs@med>
To: sct@red
Subject: ReiserFS data corruption in very simple configuration
CC: Nikita@Nam, Mason@Sus
CC: linux-kernel@vge, reiserfs-list@Nam

[As before, please make sure you CC me on replies or I won't see them.  Tnx!]

    Date: Tue, 25 Sep 2001 14:28:54 +0100
    From: "Stephen C. Tweedie" <sct@red>

    Hi,

    On Sat, Sep 22, 2001 at 04:44:21PM -0400, foner-reiserfs@med wrote:

    >     Stock reiserfs only provides meta-data journalling. It guarantees that
    >     structure of your file-system will be correct after journal replay, not
    >     content of files. It will never "trash" file that wasn't accessed at
    >     the moment of crash, though.
    > 
    > Thanks for clarifying this.  However, I should point out that the
    > failure mode is quite serious---whereas ext2fs would simply fail
    > to record data written to a file before a sync, reiserfs seems to
    > have instead -swapped random pieces of one file with another-,
    > which is -much- harder to detect and fix.

    Not true.  ext2, ext3 in its "data=writeback" mode, and reiserfs can
    all demonstrate this behaviour.  Reiserfs is being no worse than ext2
    (the timings may make the race more or less likely in reiserfs, but
    ext2 _is_ vulnerable.)

ext2fs can write parts of file A to file B, and vice versa, and this
isn't fixed by fsck?  [See outcome (d) below.]  I'm having difficulty
believing how this can be possible for a non-journalling filesystem.

    e2fsck only restores metadata consistency on ext2 after a crash: it
    can't possibly guarantee that all the data blocks have been written. 

But what about written to the wrong files?  See below.

    ext3 will let you do full data journaling, but also has a third mode
    (the default), which doesn't journal data, but which does make sure
    that data is flushed to disk before the transaction which allocated
    that data is allowed to commit.  That gives you most of the
    performance of ext3's fast-and-loose writeback mode, but with an
    absolute guarantee that you never see stale blocks in a file after a
    crash.

I've been getting a stream of private mail over the last few days
saying one thing or another about various filesystems with various
optional patches, so let me get this out in the open and see if we can
converge on an answer here.  [ext2fs, ext3fs, and reiserfs answers
should feel free to cite which mode they're talking about and URLs for
whatever patches are required to get to that mode; some impressions
about reliability and maturity would be useful, too.]

Let's take this scenario:  Files A and B have had blocks written to
them sometime in the recent past (30 to 60 seconds or so) and a sync
has not happened yet.  (I don't know how often reiserfs will be synced
by default; 60 seconds?  Longer?  Presumably running "sync" will force
it, but I don't know when else it will happen.)  File A may have been
completely rewritten or newly written (e.g., what Emacs does when it
saves a file), whereas file B may have simply been appended to (e.g.,
what happens when wtmp is updated).

The CPU reset button is then pushed.  [See P.P.S. at end of this message.]

Now, we have the following possibilities for the outcome after the
system comes back up and has finished checking its filesystem:

(a) Metadata correctly written, file data correctly written.
(b) Metadata correctly written, file data partially written.
    (E.g., one or both files might have been partially or completely
    updated.) 
(c) Metadata correctly written, file data completely unwritten.
    (Neither file got updated at all.)
(d) Metadata correctly written, FILE DATA INTERCHANGED BETWEEN A AND B.
    (E.g., File A gets some of file B written somewhere within it,
    and file B gets some of file A written somewhere within it---this
    is the behavior I observed, at least twice, with reiserfs.)
(e) Metadata corrupted in some fashion, file data undefined.
    ("Undefined" means could be any of (a) through (d) above; I don't care.)

Now, which filesystems can show each outcome?  I don't know.  I
contend that reiserfs does (d).  Stephen Tweedie talks above about
whether we can "guarantee that all the data blocks have been written",
but may be missing the point I was making, namely that THE BLOCKS HAVE
BEEN WRITTEN TO THE WRONG FILES.

It would be nice to know, for each of ext2fs, ext3fs, and reiserfs,
what the -intended- outcome is, and what the -actual- outcome is
(since implementation bugs might make the actual outcome different
from the intended outcome).  Any additional filesystems anyone would
like to toss into the pot would be welcome; maybe I'll post a matrix
of the results, if we get some.
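
[One way to gather the -actual- outcomes asked about above is to fill
each test file with self-describing records, push reset, and classify
what survived.  The Python sketch below is only an illustration: the
record format, the function names, and the four labels are assumptions
made here, not part of any filesystem or tool discussed in this thread.]

```python
import os


def make_record(tag, seq, width=64):
    """Return a fixed-width, self-describing line, e.g. b'A 00000007 ....\\n'."""
    head = f"{tag} {seq:08d} "
    return (head + "." * (width - len(head) - 1) + "\n").encode("ascii")


def write_records(path, tag, count, width=64):
    """Write `count` records tagged with `tag` to `path` (no fsync, on purpose)."""
    with open(path, "wb") as f:
        for seq in range(count):
            f.write(make_record(tag, seq, width))


def classify(path, tag, expected_count, width=64):
    """Label what survived: 'complete', 'partial', 'unwritten', or 'foreign'.

    'foreign' means some record is not the one this file should hold.  That
    covers data from another file as well as stale or zeroed blocks; this
    check cannot tell those cases apart.
    """
    try:
        with open(path, "rb") as f:
            data = f.read()
    except FileNotFoundError:
        return "unwritten"
    good = 0
    for off in range(0, len(data), width):
        chunk = data[off:off + width]
        expected = make_record(tag, good, width)
        if chunk == expected:
            good += 1
        elif expected.startswith(chunk):
            return "partial"  # the final record was cut short
        else:
            return "foreign"
    if good == 0:
        return "unwritten"
    return "complete" if good >= expected_count else "partial"
```

[Usage would be to call write_records() for file A with tag "A" and for
file B with tag "B", reset before a sync, and run classify() on each
file after the reboot.  "complete", "partial", and "unwritten" line up
with outcomes (a), (b), and (c); "foreign" is a candidate for (d).]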

I'm -assuming- that the intended outcome for reiserfs (without data
journalling) is one of (a), (b), or (c).  If the intended outcome for
reiserfs without data journalling [or -any- FS, really] is in fact
(d), then I don't understand how this filesystem can be intended for
any reliable service, since a failure will garble all files written in
the last several seconds in a fashion that is very, very difficult to
unscramble.  (-Perhaps-, if all the metadata is indeed correct, it
would be possible to at least -identify- which files may have gotten
smashed, by looking for all files whose mtime or ctime is in the last
60 seconds (more?) before the failure, but they'd still be trashed in
bizarre ways---it's much easier to fix a file (particularly a text
file) that is simply out of date (having had only some, or none, of
its recent data written) than it is to fix one that's had data from
other file(s) added to it in random places.  Furthermore, files such
as wtmp will probably get their mtime modified the instant the system
comes back up, further muddying the waters.)
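
[The "look for files whose mtime or ctime falls shortly before the
failure" idea can be sketched in Python as follows.  This is a minimal
illustration under assumptions: the function name, the 60-second default
window, and the need to supply the crash time by hand are all choices
made here, and the result is only as trustworthy as the timestamps
themselves.]

```python
import os


def modified_before_crash(root, crash_time, window=60.0):
    """List files under `root` whose mtime or ctime lies in the `window`
    seconds leading up to `crash_time` (seconds since the epoch)."""
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # do not follow symlinks
            except OSError:
                continue  # vanished or unreadable; skip it
            stamps = (st.st_mtime, st.st_ctime)
            if any(crash_time - window <= t <= crash_time for t in stamps):
                suspects.append(path)
    return sorted(suspects)
```

[As noted above, files that are touched again as soon as the system
comes back up (wtmp, for example) will have newer timestamps and will
be missed by a scan like this.]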

Can someone(s) help to address the above?  And, even better, could
this information be placed prominently on the web pages describing the
relevant file systems?  It would be extremely useful for people trying
to decide which one to run to know this -before- they have committed
umpteen gigabytes to one or the other and -then- get bitten.

Thanks!

P.S.  Nikita Danilov said that there is a data-journalling patch to
reiserfs written by Chris Mason <Mason@Sus>, but has not responded
with a URL to it; can someone (or Chris? now CC'ed) do so?  A search
for reiserfs and mason is useless, yielding 12,000 hits.  (I'm
particularly interested in one for reiserfs 3.6.25 and Mandrake 8.0,
but I assume there may be several variants in the same repository.)
Benchmarking data on the performance impact of data journalling for
reiserfs, ext3fs, and anything else anyone cares to supply would
probably be useful to lots of people as well.

P.P.S.  I say reset and not power-off, although I hope that this is
moot, because I presume that the unsynced data, by virtue of being
unsynced, is nowhere near the disk datapaths anyway.  But either way,
a reset should let the disks continue to write data out of their write
buffers, assuming that a CPU reset doesn't flush such pending
transactions; I don't know if there's some IDE bus sequence that can
do this, and whether CPU reset would issue such a sequence.  It may
not matter; is it common that disks might leave data buffered but
unwritten for 30 seconds if there is no other disk activity?  I would
hope that this is -not- true and that the buffered data is buffered
only while there is other activity, since failing to flush the buffer
when the disk is idle only increases the risk of losing it without
improving performance at all.

- - - Separator between forwarded messages - - -

Date: Sat, 29 Sep 2001 14:52:29 +0200
From: <pcg@goo ( Marc) (A.) (Lehmann )>
To: Lenny Foner <foner-reiserfs@med>
Subject: Re: [reiserfs-list] ReiserFS data corruption in very simple  configuration
Cc: sct@red, Nikita@Nam, Mason@Sus,
        linux-kernel@vge, reiserfs-list@Nam
Mail-Followup-To: Lenny Foner <foner-reiserfs@med>,
       sct@red, Nikita@Nam, Mason@Sus,
       linux-kernel@vge, reiserfs-list@Nam
References: <20010925142854.A5384@red> <200109290444.AAA19624@out-of-band.media.mit.edu>
X-Operating-System: Linux version 2.4.8-ac8 (root@cer) (gcc version 3.0.1) 

On Sat, Sep 29, 2001 at 12:44:59AM -0400, Lenny Foner <foner-reiserfs@med>  wrote:
> isn't fixed by fsck?  [See outcome (d) below.]  I'm having difficulty
> believing how this can be possible for a non-journalling filesystem. 

If you have difficulties in believing this, may I ask you how you think it
is possible for a non-journaling filesystem to prevent this at all?

> But what about written to the wrong files?  See below. 

What you see is most probably *old* data, not data from another (still
existing) file.

> has not happened yet.  (I don't know how often reiserfs will be synced
> by default; 60 seconds?  Longer?  Presumably running "sync" will force 

mostly like with any other filesystem (man bdflush)

> Now, we have the following possibilities for the outcome after the

> (a) Metadata correctly written, file data correctly written.

all filesystems ;)

> (b) Metadata correctly written, file data partially written.
>     (E.g., one or both files might have been partially or completely
>     updated.) 

ext2, reiserfs.

> (c) Metadata correctly written, file data completely unwritten.
>     (Neither file got updated at all.)

ext2, reiserfs.

> (d) Metadata correctly written, FILE DATA INTERCHANGED BETWEEN A AND B.

this shouldn't happen on reiserfs. however, the unwritten parts of file a
can easily contain data formerly in file b.

> (e) Metadata corrupted in some fashion, file data undefined.
>     ("Undefined" means could be any of (a) through (d) above; I don't care.)

this should be prevented by journaling (of course, this won't help against
harddisk failures) on reiserfs. ext2 often has this problem, but fsck usually
can repair it. it's easy to tell metadata from filedata on ext2.

> whether we can "guarantee that all the data blocks have been written",
> but may be missing the point I was making, namely that THE BLOCKS HAVE
> BEEN WRITTEN TO THE WRONG FILES.

remember that the blocks have previous content, and reiserfs' tails
optimization means that files appended all the time (wtmp) can move around
rapidly (at least their tail).

> P.P.S.  I say reset and not power-off, although I hope that this is
> moot, because I presume that the unsynced data, by virtue of being
> unsynced, is nowhere near the disk datapaths anyway.

this can make a big difference. many disks (ibm, maxtor) nowadays write
partial blocks on power outage, this gives "Uncorrectable read errors",
which is fatal, because no filesystem so far can work around this. It's
easy to repair (just rewrite the block), but would require filesystem
feedback (hey, reiserfs, this metadata block is *garbage*).

> a reset should let the disks continue to write data out of their write
> buffers, assuming that a CPU reset doesn't flush such pending

they should, yes. OTOH, ide disks are cheap...

> not matter; is it common that disks might leave data buffered but
> unwritten for 30 seconds if there is no other disk activity?  I would

no. and it doesn't make sense. but it's not forbidden or sth.

-- 
     -----==-                                             |
     ----==-- _                                           |
     ---==---(_)__  __ ____  __       Marc Lehmann      +--
     --==---/ / _ \/ // /\ \/ /       pcg@goo      |e|
     -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
   The choice of a GNU generation                       |
                                                        |

- - - Separator between forwarded messages - - -

Date: Sun, 30 Sep 2001 21:00:49 -0400 (EDT)
From: foner-reiserfs@med
To: pcg@goo
Subject: [reiserfs-list] ReiserFS data corruption in very simple configuration
CC: sct@red, Nikita@Nam, Mason@Sus,
        linux-kernel@vge, reiserfs-list@Nam

    Date: Sat, 29 Sep 2001 14:52:29 +0200
    From: <pcg@goo ( Marc) (A.) (Lehmann )>

Thanks for your response!  Bear with me, though, because I'm asking
a design question below that relates to this.

    On Sat, Sep 29, 2001 at 12:44:59AM -0400, Lenny Foner <foner-reiserfs@med> wrote:
    > isn't fixed by fsck?  [See outcome (d) below.]  I'm having difficulty
    > believing how this can be possible for a non-journalling filesystem.

    If you have difficulties in believing this, may I ask you how you think it
    is possible for a non-journaling filesystem to prevent this at all?

Naively, one would assume that any non-journalling FS that has written
correct metadata through to the disk would either have written updates
into files, or failed to write them, but would not have written new
(<60 second old) data into different files than the data was destined for.
(I suppose the assumption I'm making here is that, when creating or
extending a file, the metadata is written -last-, e.g., file blocks
are allocated, file data is written, and -then- metadata is written.
That way, a failure anywhere before finality simply seems to vanish,
whereas writing metadata first seems to cause the lossage below.)

   > But what about written to the wrong files?  See below.

   What you see is most probably *old* data, not data from another (still
   existing) file.

I'm...  dubious, but maybe.  As mentioned earlier in this thread,
one of the failures I saw consisted of having several lines of my
XFree86.0.log file appended to wtmp---when I logged in after the
failure, I got "Last login: " followed by several lines from that file
instead of a date.  (Other failures scrambled other files worse.)

Now, it's -possible- that rsfs allocated an extra portion to the end
of wtmp for the last-login data (as a user of the fs, I don't care
whether officially this was a "block", an entry in a journal, etc),
login "wrote" to that region (but it wasn't committed yet 'cause no
sync), my XFree86.0.log file was "created" and "written" (again
uncommitted), I pushed reset, and then when it came back up, the end
of wtmp had data from the -previous- copy of XFree86.0.log that had
been freed (because it was unlinked when the next copy was written)
but which had not actually had the wtmp data written to it yet
(because a sync hadn't happened).  I have no way to verify this, since
one XFree86.0.log looks much like the other.  Conceptually, this would
imply that wtmp was extended into disk freespace, which just happened
to have that logfile in it (instead of zero bytes).  Is this what
you're talking about when you say "*old* data"?  I think so, and that
seems to match your comment below about file tails moving around
rapidly.

But it doesn't explain -why- it works this way in the first place.
Wouldn't it make more sense to commit metadata to disk -after- the
data blocks are written?  After all, if -either one- isn't written,
the file is incomplete.  But if the metadata is written -last-, the
file simply looks like the data was never added.  If the metadata is
written -first-, the file can scoop up random trash from elsewhere in
the filesystem.  I contend that this is -much- worse, because it can
render a previously-good file completely unparseable by tools that
expect that -all- of the file is in a particular syntax.  It's just
an accident, I guess, that login will accept any random trash when
it prints its "last-login" message, rather than falling over with a
coredump because it doesn't look like a date.  [And see * below.]

Unfortunately, this behavior meant that X -did- fall over, because my
XF86Config file was trashed by being scrambled---I'd recently written
out a new version, after all---and the trashed copy no longer made any
sense.  I would have been -much- happier to have had the -unmodified-,
-old- version than a scrambled "new" version!  Without Emacs ~ files,
this would have been much worse.  Consider an app that, "for reliability",
rewrites a file by creating a temp copy, writing it out, then renaming
the temp over the original [this is how Emacs typically saves files].
But if you write the metadata first, you foil this attempt to be safe,
because you might have this sequence at the actual disk:  [magnetic
oxide updated w/rename][start updating magnetic oxide with tempfile
data][power failure or reset]---ooops! original file gone, new file
doesn't have its data yet, so sorry, thanks for playing. 

By writing metadata first, it seems that reiserfs violates the
idempotence of many filesystem operations, and does exactly the
opposite of what "journalling" implies to anyone who understands
databases, namely that either the operation completes entirely, or it
is completely undone.  Yes, yes, I know (now!) that it claims to only
journal the metadata, but how does this help when what it's essentially
doing is trashing the -data- in unexpected ways exactly when such
journalling is supposed to help, namely across a machine failure?

This seems like such an elementary design defect that I'm at a loss
to understand why it's there.  There -must- be some excellent reason,
right?  But what?  And if not, can it be fixed?

I'm also still waiting to find out how to make reiserfs actually
journal its data, and what the performance implications of this are.
No one has responded with a URL.

[*] It's also a security hole.  If I want to read a file that I'm not
authorized to read, -but- I can cause a kernel panic (or a blackout!),
then I can craftily wait until up to several seconds after the
"secure" file is being rewritten (presumably via the write-tempfile-
and-relink method), create a big file of my own, and force the
panic---my file may then get some of the secure blocks from the old
copy.  And, unlike filesystems that write metadata last, the "secure"
program can't just zero out the blocks of the file it's about to
unlink, because---since metadata is written first---those zeroes won't
have made it to disk yet even though the blocks have been declared
free and included in my file.  I now know what's in your file.
Whoops.  And this is such an enormous timing hole that I can write a
program that just checks every 5 seconds or so for a new copy of the
secure file, -then- forces the failure---I need not get the timing
very good, as long as it's likely that I'll do so before the next
sync.  It's so bad that, even if I can't force a panic, my program
can just beep and I'll immediately go short out the outlet that
happens to be on the same circuit as the machine I'm attacking.

    [ . . . ]

   > (d) Metadata correctly written, FILE DATA INTERCHANGED BETWEEN A AND B.

   this shouldn't happen on reiserfs. however, the unwritten parts of file a
   can easily contain data formerly in file b.

Then why allow metadata to be written first instead of last?

   > (e) Metadata corrupted in some fashion, file data undefined.
   >     ("Undefined" means could be any of (a) through (d) above; I don't care.)

   this should be prevented by journaling (of course, this won't help against
   harddisk failures) on reiserfs. ext2 often has this problem, but fsck usually
   can repair it. it's easy to tell metadata from filedata on ext2.

   > whether we can "guarantee that all the data blocks have been written",
   > but may be missing the point I was making, namely that THE BLOCKS HAVE
   > BEEN WRITTEN TO THE WRONG FILES.

   remember that the blocks have previous content, and reiserfs' tails
   optimization means that files appended all the time (wtmp) can move around
   rapidly (at least their tail).

   [ . . . ]

- - - Separator between forwarded messages - - -

Date: Mon, 1 Oct 2001 03:26:27 +0200
From: <pcg@goo ( Marc) (A.) (Lehmann )>
To: foner-reiserfs@med
Subject: Re: [reiserfs-list] ReiserFS data corruption in very simple  configuration
Cc: sct@red, Nikita@Nam, Mason@Sus,
       linux-kernel@vge, reiserfs-list@Nam
Mail-Followup-To: foner-reiserfs@med, sct@red,
       Nikita@Nam, Mason@Sus, linux-kernel@vge,
       reiserfs-list@Nam
References: <20010929145229.C26231@sch>
<200110010100.VAA07189@out-of-band.media.mit.edu>
X-Operating-System: Linux version 2.4.8-ac9 (root@cer) (gcc version 3.0.1) 

On Sun, Sep 30, 2001 at 09:00:49PM -0400, foner-reiserfs@med wrote:
> extending a file, the metadata is written -last-, e.g., file blocks
> are allocated, file data is written, and -then- metadata is written. 

this is almost impossible to achieve with existing hardware (witness the
many discussions about disk caching for example), and, without journaling,
might even be slow.

> of wtmp had data from the -previous- copy of XFree86.0.log that had
> been freed (because it was unlinked when the next copy was written)
> but which had not actually had the wtmp data written to it yet

It's easily possible, but it could also be a bug. Let the reiserfs authors
decide.

However, if it is indeed "a bug" then fixing it would only lower the
frequency of occurrence.

Only ext3 (some modes) + turning off your harddisk's cache can ensure
this, at the moment.

> to have that logfile in it (instead of zero bytes).  Is this what
> you're talking about when you say "*old* data"?  I think so, and that
> seems to match your comment below about file tails moving around
> rapidly.

appending to logfiles will result in a lot of movement. with other,
strictly block-based filesystems this occurs relatively frequent, and data
will not usually move around. with reiserfs tail movement is frequent.

> Wouldn't it make more sense to commit metadata to disk -after- the
> data blocks are written?

The problem is that there is currently no easy way to achieve that.

> file simply looks like the data was never added.  If the metadata is
> written -first-, the file can scoop up random trash from elsewhere in

Also, this is not a matter of metadata first or last. Sometimes you need
metadata first, sometimes you need it last. And in many cases, "metadata"
does not need to change, while data still changes.

> the filesystem.  I contend that this is -much- worse, because it can
> render a previously-good file completely unparseable by tools that
> expect that -all- of the file is in a particular syntax.

It depends - with ext2 you frequently have garbled files, too. Basically, if
you write to a file and turn off the power the outcome is unexpected, and
will always be (unless you are ready to take the big speed hit).

> Unfortunately, this behavior meant that X -did- fall over, because my
> XF86Config file was trashed by being scrambled---I'd recently written
> out a new version, after all---and the trashed copy no longer made any

But the same thing can and does happen with ext2, depending on your editor
and your timing. It is not a reiserfs thing.

> But if you write the metadata first, you foil this attempt to be safe,
> because you might have this sequence at the actual disk:  [magnetic
> oxide updated w/rename][start updating magnetic oxide with tempfile
> data][power failure or reset]---ooops! original file gone, new file
> doesn't have its data yet, so sorry, thanks for playing.

Of course. If you want data to hit the disk, you have to use fsync. This
does work with reiserfs and will ensure that the data hits the disk. If
you don't do this then bad things might happen.
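
[A minimal Python sketch of the "fsync before rename" sequence, assuming
a POSIX system.  The function name and the temp-file prefix are
illustrative choices made here.  Whether the data truly reaches the
platters still depends on the disk's write cache, as discussed elsewhere
in this thread.]

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data` (bytes) via a temp file in the same
    directory, forcing the data out before the rename takes effect."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()              # Python's buffer -> kernel
            os.fsync(tmp.fileno())   # kernel -> disk, before any rename
        os.replace(tmp_path, path)   # rename the temp file over the original
    except BaseException:
        try:
            os.unlink(tmp_path)      # do not leave a stray temp file behind
        except FileNotFoundError:
            pass
        raise
    # Also fsync the directory so that the rename itself is on disk.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

[The intent is that a reset at any point leaves either the complete old
file or the complete new one under the original name.]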

> By writing metadata first, it seems that reiserfs violates the
> idempotence of many filesystem operations, and does exactly the
> opposite of what "journalling" implies to anyone who understands
> databases, namely that either the operation completes entirely, or it
> is completely undone.

You are confusing databases with filesystems, however. Most journaling
filesystems work that way. Some (like ext3) are nice enough to let you
choose.

> journal the metadata, but how does this help when what it's essentially
> doing is trashing the -data- in unexpected ways exactly when such
> journalling is supposed to help, namely across a machine failure?

But ext2 works in the same way. It does happen more often with reiserfs
(especially with tails), but ignoring the problem for ext2 doesn't make it
right. If applications don't work reliably with reiserfs, they don't work
reliably with ext2. If you want reliability then mount synchronous. 

> This seems like such an elementary design defect that I'm at a loss
> to understand why it's there. 

About every filesystem does have this "elementary design defect". If you
want data to hit the disk, sync it. It's that simple.

> There -must- be some excellent reason,
> right?  But what?  And if not, can it be fixed?

Speed is an excellent reason. The fix is to tell the kernel to write the
data out to the platters.

Anyway, this is a good time to review the various discussions on the
reiserfs list and the kernel list on how to teach the kernel (if it is
possible) to implement loose write-ordering.

-- 
     -----==-                                             |
     ----==-- _                                           |
     ---==---(_)__  __ ____  __       Marc Lehmann      +--
     --==---/ / _ \/ // /\ \/ /       pcg@goo      |e|
     -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
   The choice of a GNU generation                       |
                                                        |

- - - Separator between forwarded messages - - -

Date: Sun, 30 Sep 2001 22:32:47 -0400 (EDT)
From: foner-reiserfs@med
To: pcg@goo
Subject: [reiserfs-list] ReiserFS data corruption in very simple configuration
CC: sct@red, Nikita@Nam, Mason@Sus,
       linux-kernel@vge, reiserfs-list@Nam

   Date: Mon, 1 Oct 2001 03:26:27 +0200
   From: <pcg@goo ( Marc) (A.) (Lehmann )>

   On Sun, Sep 30, 2001 at 09:00:49PM -0400, foner-reiserfs@med wrote:
   > extending a file, the metadata is written -last-, e.g., file blocks
   > are allocated, file data is written, and -then- metadata is written.

   this is almost impossible to achieve with existing hardware (witness the
   many discussions about disk caching for example), and, without journaling,
   might even be slow.

I think perhaps we may be talking past each other; let me try to clarify.

As I said earlier in this thread, this has nothing at all to do with
disk caching.  Let me restate this again:  The scenario I'm discussing
is an otherwise-idle machine that had 2 (maybe 3) files modified, sat
idle for 30-60 seconds, and then had the reset button pushed.  I would
expect that either file data and metadata got written, or neither got
written, but not metadata without file data.  This is repeatable more
or less at will---I didn't -just- happen to catch it -just- as it
decided to frob the disks.  Instead, the problem seems to be that
reiserfs is perfectly happy to update the on-disk representation of
which disk blocks contain which files' data, and then -sit there- for
a long time (a minute? longer?) without -also- attempting to flush the
file data to the disk.  This then leads to corrupted files after the
reset.  It's not that the CPU sent data to the disk subsystem that
failed to be written by the time of the interruption; it's that the
data was still sitting in RAM and the CPU hadn't even decided to get
it out the IDE channel yet.  This means that there is -always- a giant
timing hole which can corrupt data, as opposed to just the much-tinier
hole that would be created if the file-bytes-to-disk-bytes correspondence
were updated immediately after the write that wrote the data---it
would be hard for me to accidentally hit such a hole.

   > of wtmp had data from the -previous- copy of XFree86.0.log that had
   > been freed (because it was unlinked when the next copy was written)
   > but which had not actually had the wtmp data written to it yet

   It's easily possible, but it could also be a bug. Let the reiserfs authors
   decide.

   However, if it is indeed "a bug" then fixing it would only lower the
   frequency of occurrence.

True, but as long as it makes it only happen if the disk is -in
progress of writing stuff- when the reset or power failure happens,
the risk is -greatly- reduced.  Right now, it's an enormous timing
hole, and one that's likely to be hit---it's happened to me -every
single time- I've had to hit the reset button because (for example)
I wedged X while debugging, and even if I waited a minute after the
wedge-up to do so!  The way I've avoided it is by running a job that
syncs once a second while doing debugging that might possibly make me
unable to take the machine down cleanly.  This is a disgusting and
unreliable kluge.
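
[For reference, a job of that kind might look like the Python sketch
below.  It assumes a Unix system where os.sync() is available; the
function name and its parameters are illustrative.  It only narrows the
window between a write and the next flush; it does not close it.]

```python
import os
import time


def sync_loop(interval=1.0, iterations=None):
    """Call sync() every `interval` seconds.

    Runs forever when `iterations` is None; otherwise stops after that
    many syncs and returns the number performed.
    """
    done = 0
    while iterations is None or done < iterations:
        os.sync()            # ask the kernel to flush all dirty buffers
        done += 1
        time.sleep(interval)
    return done
```

[Running sync_loop() in the background during a risky debugging session
is equivalent to the once-a-second job described above.]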

   Only ext3 (some modes) + turning off your harddisk's cache can ensure
   this, at the moment.

Or ext3 (some modes) + assuming that the disk will at least write data
that's been sent to it, even if the CPU gets reset.  (I know it's
hopeless if power fails, but that can be made arbitrarily unlikely,
compared to a kernel panic or having to do a CPU reset.)

   > to have that logfile in it (instead of zero bytes).  Is this what
   > you're talking about when you say "*old* data"?  I think so, and that
   > seems to match your comment below about file tails moving around
   > rapidly.

   appending to logfiles will result in a lot of movement. with other,
   strictly block-based filesystems this occurs relatively frequent, and data
   will not usually move around. with reiserfs tail movement is frequent.

Right.

   > Wouldn't it make more sense to commit metadata to disk -after- the
   > data blocks are written?

   The problem is that there is currently no easy way to achieve that.

Why not?  (Ignore the disk-caching issue and concentrate on when the
kernel asks for data to be written to the disk.  I am -assuming- that
the kernel either (a) writes the data in the order requested, or at
least (b) once it decides to write anything, keeps sending it to the
disk until its queue is completely empty.)

   > file simply looks like the data was never added.  If the metadata is
   > written -first-, the file can scoop up random trash from elsewhere in

   Also, this is not a matter of metadata first or last. Sometimes you need
   metadata first, sometimes you need it last. And in many cases, "metadata"
   does not need to change, while data still changes.

I'm using "metadata" here as a shorthand for "how the filesystem knows
which byte on disk corresponds to which byte in the file", not just
things like atime, ctime, etc.

   > the filesystem.  I contend that this is -much- worse, because it can
   > render a previously-good file completely unparseable by tools that
   > expect that -all- of the file is in a particular syntax.

   It depends - with ext2 you frequently have garbled files, too. Basically, if
   you write to a file and turn off the power the outcome is unexpected, and
   will always be (unless you are ready to take the big speed hit).

   > Unfortunately, this behavior meant that X -did- fall over, because my
   > XF86Config file was trashed by being scrambled---I'd recently written
   > out a new version, after all---and the trashed copy no longer made any

   But the same thing can and does happen with ext2, depending on your editor
   and your timing. It is not a reiserfs thing.

Well, I've gotten several pieces of private mail from people
complaining that it's happening a lot more with reiserfs.  And
I've never been bitten this way in years of ext2 usage.

   > But if you write the metadata first, you foil this attempt to be safe,
   > because you might have this sequence at the actual disk:  [magnetic
   > oxide updated w/rename][start updating magnetic oxide with tempfile
   > data][power failure or reset]---ooops! original file gone, new file
   > doesn't have its data yet, so sorry, thanks for playing.

   Of course. If you want data to hit the disk, you have to use fsync. This
   does work with reiserfs and will ensure that the data hits the disk. If
   you don't do this then bad things might happen.

It's that I either want the data to hit the disk, or -not- to hit
the disk, but not to partially-update files such that things are
inconsistent even when the disk has been idle for 20 seconds
and the system isn't doing anything else.  It's even worse in
that the filesystem -believes- itself to be accurate, even though
the data it's actually storing is scrambled.

   > By writing metadata first, it seems that reiserfs violates the
   > idempotence of many filesystem operations, and does exactly the
   > opposite of what "journalling" implies to anyone who understands
   > databases, namely that either the operation completes entirely, or it
   > is completely undone.

   You are confusing databases with filesystems, however. Most journaling
   filesystems work that way. Some (like ext3) are nice enough to let you
   choose.

I am deliberately talking about databases, because the terminology and
technology of journalling came from there.  Using the term "journalling"
and then behaving very differently from the way it's used in database
design is misleading at best.  Several people who've written to me
have said they felt "cheated" to discover that reiserfs didn't
actually journal the data or otherwise misbehaved in ways similar
to my problem here.

- - - Separator between forwarded messages - - -

Date: 	Mon, 1 Oct 2001 05:43:08 -0400
From: Chris Siebenmann <cks@utc>
To: foner-reiserfs@med 
Subject: Re: [reiserfs-list] ReiserFS data corruption in very simple   configuration
X-Newsgroups: mail.linux.kernel
Organization: Ziebmef home away from home

You write:
| But it doesn't explain -why- it works this way in the first place.
| Wouldn't it make more sense to commit metadata to disk -after- the
| data blocks are written? [...]

 A vaguely naive viewpoint:

 It depends on what you are maximizing, and it depends on what sort of
tricks you have to play to achieve this. Writing metadata rapidly means
that the filesystem is 'stable' rapidly, and anything that asks for
such stability can be told 'it's okay' similarly rapidly. I don't know
if Reiserfs guarantees that operations like rename() are synchronous
on disk (only returns to user level once the rename has committed), but
if it does, it has a motive for making that as fast as possible.

 Doing this probably creates some interesting ordering dependencies
in extreme cases that are not as simple as 'write data blocks before
writing the pending journal transactions'. Imagine deleting a file and
then immediately wanting to reuse the blocks for new data in another
file -- you must ensure that the data blocks are *not* written before
the delete commits in the journal, so you can't just do 'write all
related data blocks just before a journal commit'.

---
"I shall clasp my hands together and bow to the corners of the world."
           Number Ten Ox, "Bridge of Birds"
cks@utc		   				    utgpu!cks

- - - Separator between forwarded messages - - -

Date: Mon, 1 Oct 2001 12:30:17 +0100
From: "Stephen C. Tweedie" <sct@red>
To: Lenny Foner <foner-reiserfs@med>
Subject: Re: ReiserFS data corruption in very simple configuration
Cc: sct@red, linux-kernel@vge, reiserfs-list@Nam
References: <20010925142854.A5384@red>  <200109290444.AAA19624@out-of-band.media.mit.edu>

Hi,

On Sat, Sep 29, 2001 at 12:44:59AM -0400, Lenny Foner wrote:

>     Not true.  ext2, ext3 in its "data=writeback" mode, and reiserfs can
>     all demonstrate this behaviour.  Reiserfs is being no worse than ext2
>     (the timings may make the race more or less likely in reiserfs, but
>     ext2 _is_ vulnerable.)
> 
> ext2fs can write parts of file A to file B, and vice versa, and this
> isn't fixed by fsck?

No, we're not talking about incorrect writes, but *incomplete* writes,
which is a totally different thing.  An ext2 write of new data
involves many steps: the inode needs to be written to mark the file's
new size, the indirect mapping block[s] may have to be written to
record where the data is, and the data blocks themselves need to be
written. 

Not only that, but a delete also requires multiple writes.  If you
delete a file and rapidly create a new one, then the image of the
filesystem in cache remains totally consistent, but the copy on disk
is updated incrementally and if you crash before the entire image is
updated, you can end up seeing both bits of the old file that was in
the process of being deleted, and the new file that was being created.

In addition, journaling prevents metadata inconsistencies from
occurring due to incomplete writes, but on its own, metadata journaling
doesn't mean that the data blocks are also in sync --- the disk blocks
describing a new file might be on disk, but the data blocks that the
file contains might not be.  Reiserfs, and also ext3 in its fastest
"writeback" mode, both behave like this (but ext3's other modes order
data writes so that this situation never happens: data blocks are
always flushed to disk before the metadata is committed.)

>     e2fsck only restores metadata consistency on ext2 after a crash: it
>     can't possibly guarantee that all the data blocks have been written.
> 
> But what about written to the wrong files?  See below. 

See above.  If all the metadata is intact, how can e2fsck *possibly*
detect whether a data block contains the old or the new contents of
the block?

> Let's take this scenario:  Files A and B have had blocks written to
> them sometime in the recent past (30 to 60 seconds or so) and a sync
> has not happened yet.  (I don't know how often reiserfs will be synced
> by default; 60 seconds?  Longer?  Presumably running "sync" will force
> it, but I don't know when else it will happen.)  File A may have been
> completely rewritten or newly written (e.g., what Emacs does when it
> saves a file), whereas file B may have simply been appended to (e.g.,
> what happens when wtmp is updated).
> 
> The CPU reset button is then pushed.  [See P.P.S. at end of this message.]
> 
> Now, we have the following possibilities for the outcome after the
> system comes back up and has finished checking its filesystem:
> 
> (a) Metadata correctly written, file data correctly written.
> (b) Metadata correctly written, file data partially written.
>     (E.g., one or both files might have been partially or completely
>     updated.) 
> (c) Metadata correctly written, file data completely unwritten.
>     (Neither file got updated at all.)
> (d) Metadata correctly written, FILE DATA INTERCHANGED BETWEEN A AND B.
>     (E.g., File A gets some of file B written somewhere within it,
>     and file B gets some of file A written somewhere within it---this
>     is the behavior I observed, at least twice, with reiserfs.)
> (e) Metadata corrupted in some fashion, file data undefined.
>     ("Undefined" means could be any of (a) through (d) above; I don't care.)
> 
> Now, which filesystems can show each outcome?  I don't know.  I
> contend that reiserfs does (d).  Stephen Tweedie talks above about
> whether we can "guarantee that all the data blocks have been written",
> but may be missing the point I was making, namely that THE BLOCKS HAVE
> BEEN WRITTEN TO THE WRONG FILES. 

For ext3, (d) will never happen in this case.  You can only get
"wrong" data blocks if one of the files is being *deleted*, and its
blocks have been allocated to a new file, and the handover of those
blocks is incomplete at the time of the crash.

ext3 will only give you (a) (both metadata and data correctly written)
or (f) (neither have yet been written at all) if it is running in
ordered or data-journaling mode.  (b) and (c) are possible only if you
are in writeback mode.  (d) and (e) never happen if you're creating
two files, although in writeback mode (d) is possible if, say, you are
deleting A and writing B at the same time (the other ext3 modes
prevent this scenario too.)

Cheers,
 Stephen

- - - Separator between forwarded messages - - -

Date: Mon, 01 Oct 2001 19:27:31 +0400
From: Hans Reiser <reiser@nam>
To: foner-reiserfs@med
Subject: Re: ReiserFS data corruption in very simple configuration
CC: linux-kernel@vge 

This is the meaning of metadata journaling: that writes in progress at the time
of the crash may write garbage, but you won't need to fsck.  You can get this
behaviour with other filesystems like FFS also.  If you cannot accept those
terms of service, you might use ext3 with data journaling on, but then your
performance will be far worse.  It is a tradeoff, not a bug.  Regarding where to
email these types of reiserfs questions, you might email
reiserfs-list@nam with such questions, or try
www.namesys.com/support.html if you want paid support service on it.

Best,

Hans 

foner-reiserfs@med wrote:
> 
> [Please CC me on any replies; I'm not on linux-kernel.]
> 
> The ReiserFS that comes with both Mandrake 7.2 and 8.0 has
> demonstrated a serious data corruption problem, and I'd like
> to know (a) if anyone else has seen this, (b) how to avoid it,
> and (c) how to determine how badly I've been bitten.
> 
> My configuration in each case has been an AMD CPU running ReiserFS
> exactly as configured "out of the box" by running the Mandrake 7.2 or
> 8.0 installation CD and opting to run ReiserFS instead of the default.
> This is a uniprocessor machine with one IDE 80GB Maxtor disk---no RAID
> or anything fancy like that.  The hardware itself is rock solid and
> has never demonstrated any faults at all.  (MDK 8.0 appears to use
> RSFS 3.6.25; I'm no longer running MDK 7.2, so I can't check that.)
> The machine had barely been used before each corruption problem; I'm
> not running some strange root-priv stuff, and each time, the FS hadn't
> had more than a few minutes to a few hours of use since being created.
> 
> In each case, I've gotten in trouble by editing my XF86Config-4 file,
> guessing wrong on a modeline, hanging X (blank gray screen & no
> response to anything), and being forced to hit the reset button
> because nothing else worked.  Under 7.2, I discovered that my
> XF86Config-4 file suddenly had a block of nulls in it.  That time, I
> thought I must have been hallucinating, but I ran a background job to
> sync the filesystem every second while continuing to debug the X
> problems, and didn't see the corruption again.
> 
> Now, I was just bitten by the -same- behavior under MDK 8.0.  After
> accidentally hanging X, I waited a few seconds just in case a sync was
> pending, hit reset, and had all sorts of lossage:
>   (1) Parts of the XF86Config-4 file had lines garbled, e.g.,
>       sections of the file had apparently been rearranged.
>   (2) /var/log/XFree86.0.log was truncated, and maybe garbled.
>   (3) Logging in as root was fine, but then logging in as myself
>       I got "Last login: <4-5 lines of my XFree86.0.log file (!)>"
>       instead of a date!  Logging in again gave me the proper
>       last-login time, but clearly wtmp or something else had
>       gotten stepped on in some weird way.
> Obviously, the behavior I saw once under MDK 7.2 was no hallucination
> or accidental yank in Emacs.
> 
> I thought the whole point of a journalling file system was to
> -prevent- corruption due to an unexpected failure!  This seems to be
> -far- worse than a normal filesystem---ext2fs would at least choke and
> force fsck to be run, which might actually fix the problem, but this
> is ridiculous---it just silently trashes random files.
> 
> So I now have possibly-undetected filesystem damage.  My -guess- is
> that only files written within a few minutes of the reset are likely
> to be affected, but I really don't know, and don't know of a good way
> to find out.  Must I reinstall the OS -again-, starting from a blank
> partition, to be sure?  Maybe I should just give up on ReiserFS completely.
> 
> [If there is a more-appropriate place for me to send this---such as
> a particular Mandrake list, or a particular ReiserFS list---please let
> me know, particularly if I can get a quick answer -without- going
> through the overhead of subscribing to the list, being flooded, and
> unsubscribing---that's what archives are for.  Some websearching
> for "ReiserFS corruption" yielded -thousands- of hits---not a good
> sign---and a very large proportion of them were on this list, so I
> figure this is as good a place to ask as any.  Thanks again.]

- - - Separator between forwarded messages - - -

Date: Wed, 3 Oct 2001 17:17:03 +0100
From: "Stephen C. Tweedie" <sct@red>
To: Hans Reiser <reiser@nam>
Subject: Re: ReiserFS data corruption in very simple configuration
Cc: foner-reiserfs@med, linux-kernel@vge,
       Stephen Tweedie <sct@red>
References: <200109221000.GAA11263@out-of-band.media.mit.edu> <3BB88B63.AEE6EF8E@nam>

Hi,

On Mon, Oct 01, 2001 at 07:27:31PM +0400, Hans Reiser wrote:
> This is the meaning of metadata journaling: that writes in progress at the  time
> of the crash may write garbage, but you won't need to fsck.  You can get this
> behaviour with other filesystems like FFS also.  If you cannot accept those
> terms of service, you might use ext3 with data journaling on, but then your
> performance will be far worse.

ext3 with ordered data writes has performance nearly up to the level
of the fast-and-loose writeback mode for most workloads, and still
avoids ever exposing stale disk blocks after a crash.

Sure, it's a tradeoff, but there are positions between the two
extremes (totally unordered data writes, and totally journaled data
writes) which offer a good compromise here.

Cheers,
Stephen

- - - Separator between forwarded messages - - -

Date: Wed, 03 Oct 2001 17:28:13 +0100
From: Toby Dickenson <tdickenson@dev>
To: pcg@goo
Subject: Re: [reiserfs-list] ReiserFS data corruption in very simple configuration
Cc: foner-reiserfs@med, sct@red, Nikita@nam,
       Mason@Sus, linux-kernel@vge,
       reiserfs-list@nam
Reply-To: tdickenson@gem
References: <20010929145229.C26231@sch>
<200110010100.VAA07189@out-of-band.media.mit.edu> 
<20011001032627.A9991@sch>

>Of course. If you want data to hit the disk, you have to use fsync. This
>does work with reiserfs and will ensure that the data hits the disk. If
>you don't do this then bad things might happen.

This is probably a naive question, but this thread has already proved
me wrong on one naive assumption.....

If the sequence is:
1. append some data to file A
2. fsync(A)
3. append some further data to A
4. some writes to other files
5. power loss

Is it guaranteed that all the data written in step 1 will still be
intact?

The potential problem I can see is that some data from step 1 may have
been written in a tail, the tail moves during step 3, and then the
original tail is overwritten before the new tail (including data from
before the fsync) is safely on disk.

Thanks for your help,

Toby Dickenson
tdickenson@gem

- - - End of forwarded messages - - -
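
The sequence Toby Dickenson asks about in the last message (append, fsync, append again, then lose power) can be written down as a small Python sketch. This only reproduces steps 1 to 3 and shows how one might check afterwards that the fsynced prefix survived; it cannot simulate the power loss itself, and it says nothing about whether any particular reiserfs version honours the guarantee. The function names and the file name "A" are made up for illustration.

```python
import os
import tempfile


def append_and_sync(path, first, second):
    """Append `first`, fsync the file, then append `second` without syncing."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, first)   # step 1: append some data to file A
        os.fsync(fd)          # step 2: fsync(A)
        os.write(fd, second)  # step 3: append further data (never fsynced)
    finally:
        os.close(fd)


def first_part_intact(path, first):
    """Return True if the file still begins with the bytes that were fsynced."""
    with open(path, "rb") as f:
        return f.read(len(first)) == first


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "A")  # hypothetical file name
        append_and_sync(path, b"synced data\n", b"unsynced data\n")
        # After a real crash and reboot, this is the check one would run.
        print(first_part_intact(path, b"synced data\n"))
```

On a file system that keeps the promise, the check after reboot should succeed no matter what happened to the bytes written in step 3.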

An alternative view, by Keith Lofstrom

Data corruption during a power failure is an important issue. The dates in this collection of emails are all from 2001, and no version is mentioned for most of these problems. Reiserfs has gone through many versions and changes, and may or may not suffer from the same problems and bugs. A more up-to-date version of this discussion, and a summary of issues as they affect dirvish data files in a multiple-hard-linked rsync repository, would be helpful.

Personally, I would not run reiserfs as a random access main file system, for some of the reasons buried in that very long message. However, most issues do not apply to rsync-generated data files the way dirvish uses them. The only files likely to be corrupted during a power failure would be part of a failed image, and thus inconsequential. The important thing is protecting file system pointers and metadata, and if I read the above correctly those are properly preserved through failures even for the early (and unnamed) versions of reiserfs that are being castigated here.

There is the worry that reiserfs will stuff a fragment of a new file onto the tail of a much older file, and thus corrupt an existing backup, but I would guess that all those old tails are long since filled up with little directory snippets. An expire may leave a lot of holes, though.

I use reiserfs-3 for my dirvish hard drives because it makes effective use of the disk space. Ext3 is horribly inefficient for that, most particularly because of the way it uses fixed-sized inode tables and wastes space for small files. With an rsync repository, the usual result is a target drive that fills up far too rapidly, maxing out at 70% usage when the inode table fills, or when all those tiny little directories and files that rsync creates chew up the available data space, one whole disk block to store a few dozen bytes. To compensate for that, many dirvish users must do frequent and deep expires, exposing the data structures on the disk to far more write activity than simple accumulative operation.
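
To put rough numbers on the "one whole disk block to store a few dozen bytes" point, here is a minimal Python sketch. The 4096-byte block size and the 48-byte file size are assumptions chosen for illustration, not measurements from a real dirvish bank. The model ignores inodes and directory entries and simply rounds every non-empty file up to whole blocks, which is how a file system without tail packing behaves.

```python
def allocated_bytes(file_size, block_size=4096):
    """Bytes of data blocks consumed when files occupy whole blocks only."""
    if file_size <= 0:
        return 0
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * block_size


def slack_fraction(sizes, block_size=4096):
    """Fraction of allocated data space that holds no file data."""
    used = sum(sizes)
    allocated = sum(allocated_bytes(s, block_size) for s in sizes)
    if allocated == 0:
        return 0.0
    return 1.0 - used / allocated


if __name__ == "__main__":
    # Assumed workload: a thousand tiny files of 48 bytes each.
    tiny_files = [48] * 1000
    print(f"slack: {slack_fraction(tiny_files):.1%}")
```

Under these assumptions about 98.8% of the allocated data space is slack. Real repositories mix tiny files with large ones, so the overall figure will be much lower, but the sketch shows why a drive full of small files fills faster than the sum of the file sizes suggests.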

I typically back up around 100GB of data daily, and get around 150 non-expired images on a 250GB target drive before I retire it. I do not need to do expires. To mitigate the chances of disk failure (it has happened once) I do a rotating swap of 3 drives, so even if I lose the drive in the machine I still have recent backups on two other drives. This is affordable with reiserfs because I can get many more images on a drive. When I used ext3 for target drives, before I switched to reiserfs, I got much less usage out of the target drives.

I also would not use reiserfs on really ancient hard drives. Disk drives have cache buffers, which need to be completely written out to disk, and the heads parked, immediately after a power failure. When newer drives sense a power failure, they use the energy stored in the turning spindle to power the drive long enough to write out the cache buffers and park the heads. Older drives do not do this, and partially written sectors can result, or sectors partly written in an unexpected order. Ext3 is more robust, so it is more likely to tolerate this kind of abuse. Reiser is more likely to corrupt data in these circumstances. So don't use reiserfs on old drives! I would guess that any drive with a capacity greater than 100GB will write out its entire cache properly before halting.

So it is a tradeoff: both ext3 and reiserfs have problems, and I find the problems with reiserfs less troublesome than those with ext3. Other dirvish users will disagree, and so the best thing for everyone is to report your empirical experience with file systems, used as dirvish banks, to the mailing list and to the wiki.

Ideally, someone will invent a file system that allocates disk space efficiently like reiserfs, without inode limitations and wasted space, and that also treats data more securely during power failures like ext3. I hope some of you are on the lookout for this.

Keith Lofstrom 2006 Sept 8