Reiser File System Considered Harmful
Reprint from http://foner.www.media.mit.edu/people/foner/Sys/reiserfs-considered-harmful.txt
DO NOT USE REISERFS IF YOU VALUE DATA INTEGRITY.
Here's a piece of mail I wrote summarizing why not:
- I have an 80K file of why Reiserfs is evil and should not be used. Really, I should stick it on the web. Hans told me personally (online, so I have it quoted) that RSFS is optimized for speed, -not- correctness. (This started when I discovered that having the machine reset or lose power without syncing the FS would put random bits of open files in each other, so, e.g., my wtmp had four lines of the XFree86-config file I was editing at the time, and X wouldn't start 'cause the config file had a piece of yet a -third- file smashed into the middle of it. Reiser said that this is -correct- behavior---who cares if the -data- is wrong if the filesystem -structure- is correct?) -I-, on the other hand, say that a filesystem that corrupts data and then -claims to be 100% intact- (e.g., fsck says it's fine) is no file system at all, since there's no way to even -tell- things are trashed without checking every file with your backups. I threw RSFS out the window, am using ext3fs, and never looked back.
...and here's the documentation I refer to above. Note that the email addresses have been sanitized as a spam-prevention measure, and some of the messages have been omitted altogether if I wasn't sure they were public.
[LATE BREAKING NEWS: See also http://zork.net/~nick/mail/why-reiserfs-is-teh-sukc for even more goodies on why you shouldn't ever be trusting your data to ReiserFS...]
- - - Begin forwarded messages - - -
Date: Sat, 22 Sep 2001 06:00:43 -0400 (EDT)
From: foner-reiserfs@med
To: linux-kernel@vge
Subject: ReiserFS data corruption in very simple configuration

[Please CC me on any replies; I'm not on linux-kernel.]

The ReiserFS that comes with both Mandrake 7.2 and 8.0 has demonstrated a serious data corruption problem, and I'd like to know (a) if anyone else has seen this, (b) how to avoid it, and (c) how to determine how badly I've been bitten.

My configuration in each case has been an AMD CPU running ReiserFS exactly as configured "out of the box" by running the Mandrake 7.2 or 8.0 installation CD and opting to run ReiserFS instead of the default. This is a uniprocessor machine with one IDE 80GB Maxtor disk---no RAID or anything fancy like that. The hardware itself is rock solid and has never demonstrated any faults at all. (MDK 8.0 appears to use RSFS 3.6.25; I'm no longer running MDK 7.2, so I can't check that.) The machine had barely been used before each corruption problem; I'm not running some strange root-priv stuff, and each time, the FS hadn't had more than a few minutes to a few hours of use since being created.

In each case, I've gotten in trouble by editing my XF86Config-4 file, guessing wrong on a modeline, hanging X (blank gray screen & no response to anything), and being forced to hit the reset button because nothing else worked. Under 7.2, I discovered that my XF86Config-4 file suddenly had a block of nulls in it. That time, I thought I must have been hallucinating, but I ran a background job to sync the filesystem every second while continuing to debug the X problems, and didn't see the corruption again.

Now, I was just bitten by the -same- behavior under MDK 8.0. After accidentally hanging X, I waited a few seconds just in case a sync was pending, hit reset, and had all sorts of lossage:
(1) Parts of the XF86Conf-4 file had lines garbled, e.g., sections of the file had apparently been rearranged.
(2) /var/log/XFree86.0.log was truncated, and maybe garbled.
(3) Logging in as root was fine, but then logging in as myself I got "Last login: <4-5 lines of my XFree86.0.log file (!)>" instead of a date! Logging in again gave me the proper last-login time, but clearly wtmp or something else had gotten stepped on in some weird way.
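
[Editorial sketch, not part of the original mail.] The stopgap mentioned in this message, a background job that syncs the filesystem every second, could look roughly like the following in Python. The function name and its parameters are made up for illustration; os.sync() is the standard-library wrapper around sync(2) and is only available on Unix. This only narrows the window in which a reset can catch unwritten data; it does not change what the filesystem itself guarantees.

```python
import os
import time


def sync_periodically(interval_seconds=1.0, iterations=None):
    """Call os.sync() repeatedly, sleeping interval_seconds between calls.

    Runs forever when iterations is None (the background-job case);
    otherwise stops after that many syncs and returns the count.
    Hypothetical helper, named here only for illustration.
    """
    done = 0
    while iterations is None or done < iterations:
        os.sync()  # ask the kernel to flush dirty buffers to disk
        done += 1
        time.sleep(interval_seconds)
    return done
```

Calling sync_periodically(1.0) from a small script left running in another terminal would approximate the once-per-second job; passing iterations bounds the loop for a quick trial.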
- - - Separator between forwarded messages - - -
Date: Sat, 22 Sep 2001 16:47:31 +0400
From: Nikita Danilov <Nikita@Nam>
To: foner-reiserfs@med
Subject: Re: ReiserFS data corruption in very simple configuration
Cc: linux-kernel@vge, Reiserfs mail-list <Reiserfs-List@Nam>
References: <200109221000.GAA11263@out-of-band.media.mit.edu>

foner-reiserfs@med writes:
> [Please CC me on any replies; I'm not on linux-kernel.] > > The ReiserFS that comes with both Mandrake 7.2 and 8.0 has > demonstrated a serious data corruption problem, and I'd like > to know (a) if anyone else has seen this, (b) how to avoid it, > and (c) how to determine how badly I've been bitten. > > My configuration in each case has been an AMD CPU running ReiserFS > exactly as configured "out of the box" by running the Mandrake 7.2 or > 8.0 installation CD and opting to run ReiserFS instead of the default. > This is a uniprocessor machine with one IDE 80GB Maxtor disk---no RAID > or anything fancy like that. The hardware itself is rock solid and > has never demonstrated any faults at all. (MDK 8.0 appears to use > RSFS 3.6.25; I'm not longer running MDK 7.2, so I can't check that.) > The machine had barely been used before each corruption problem; I'm > not running some strange root-priv stuff, and each time, the FS hadn't > had more than a few minutes to a few hours of use since being created. > > In each case, I've gotten in trouble by editing my XF86Config-4 file, > guessing wrong on a modeline, hanging X (blank gray screen & no > response to anything), and being forced to hit the reset button > because nothing else worked. Under 7.2, I discovered that my > XF86Config-4 file suddenly had a block of nulls in it. That time, I > thought I must have been hallucinating, but I ran a background job to > sync the filesystem every second while continuing to debug the X > problems, and didn't see the corruption again. > > Now, I was just bitten by the -same- behavior under MDK 8.0. After > accidentally hanging X, I waited a few seconds just in case a sync was > pending, hit reset, and had all sorts of lossage: > (1) Parts of the XF86Conf-4 file had lines garbled, e.g., > sections of the file had apparently been rearranged. > (2) /var/log/XFree86.0.log was truncated, and maybe garbled. 
> (2) Logging in as root was fine, but then logging in as myself > I got "Last login: <4-5 lines of my XFree86.0.log file (!)>" > instead of a date! Logging in again gave me the proper > last-login time, but clearly wtmp or something else had > gotten stepped on in some weird way. > Obviously, the behavior I saw once under MDK 7.2 was no hallucination > or accidental yank in Emacs. > > I thought the whole point of a journalling file system was to > -prevent- corruption due to an unexpected failure! This seems to be > -far- worse than a normal filesystem---ext2fs would at least choke and > force fsck to be run, which might actually fix the problem, but this > is ridiculous---it just silently trashes random files. Stock reiserfs only provides meta-data journalling. It guarantees that structure of you file-system will be correct after journal replay, not content of a files. It will never "trash" file that wasn't accessed at the moment of crash, though. Full data-journaling comes at cost. There
is patch by Chris Mason <Mason@Sus> to support data journaling in reiserfs. Ext3 supports it also.
> > So I now have possibly-undetected filesystem damage. My -guess- is > that only files written within a few minutes of the reset are likely > to be affected, but I really don't know, and don't know of a good way > to find out. Must I reinstall the OS -again-, starting from a blank > partition, to be sure? Maybe I should just give up on ReiserFS completely. > > [If there is a more-appropriate place for me to send this---such as > a particular Mandrake list, or a particular ReiserFS list---please let > me know, particularly if I can get a quick answer -without- going
Reiserfs mail-list <Reiserfs-List@Nam>, archive at http://marc.theaimsgroup.com/?l=reiserfs&r=1&w=2
> through the overhead of subscribing to the list, being flooded, and > unsubscribing---that's what archives are for. Some websearching > for "ReiserFS corruption" yielded -thousands- of hits---not a good > sign---and a very large proportion of them were on this list, so I > figure this is as good a place to ask as any. Thanks again.] Nikita.
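
[Editorial sketch, not part of the original mail.] Question (c) above, how to tell how badly one has been bitten, has no complete answer in this thread. A crude first pass, assuming the guess that only files written shortly before the reset are at risk, is to list regular files modified in a window before the crash and flag the ones containing a long run of NUL bytes (the symptom seen under Mandrake 7.2). Everything here is an assumption made for illustration: the function name, the 5-minute window, and the 64-byte threshold. Modification times may themselves be unreliable after a crash, binary files legitimately contain NULs, and stale non-NUL data from another file will not be flagged at all.

```python
import os
import stat


def find_suspect_files(root, crash_time, window_seconds=300, min_nul_run=64):
    """Return (path, has_nul_run) for regular files under root whose mtime
    falls within window_seconds before crash_time (seconds since the epoch).

    Hypothetical helper; reads each candidate file fully into memory,
    which is acceptable for config and log files but not for huge ones.
    """
    nul_run = b"\x00" * min_nul_run
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
                if not stat.S_ISREG(st.st_mode):
                    continue  # skip symlinks, FIFOs, device nodes
                if not (crash_time - window_seconds <= st.st_mtime <= crash_time):
                    continue
                with open(path, "rb") as fh:
                    data = fh.read()
            except OSError:
                continue  # unreadable or vanished; ignore in this sketch
            suspects.append((path, nul_run in data))
    return suspects
```

A file that shows up in the list without a NUL run is not thereby known to be good; it is merely a candidate to compare against a backup.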
- - - Separator between forwarded messages - - -
Date: Sat, 22 Sep 2001 16:44:21 -0400 (EDT)
From: foner-reiserfs@med
To: Nikita@Nam
Subject: ReiserFS data corruption in very simple configuration
CC: linux-kernel@vge, Reiserfs-List@Nam

    Date: Sat, 22 Sep 2001 16:47:31 +0400
    From: Nikita Danilov <Nikita@Nam>

    Stock reiserfs only provides meta-data journalling. It guarantees that structure of you file-system will be correct after journal replay, not content of a files. It will never "trash" file that wasn't accessed at the moment of crash, though.

Thanks for clarifying this. However, I should point out that the failure mode is quite serious---whereas ext2fs would simply fail to record data written to a file before a sync, reiserfs seems to have instead -swapped random pieces of one file with another-, which is -much- harder to detect and fix. I can live with uncommitted changes, but it's hard to justify the behavior I saw---it means that any even slightly-busy machine that experiences a crash could have dozens or hundreds of files with each others' contents all mixed together---remember, parts of my XF86Config file wound up in wtmp! And both XF86Config and wtmp had been written at least 20 seconds before I had to push the reset button, and perhaps > 30 seconds; I don't recall how often the FS is syncing by default, but it's disturbing behavior. After all, at the time I pushed reset, I had -no- files actually being written by any process under my direct control; I'd merely written one file out from Emacs under a minute earlier. I'd hate to think of what would happen if this sort of thing hit a CVS repository.

This seems to outweigh the convenience of a rapid start after failure (one of the reasons I decided to try reiserfs in the first place), because a failure means possibly having to recover an entire file server from backups (hence losing -lots more- data) because you don't know which files might have been trashed if the machine loses power or the kernel panics. There's no simple test ("did my edits make it into the file?"), and no way to really know if the machine might later misbehave because critical files have been overwritten with parts of others. (This inability to easily figure out what might have been affected also means that the damage will rapidly propagate to backups, hence making the backups useless.) About the only way around it would seem to be to checksum every file in the FS at regular intervals, and rechecksum after a crash---at which point, what's the point of not having to run fsck?

Is this -really- how reiserfs is supposed to behave in a crash? It seems like this should be prominently documented in the description of the file system---I know that I'm rather nervous about using it if this is true, since it turns a few minutes of fsck'ing (for ext2fs) into a restore-the-whole-file-system exercise instead. Surely that's not right. If this is really supposed to be how reiserfs behaves any time it doesn't get to sync before a machine dies on it, I can't see how it can be justified for any production use, and I'll probably have to reinstall my OS using ext2fs instead.
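
[Editorial sketch, not part of the original mail.] The workaround floated above, checksumming every file at regular intervals and again after a crash, might be sketched as follows. The function names and the choice of SHA-256 are assumptions made for illustration, not anything the thread prescribes. The manifest would need to be stored somewhere the crash cannot touch, and a changed checksum cannot by itself tell a legitimate edit from corruption, which is much of the author's point.

```python
import hashlib
import os
import stat


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each regular file's path (relative to root) to its digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if stat.S_ISREG(os.lstat(path).st_mode):
                    manifest[os.path.relpath(path, root)] = file_digest(path)
            except OSError:
                continue  # unreadable or vanished; skipped in this sketch
    return manifest


def compare_manifests(before, after):
    """Return sorted lists (changed, missing, added) between two manifests."""
    changed = sorted(p for p in before if p in after and before[p] != after[p])
    missing = sorted(p for p in before if p not in after)
    added = sorted(p for p in after if p not in before)
    return changed, missing, added
```

Running build_manifest before and after a crash and feeding both results to compare_manifests yields the list of files worth checking against backups.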

    Full data-journaling comes at cost. There is patch by Chris Mason <Mason@Sus> to support data journaling in reiserfs. Ext3 supports it also.

Do you have a URL for this? A search for reiserfs and mason yields 12,000 hits. (I'm particularly interested in one for reiserfs 3.6.25 and Mandrake 8.0, but I assume there may be several variants in the same repository.)

> So I now have possibly-undetected filesystem damage. My -guess- is > that only files written within a few minutes of the reset are likely > to be affected, but I really don't know, and don't know of a good way > to find out. Must I reinstall the OS -again-, starting from a blank > partition, to be sure? Maybe I should just give up on ReiserFS completely. > > [If there is a more-appropriate place for me to send this---such as > a particular Mandrake list, or a particular ReiserFS list---please let > me know, particularly if I can get a quick answer -without- going
Reiserfs mail-list <Reiserfs-List@Nam>, archive at http://marc.theaimsgroup.com/?l=reiserfs&r=1&w=2 Thanks. I saw that list before, and I'm glad that you've included it in this discussion.
- - - Separator between forwarded messages - - -
Date: Sat, 22 Sep 2001 14:02:40 -0700
From: Andrew Morton <akpm@zip>
To: foner-reiserfs@med
Subject: Re: ReiserFS data corruption in very simple configuration
Sender: akpm@vas
References: <15276.34915.301069.643178@bet> (message from Nikita Danilov on Sat, 22 Sep 2001 16:47:31 +0400) <200109222044.QAA11638@out-of-band.media.mit.edu>

The default journalling mode for ext3 will write data before committing metadata. So this will never happen with ext3. Guaranteed.
- - - Separator between forwarded messages - - -
Date: Sat, 22 Sep 2001 18:07:10 -0400 (EDT)
From: Lenny Foner <foner@med>
To: akpm@zip
Subject: ReiserFS data corruption in very simple configuration

    Date: Sat, 22 Sep 2001 14:02:40 -0700
    From: Andrew Morton <akpm@zip>

    The default journalling mode for ext3 will write data before committing metadata. So this will never happen with ext3. Guaranteed.

That's good to know. (How robust is this against the sort of caching that typically goes on in disk drives, btw?)

I haven't checked---is this the default FS for MDK 8.0 if I haven't selected reiserfs? (My use of the term "ext2fs" was because I haven't really kept up in the FS arena, so I'm probably somewhat out-of-date.)

How mature is ext3fs in general?

I'm seriously considering punting reiserfs if I don't get a good answer to the questions I put to the lists, especially considering the number of hits I'm getting to data-corruption queries on search engines. They seem to paint a picture of much less product maturity than the reiserfs authors do themselves.
- - - Separator between forwarded messages - - -
Date: Sat, 22 Sep 2001 15:47:00 -0700
From: Andrew Morton <akpm@zip>
To: Lenny Foner <foner@med>
Subject: Re: ReiserFS data corruption in very simple configuration
Sender: akpm@vas
References: <3BACFC70.73EDBD39@zip> (message from Andrew Morton on Sat, 22 Sep 2001 14:02:40 -0700) <200109222207.SAA11674@out-of-band.media.mit.edu>

Lenny Foner wrote:
>
> Date: Sat, 22 Sep 2001 14:02:40 -0700
> From: Andrew Morton <akpm@zip>
>
> The default journalling mode for ext3 will write data before
> committing metadata. So this will never happen with ext3. Guaranteed.
>
> That's good to know. (How robust is this against the sort of caching
> that typically goes on in disk drives, btw?)

Much-discussed point. It's write-reordering that could bring ext3 unstuck. The time window and set of circumstances is really remote though. For starters, the drive would have to decide, when presented with a linear sequence of blocks, to write the higher-numbered ones first. Nobody has been able to demonstrate a problem yet, to my knowledge.

> I haven't checked---is this the default FS for MDK 8.0 if I haven't
> selected reiserfs? (My use of the term "ext2fs" was because I haven't
> really kept up in the FS arena, so I'm probably somewhat out-of-date.)

Mandrake are shipping ext3. I don't know if it's the default, like redhat.

> How mature is ext3fs in general?

Not very mature at all, but it's undergone a heap of testing. I'm quite confident in it.

> I'm seriously considering punting reiserfs if I don't get a good
> answer to the questions I put to the lists, especially considering the
> number of hits I'm getting to data-corruption queries on search
> engines. They seem to paint a picture of much less product maturity
> than the reiserfs authors do themselves.

As Hans says, "that is the nature of metadata-only journalling". Same goes for JFS and, in theory, XFS. Although XFS is said to get it mostly-right as a side-effect of something else (not sure what). But even XFS does have data corruption problems across recovery.

All of this is precisely why I started the 2.4 kernel port of ext3 - I need it for an embedded "appliance" product, and people don't expect an appliance to shit itself if they simply turn it off...
- - - Separator between forwarded messages - - -
- Date: Sun, 23 Sep 2001 00:06:26 -0400 (EDT)
From: Lenny Foner <foner@med> To: akpm@zip Subject: ReiserFS data corruption in very simple configuration Date: Sat, 22 Sep 2001 15:47:00 -0700
From: Andrew Morton <akpm@zip> Lenny Foner wrote:
> > Date: Sat, 22 Sep 2001 14:02:40 -0700 > From: Andrew Morton <akpm@zip> > > The default journalling mode for ext3 will write data before > committing metadata. So this will never happen with ext3. Guaranteed. > > That's good to know. (How robust is this against the sort of caching > that typically goes on in disk drives, btw?) Much-discussed point. I'll bet. It's write-reordering that could bring ext3 unstuck. The time window and set of circumstances is really remote though. For starters, the drive would have to decide, when presented with a linear sequence of blocks, to write the higher-numbered ones first. Nobody has been able to demonstrate a problem yet, to my knowledge. I can see it deciding to do that if the head happened to be near there already, but I'm obviously making an assumption about how block numbers are laid out on the disk. But you see my point.
> I haven't checked---is this the default FS for MDK 8.0 if I haven't > selected reiserfs? (My use of the term "ext2fs" was because I haven't > really kept up in the FS arena, so I'm probably somewhat out-of-date.) Mandrake are shipping ext3. I don't know if it's the default, like redhat. It's the default for RH? Interesting. I'll have to check my MDK.
> How mature is ext3fs in general? Not very mature at all, but it's undergone a heap of testing. I'm quite confident in it. And I suppose RH must be too, if it's their default, and presumably that's the majority of the testing base? (I'll do some websearching
for mandrake & ext3fs and see what I find.)
> I'm seriously considering punting reiserfs if I don't get a good > answer to the questions I put to the lists, especially considering the > number of hits I'm getting to data-corruption queries on search > engines. They seem to paint a picture of much less product maturity > than the reiserfs authors do themselves. As Hans says, "that is the nature of metadata-only journalling". Same goes for JFS and, in theory, XFS. Although XFS is said to get it mostly-right as a side-effect of something else (not sure what). But even XFS does have data corruption problems across recovery. Sounds completely wrong compared to what a FS is supposed to do, though. Missing data is fine, but corrupting what's left just seems like it makes a bad situation worse. People really put up with this in production use? How? All of this is precisely why I started the 2.4 kernel port of ext3 - I need it for an embedded "applicance" product, and people don't expect an applicance to shit itself if they simply turn it off...
Aaaahhhh... I see. Yeah, the fact that my TiVo doesn't have an off switch really gives me the screaming fantods; there's just -no- way to tell it, "I'm about to unplug you now." But OTOH, the way they do it is to have pretty much everything but /var and the MFS (the FS they store the video in) mounted ro, and they've actually got two copies of /, /var, and /usr (hence six partitions for those, plus extra partitions for MFS and various other things; something like 14+ in total), so if one comes up corrupted, it uses the other and then repairs the corrupted one; the boot prom keeps track of which one is current, I think. (Dunno what happens if it can't repair it; presumably it either copies one to the other or calls for help over the phone; it's not clear to me without rereading its source code.)
- - - Separator between forwarded messages - - -
- Date: Sat, 22 Sep 2001 21:18:59 -0700
From: Andrew Morton <akpm@zip> To: Lenny Foner <foner@med> Subject: Re: ReiserFS data corruption in very simple configuration Sender: akpm@vas References: <3BAD14E4.CC3544B9@zip> (message from Andrew Morton on Sat, 22 Sep 2001 15:47:00 -0700) <200109230406.AAA11988@out-of-band.media.mit.edu> Lenny Foner wrote:
> > Date: Sat, 22 Sep 2001 15:47:00 -0700 > From: Andrew Morton <akpm@zip> > > Lenny Foner wrote: > > > > Date: Sat, 22 Sep 2001 14:02:40 -0700 > > From: Andrew Morton <akpm@zip> > > > > The default journalling mode for ext3 will write data before > > committing metadata. So this will never happen with ext3. Guaranteed. > > > > That's good to know. (How robust is this against the sort of caching > > that typically goes on in disk drives, btw?) > > Much-discussed point. > > I'll bet. > > It's write-reordering that could bring ext3 unstuck. The > time window and set of circumstances is really remote though. For > starters, the drive would have to decide, when presented with > a linear sequence of blocks, to write the higher-numbered ones > first. Nobody has been able to demonstrate a problem yet, > to my knowledge. > > I can see it deciding to do that if the head happened to be near there > already, but I'm obviously making an assumption about how block > numbers are laid out on the disk. But you see my point. I do. The time window is small though - you need to get an out-of-order write like this AND pull the plug between the two blocks.
> > I haven't checked---is this the default FS for MDK 8.0 if I haven't > > selected reiserfs? (My use of the term "ext2fs" was because I haven't > > really kept up in the FS arena, so I'm probably somewhat out-of-date.) > > Mandrake are shipping ext3. I don't know if it's the default, > like redhat. > > It's the default for RH? Interesting. I'll have to check my MDK. > > > How mature is ext3fs in general? > > Not very mature at all, but it's undergone a heap of testing. > I'm quite confident in it. > > And I suppose RH must be too, if it's their default, and presumably > that's the majority of the testing base? (I'll do some websearching > for mandrake & ext3fs and see what I find.)
It went through RedHat QA OK. Better than reiserfs...
> > I'm seriously considering punting reiserfs if I don't get a good > > answer to the questions I put to the lists, especially considering the > > number of hits I'm getting to data-corruption queries on search > > engines. They seem to paint a picture of much less product maturity > > than the reiserfs authors do themselves. > > As Hans says, "that is the nature of metadata-only journalling". Same > goes for JFS and, in theory, XFS. Although XFS is said to get it > mostly-right as a side-effect of something else (not sure what). But even > XFS does have data corruption problems across recovery. > > Sounds completely wrong compared to what a FS is supposed to do, > though. Missing data is fine, but corrupting what's left just > seems like it makes a bad situation worse. People really put up > with this in production use? How? > > All of this is precisely why I started the 2.4 kernel port of > ext3 - I need it for an embedded "applicance" product, and > people don't expect an applicance to shit itself if they simply > turn it off... > > Aaaahhhh... I see. Yeah, the fact that my TiVo doesn't have an off > switch really gives me the screaming fantods; there's just -no- way to > tell it, "I'm about to unplug you now." But OTOH, the way they do it > is to have pretty much everything but /var and the MFS (the FS they > store the video in) mounted ro, and they've actually got two copies of > /, /var, and /usr (hence six partitions for those, plus extra > partitions for MFS and various other things; something like 14+ in > total), so if one comes up corrupted, it uses the other and then > repairs the corrupted one; the boot prom keeps track of which one is > current, I think. (Dunno what happens if it can't repair it; > presumably it either copies one to the other or calls for help over > the phone; it's not clear to me without rereading its source code.)
hum. Where does one find the TiVo source?
- - - Separator between forwarded messages - - -
Date: Tue, 25 Sep 2001 14:28:54 +0100
From: "Stephen C. Tweedie" <sct@red>
To: foner-reiserfs@med
Subject: Re: ReiserFS data corruption in very simple configuration
Cc: Nikita@Nam, Stephen Tweedie <sct@red>, linux-kernel@vge
References: <15276.34915.301069.643178@bet> <200109222044.QAA11638@out-of-band.media.mit.edu>

Hi,

On Sat, Sep 22, 2001 at 04:44:21PM -0400, foner-reiserfs@med wrote:
> Stock reiserfs only provides meta-data journalling. It guarantees that
> structure of you file-system will be correct after journal replay, not
> content of a files. It will never "trash" file that wasn't accessed at
> the moment of crash, though.
>
> Thanks for clarifying this. However, I should point out that the
> failure mode is quite serious---whereas ext2fs would simply fail
> to record data written to a file before a sync, reiserfs seems to
> have instead -swapped random pieces of one file with another-,
> which is -much- harder to detect and fix.

Not true. ext2, ext3 in its "data=writeback" mode, and reiserfs can all demonstrate this behaviour. Reiserfs is being no worse than ext2 (the timings may make the race more or less likely in reiserfs, but ext2 _is_ vulnerable.)

e2fsck only restores metadata consistency on ext2 after a crash: it can't possibly guarantee that all the data blocks have been written.

ext3 will let you do full data journaling, but also has a third mode (the default), which doesn't journal data, but which does make sure that data is flushed to disk before the transaction which allocated that data is allowed to commit. That gives you most of the performance of ext3's fast-and-loose writeback mode, but with an absolute guarantee that you never see stale blocks in a file after a crash.

Cheers,
Stephen
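
[Editorial sketch, not part of the original mail.] The difference between a writeback-style ordering and the ordered mode described above can be shown with a toy model. This is not how any kernel implements journalling; the one-block "disk", the file names, and the function are invented for illustration. The model assumes a free block that still holds bytes from a previously deleted file, and a crash that lands between the two steps of an append. In the writeback case nothing forces data to reach the disk before the metadata commit, and the sketch deliberately picks the unlucky ordering.

```python
def simulate_append(mode, crash_before_second_step=True):
    """Toy model: append one block to file A, optionally crashing halfway.

    Returns what a reader of file A sees after reboot, as a list of
    block contents. Hypothetical illustration only.
    """
    # Block 0 is free, but still holds bytes from a file deleted earlier.
    disk = {0: "old contents of deleted file B"}
    metadata = {"A": []}  # committed list of blocks belonging to file A
    new_data = "new contents of file A"

    def write_data():
        disk[0] = new_data

    def commit_metadata():
        metadata["A"] = [0]

    if mode == "writeback":
        # No ordering is enforced; assume metadata happens to land first.
        steps = [commit_metadata, write_data]
    elif mode == "ordered":
        # Data must be on disk before the allocating transaction commits.
        steps = [write_data, commit_metadata]
    else:
        raise ValueError("unknown mode: %r" % (mode,))

    steps[0]()
    if not crash_before_second_step:
        steps[1]()
    return [disk[block] for block in metadata["A"]]
```

With a crash, the writeback ordering leaves file A pointing at the stale contents of the deleted file, while the ordered case leaves file A without the new block at all: the write is lost, but no foreign data appears. Without a crash both orderings give the same result.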
- - - Separator between forwarded messages - - -
Date: Thu, 27 Sep 2001 09:56:36 +0200
From: Milos Prudek <prudek@nem>
To: foner-reiserfs@med
Subject: Re: [reiserfs-list] ReiserFS data corruption in very simple configuration
References: <200109222044.QAA11638@out-of-band.media.mit.edu>

> Stock reiserfs only provides meta-data journalling. It guarantees that
> structure of you file-system will be correct after journal replay, not
> content of a files. It will never "trash" file that wasn't accessed at
> the moment of crash, though.
>
> Thanks for clarifying this. However, I should point out that the
> failure mode is quite serious---whereas ext2fs would simply fail

Hey, that was good. Finally someone with enough courage to tell the reiserfs developers about it, clearly but not in an insulting way.

When I was hit by reiserfs data trashing due to lack of data journaling, I felt CHEATED. The reiserfs hype led me to believe what you believed.

I'm desperate for Chris Mason's patch.

--
Milos Prudek
- - - Separator between forwarded messages - - -
- Date: Sat, 29 Sep 2001 00:44:59 -0400 (EDT)
From: Lenny Foner <foner-reiserfs@med> To: sct@red Subject: ReiserFS data corruption in very simple configuration CC: Nikita@Nam, Mason@Sus CC: linux-kernel@vge, reiserfs-list@Nam [As before, please make sure you CC me on replies or I won't see them. Tnx!]
- Date: Tue, 25 Sep 2001 14:28:54 +0100
From: "Stephen C. Tweedie" <sct@red> Hi, On Sat, Sep 22, 2001 at 04:44:21PM -0400, foner-reiserfs@med wrote:
> Stock reiserfs only provides meta-data journalling. It guarantees that > structure of you file-system will be correct after journal replay, not > content of a files. It will never "trash" file that wasn't accessed at > the moment of crash, though. > > Thanks for clarifying this. However, I should point out that the > failure mode is quite serious---whereas ext2fs would simply fail > to record data written to a file before a sync, reiserfs seems to > have instead -swapped random pieces of one file with another-, > which is -much- harder to detect and fix. Not true. ext2, ext3 in its "data=writeback" mode, and reiserfs can all demonstrate this behaviour. Reiserfs is being no worse than ext2 (the timings may make the race more or less likely in reiserfs, but ext2 _is_ vulnerable.)
- e2fsck only restores metadata consistency on ext2 after a crash: it can't possibly guarantee that all the data blocks have been written.
- ext3 will let you do full data journaling, but also has a third mode (the default), which doesn't journal data, but which does make sure that data is flushed to disk before the transaction which allocated that data is allowed to commit. That gives you most of the performance of ext3's fast-and-loose writeback mode, but with an absolute guarantee that you never see stale blocks in a file after a crash.
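(a) Metadata correctly written, file data correctly written.
(b) Metadata correctly written, file data partially written. (E.g., one or both files might have been partially or completely updated.)
(c) Metadata correctly written, file data completely unwritten. (Neither file got updated at all.)
(d) Metadata correctly written, FILE DATA INTERCHANGED BETWEEN A AND B. (E.g., File A gets some of file B written somewhere within it, and file B gets some of file A written somewhere within it---this is the behavior I observed, at least twice, with reiserfs.)
(e) Metadata corrupted in some fashion, file data undefined. ("Undefined" means could be any of (a) through (d) above; I don't care.)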
reiserfs written by Chris Mason <Mason@Sus>, but has not responded with a URL to it; can someone (or Chris? now CC'ed) do so? A search for reiserfs and mason is useless, yielding 12,000 hits. (I'm particularly interested in one for reiserfs 3.6.25 and Mandrake 8.0, but I assume there may be several variants in the same repository.) Benchmarking data on the performance impact of data journalling for reiserfs, ext3fs, and anything else anyone cares to supply would probably be useful to lots of people as well.

P.P.S. I say reset and not power-off, although I hope that this is moot, because I presume that the unsynced data, by virtue of being unsynced, is nowhere near the disk datapaths anyway. But either way, a reset should let the disks continue to write data out of their write buffers, assuming that a CPU reset doesn't flush such pending transactions; I don't know if there's some IDE bus sequence that can do this, and whether CPU reset would issue such a sequence. It may not matter; is it common that disks might leave data buffered but unwritten for 30 seconds if there is no other disk activity? I would hope that this is -not- true and that the buffered data is buffered only while there is other activity, since failing to flush the buffer when the disk is idle only increases the risk of losing it without improving performance at all.
- - - Separator between forwarded messages - - -
Date: Sat, 29 Sep 2001 14:52:29 +0200
From: pcg@goo (Marc A. Lehmann)
To: Lenny Foner <foner-reiserfs@med>
Subject: Re: [reiserfs-list] ReiserFS data corruption in very simple configuration
Cc: sct@red, Nikita@Nam, Mason@Sus, linux-kernel@vge, reiserfs-list@Nam
Mail-Followup-To: Lenny Foner <foner-reiserfs@med>, sct@red, Nikita@Nam, Mason@Sus, linux-kernel@vge, reiserfs-list@Nam
References: <20010925142854.A5384@red> <200109290444.AAA19624@out-of-band.media.mit.edu>
X-Operating-System: Linux version 2.4.8-ac8 (root@cer) (gcc version 3.0.1)

On Sat, Sep 29, 2001 at 12:44:59AM -0400, Lenny Foner <foner-reiserfs@med> wrote:
> isn't fixed by fsck? [See outcome (d) below.] I'm having difficulty
> believing how this can be possible for a non-journalling filesystem.

If you have difficulties in believing this, may I ask you how you think it is possible for a non-journaling filesystem to prevent this at all?

> But what about written to the wrong files? See below.

What you see is most probably old data, not data from another (still existing) file.

> has not happened yet. (I don't know how often reiserfs will be synced
> by default; 60 seconds? Longer? Presumably running "sync" will force

mostly like with any other filesystem (man bdflush)

> Now, we have the following possibilities for the outcome after the
> (a) Metadata correctly written, file data correctly written.
all filesystems
> (b) Metadata correctly written, file data partially written.
> (E.g., one or both files might have been partially or completely
> updated.)
ext2, reiserfs.
> (c) Metadata correctly written, file data completely unwritten.
> (Neither file got updated at all.)
ext2, reiserfs.
> (d) Metadata correctly written, FILE DATA INTERCHANGED BETWEEN A AND B.
this shouldn't happen on reiserfs. however, the unwritten parts of file a can easily contain data formerly in file b.
> (e) Metadata corrupted in some fashion, file data undefined.
> ("Undefined" means could be any of (a) through (d) above; I don't care.)
this should be prevented by journaling (of course, this won't help against harddisk failures) on reiserfs. ext2 often has this problem, but fsck usually can repair it. it's easy to tell metadata from filedata on ext2.
> whether we can "guarantee that all the data blocks have been written",
> but may be missing the point I was making, namely that THE BLOCKS HAVE
> BEEN WRITTEN TO THE WRONG FILES.
remember that the blocks have previous content, and reiserfs' tails optimization means that files appended all the time (wtmp) can move around rapidly (at least their tail).
> P.P.S. I say reset and not power-off, although I hope that this is
> moot, because I presume that the unsynced data, by virtue of being
> unsynced, is nowhere near the disk datapaths anyway.
this can make a big difference. many disks (ibm, maxtor) nowadays write partial blocks on power outage, this gives "Uncorrectable read errors", which is fatal, because no filesystem so far can work around this. It's easy to repair (just rewrite the block), but would require filesystem feedback (hey, reiserfs, this metadata block is garbage).
> a reset should let the disks continue to write data out of their write
> buffers, assuming that a CPU reset doesn't flush such pending
they should, yes. OTOH, ide disks are cheap...
> not matter; is it common that disks might leave data buffered but
> unwritten for 30 seconds if there is no other disk activity? I would
no. and it doesn't make sense. but it's not forbidden or sth.
--
Marc Lehmann, pcg@goo, XX11-RIPE
The choice of a GNU generation
- - - Separator between forwarded messages - - -
- Date: Sun, 30 Sep 2001 21:00:49 -0400 (EDT) From: foner-reiserfs@med To: pcg@goo Subject: [reiserfs-list] ReiserFS data corruption in very simple configuration CC: sct@red, Nikita@Nam, Mason@Sus,
- linux-kernel@vge, reiserfs-list@Nam
- Date: Sat, 29 Sep 2001 14:52:29 +0200
From: <pcg@goo ( Marc) (A.) (Lehmann )>
On Sat, Sep 29, 2001 at 12:44:59AM -0400, Lenny Foner <foner-reiserfs@med> wrote:
> isn't fixed by fsck? [See outcome (d) below.] I'm having difficulty
> believing how this can be possible for a non-journalling filesystem.
If you have difficulties in believing this, may I ask you how you think it is possible for a non-journaling filesystem to prevent this at all?
(<60 second old) data into different files than the data was destined for. (I suppose the assumption I'm making here is that, when creating or extending a file, the metadata is written -last-, e.g., file blocks are allocated, file data is written, and -then- metadata is written. That way, a failure anywhere before finality simply seems to vanish, whereas writing metadata first seems to cause the lossage below.)
> But what about written to the wrong files? See below.
What you see is most probably *old* data, not data from another (still existing) file.

I'm... dubious, but maybe. As mentioned earlier in this thread, one of the failures I saw consisted of having several lines of my XFree86.0.log file appended to wtmp---when I logged in after the failure, I got "Last login: " followed by several lines from that file instead of a date. (Other failures scrambled other files worse.)
Now, it's -possible- that rsfs allocated an extra portion to the end of wtmp for the last-login data (as a user of the fs, I don't care whether officially this was a "block", an entry in a journal, etc), login "wrote" to that region (but it wasn't committed yet 'cause no sync), my XFree86.0.log file was "created" and "written" (again uncommitted), I pushed reset, and then when it came back up, the end of wtmp had data from the -previous- copy of XFree86.0.log that had been freed (because it was unlinked when the next copy was written) but which had not actually had the wtmp data written to it yet (because a sync hadn't happened). I have no way to verify this, since one XFree86.0.log looks much like the other.
Conceptually, this would imply that wtmp was extended into disk freespace, which just happened to have that logfile in it (instead of zero bytes). Is this what you're talking about when you say "*old* data"? I think so, and that seems to match your comment below about file tails moving around rapidly.
But it doesn't explain -why- it works this way in the first place. Wouldn't it make more sense to commit metadata to disk -after- the data blocks are written? After all, if -either one- isn't written, the file is incomplete. But if the metadata is written -last-, the file simply looks like the data was never added. If the metadata is written -first-, the file can scoop up random trash from elsewhere in the filesystem.
I contend that this is -much- worse, because it can render a previously-good file completely unparseable by tools that expect that -all- of the file is in a particular syntax. It's just an accident, I guess, that login will accept any random trash when it prints its "last-login" message, rather than falling over with a coredump because it doesn't look like a date. [And see * below.] Unfortunately, this behavior meant that X -did- fall over, because my XF86Config file was trashed by being scrambled---I'd recently written out a new version, after all---and the trashed copy no longer made any sense. I would have been -much- happier to have had the -unmodified-, -old- version than a scrambled "new" version! Without Emacs ~ files, this would have been much worse. Consider an app that, "for reliability", rewrites a file by creating a temp copy, writing it out, then renaming the temp over the original [this is how Emacs typically saves files]. But if you write the metadata first, you foil this attempt to be safe, because you might have this sequence at the actual disk: [magnetic oxide updated w/rename][start updating magnetic oxide with tempfile data][power failure or reset]---ooops! original file gone, new file doesn't have its data yet, so sorry, thanks for playing. By writing metadata first, it seems that reiserfs violates the idempotence of many filesystem operations, and does exactly the opposite of what "journalling" implies to anyone who understands databases, namely that either the operation completes entirely, or it is completely undone. Yes, yes, I know (now!) that it claims to only journal the metadata, but how does this help when what it's essentially doing is trashing the -data- in unexpected ways exactly when such journalling is supposed to help, namely across a machine failure? This seems like such an elementary design defect that I'm at a loss to understand why it's there. There -must- be some excellent reason, right? But what? And if not, can it be fixed? 
I'm also still waiting to find out how to make reiserfs actually journal its data, and what the performance implications of this are. No one has responded with a URL.
[*] It's also a security hole. If I want to read a file that I'm not authorized to read, -but- I can cause a kernel panic (or a blackout!), then I can craftily wait until up to several seconds after the "secure" file is being rewritten (presumably via the write-tempfile-and-relink method), create a big file of my own, and force the panic---my file may then get some of the secure blocks from the old copy. And, unlike filesystems that write metadata last, the "secure" program can't just zero out the blocks of the file it's about to unlink, because---since metadata is written first---those zeroes won't have made it to disk yet even though the blocks have been declared free and included in my file. I now know what's in your file. Whoops. And this is such an enormous timing hole that I can write a program that just checks every 5 seconds or so for a new copy of the secure file, -then- forces the failure---I need not get the timing very good, as long as it's likely that I'll do so before the next sync. It's so bad that, even if I can't force a panic, my program can just beep and I'll immediately go short out the outlet that happens to be on the same circuit as the machine I'm attacking.
- [ . . . ]
> (d) Metadata correctly written, FILE DATA INTERCHANGED BETWEEN A AND B.
this shouldn't happen on reiserfs. however, the unwritten parts of file a can easily contain data formerly in file b.

Then why allow metadata to be written first instead of last?
> (e) Metadata corrupted in some fashion, file data undefined.
> ("Undefined" means could be any of (a) through (d) above; I don't care.)
this should be prevented by journaling (of course, this won't help against harddisk failures) on reiserfs. ext2 often has this problem, but fsck usually can repair it. it's easy to tell metadata from filedata on ext2.
> whether we can "guarantee that all the data blocks have been written",
> but may be missing the point I was making, namely that THE BLOCKS HAVE
> BEEN WRITTEN TO THE WRONG FILES.
remember that the blocks have previous content, and reiserfs' tails optimization means that files appended all the time (wtmp) can move around rapidly (at least their tail).
[ . . . ]
- - - Separator between forwarded messages - - -
- Date: Mon, 1 Oct 2001 03:26:27 +0200
From: <pcg@goo ( Marc) (A.) (Lehmann )> To: foner-reiserfs@med Subject: Re: [reiserfs-list] ReiserFS data corruption in very simple configuration Cc: sct@red, Nikita@Nam, Mason@Sus,
- linux-kernel@vge, reiserfs-list@Nam
- Nikita@Nam, Mason@Sus, linux-kernel@vge, reiserfs-list@Nam
References: <20010929145229.C26231@sch> <200110010100.VAA07189@out-of-band.media.mit.edu> X-Operating-System: Linux version 2.4.8-ac9 (root@cer) (gcc version 3.0.1) On Sun, Sep 30, 2001 at 09:00:49PM -0400, foner-reiserfs@med wrote:
> extending a file, the metadata is written -last-, e.g., file blocks
> are allocated, file data is written, and -then- metadata is written.
this is almost impossible to achieve with existing hardware (witness the many discussions about disk caching for example), and, without journaling, might even be slow.
> of wtmp had data from the -previous- copy of XFree86.0.log that had
> been freed (because it was unlinked when the next copy was written)
> but which had not actually had the wtmp data written to it yet
It's easily possible, but it could also be a bug. Let the reiserfs authors decide. However, if it is indeed "a bug" then fixing it would only lower the frequency of occurrence. Only ext3 (some modes) + turning off your harddisk's cache can ensure this, at the moment.
> to have that logfile in it (instead of zero bytes). Is this what
> you're talking about when you say "*old* data"? I think so, and that
> seems to match your comment below about file tails moving around
> rapidly.
appending to logfiles will result in a lot of movement. with other, strictly block-based filesystems this occurs relatively frequent, and data will not usually move around. with reiserfs tail movement is frequent.
> Wouldn't it make more sense to commit metadata to disk -after- the
> data blocks are written?
The problem is that there is currently no easy way to achieve that.
> file simply looks like the data was never added. If the metadata is
> written -first-, the file can scoop up random trash from elsewhere in
Also, this is not a matter of metadata first or last. Sometimes you need metadata first, sometimes you need it last. And in many cases, "metadata" does not need to change, while data still changes.
> the filesystem. I contend that this is -much- worse, because it can
> render a previously-good file completely unparseable by tools that
> expect that -all- of the file is in a particular syntax.
It depends - with ext2 you frequently have garbled files, too. Basically, if you write to a file and turn off the power the outcome is unexpected, and will always be (unless you are ready to take the big speed hit).
> Unfortunately, this behavior meant that X -did- fall over, because my
> XF86Config file was trashed by being scrambled---I'd recently written
> out a new version, after all---and the trashed copy no longer made any
But the same thing can and does happen with ext2, depending on your editor and your timing. It is not a reiserfs thing.
> But if you write the metadata first, you foil this attempt to be safe,
> because you might have this sequence at the actual disk: [magnetic
> oxide updated w/rename][start updating magnetic oxide with tempfile
> data][power failure or reset]---ooops! original file gone, new file
> doesn't have its data yet, so sorry, thanks for playing.
Of course. If you want data to hit the disk, you have to use fsync. This does work with reiserfs and will ensure that the data hits the disk. If you don't do this then bad things might happen.
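The fsync advice above, applied to the write-a-tempfile-then-rename save pattern discussed in this thread, can be sketched roughly as follows. This is only an illustration in Python, assuming a POSIX system; the function name atomic_write is hypothetical, and whether the filesystem and the disk's own cache actually honor the ordering is exactly what this thread is arguing about.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data`, syncing the new contents before the rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's userspace buffer to the kernel
            os.fsync(tmp.fileno())  # ask the kernel to write the file data to the disk
        os.replace(tmp_path, path)  # rename over the original only after the fsync
    except BaseException:
        # Remove the temp file if anything failed before the rename completed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the containing directory too, so the rename itself is requested to disk.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Note that mkstemp creates the file with owner-only permissions, so a real editor would also have to copy the original file's mode and ownership.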
> By writing metadata first, it seems that reiserfs violates the
> idempotence of many filesystem operations, and does exactly the
> opposite of what "journalling" implies to anyone who understands
> databases, namely that either the operation completes entirely, or it
> is completely undone.
You are confusing databases with filesystems, however. Most journaling filesystems work that way. Some (like ext3) are nice enough to let you choose.
> journal the metadata, but how does this help when what it's essentially
> doing is trashing the -data- in unexpected ways exactly when such
> journalling is supposed to help, namely across a machine failure?
But ext2 works in the same way. It does happen more often with reiserfs (especially with tails), but ignoring the problem for ext2 doesn't make it right. If applications don't work reliably with reiserfs, they don't work reliably with ext2. If you want reliability then mount synchronous.
> This seems like such an elementary design defect that I'm at a loss
> to understand why it's there.
About every filesystem does have this "elementary design defect". If you want data to hit the disk, sync it. It's that simple.
> There -must- be some excellent reason,
> right? But what? And if not, can it be fixed?
Speed is an excellent reason. The fix is to tell the kernel to write the data out to the platters. Anyway, this is a good time to review the various discussions on the reiserfs list and the kernel list on how to teach the kernel (if it is possible) to implement loose write-ordering.
--
Marc Lehmann, pcg@goo, XX11-RIPE
The choice of a GNU generation
- - - Separator between forwarded messages - - -
- Date: Sun, 30 Sep 2001 22:32:47 -0400 (EDT) From: foner-reiserfs@med To: pcg@goo Subject: [reiserfs-list] ReiserFS data corruption in very simple configuration CC: sct@red, Nikita@Nam, Mason@Sus,
- linux-kernel@vge, reiserfs-list@Nam
From: <pcg@goo ( Marc) (A.) (Lehmann )> On Sun, Sep 30, 2001 at 09:00:49PM -0400, foner-reiserfs@med wrote:
> extending a file, the metadata is written -last-, e.g., file blocks
> are allocated, file data is written, and -then- metadata is written.
this is almost impossible to achieve with existing hardware (witness the many discussions about disk caching for example), and, without journaling, might even be slow.

I think perhaps we may be talking past each other; let me try to clarify. As I said earlier in this thread, this has nothing at all to do with disk caching.
Let me restate this again: The scenario I'm discussing is an otherwise-idle machine that had 2 (maybe 3) files modified, sat idle for 30-60 seconds, and then had the reset button pushed. I would expect that either file data and metadata got written, or neither got written, but not metadata without file data. This is repeatable more or less at will---I didn't -just- happen to catch it -just- as it decided to frob the disks.
Instead, the problem seems to be that reiserfs is perfectly happy to update the on-disk representation of which disk blocks contain which files' data, and then -sit there- for a long time (a minute? longer?) without -also- attempting to flush the file data to the disk. This then leads to corrupted files after the reset. It's not that the CPU sent data to the disk subsystem that failed to be written by the time of the interruption; it's that the data was still sitting in RAM and the CPU hadn't even decided to get it out the IDE channel yet.
This means that there is -always- a giant timing hole which can corrupt data, as opposed to just the much-tinier hole that would be created if the file-bytes-to-disk-bytes correspondence were updated immediately after the write that wrote the data---it would be hard for me to accidentally hit such a hole.
> of wtmp had data from the -previous- copy of XFree86.0.log that had
> been freed (because it was unlinked when the next copy was written)
> but which had not actually had the wtmp data written to it yet
It's easily possible, but it could also be a bug. Let the reiserfs authors decide. However, if it is indeed "a bug" then fixing it would only lower the frequency of occurrence.

True, but as long as it makes it only happen if the disk is -in progress of writing stuff- when the reset or power failure happens, the risk is -greatly- reduced. Right now, it's an enormous timing hole, and one that's likely to be hit---it's happened to me -every single time- I've had to hit the reset button because (for example) I wedged X while debugging, and even if I waited a minute after the wedge-up to do so! The way I've avoided it is by running a job that syncs once a second while doing debugging that might possibly make me unable to take the machine down cleanly. This is a disgusting and unreliable kluge.

Only ext3 (some modes) + turning off your harddisk's cache can ensure this, at the moment.

Or ext3 (some modes) + assuming that the disk will at least write data that's been sent to it, even if the CPU gets reset. (I know it's hopeless if power fails, but that can be made arbitrarily unlikely, compared to a kernel panic or having to do a CPU reset.)
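A job of the kind described above, one that syncs once a second, might look something like this minimal Python sketch. The function name sync_periodically and its parameters are made up for illustration; os.sync() is the standard-library wrapper around the sync() system call and is available on Unix only. It narrows the window in which unsynced data can be lost but does not close it.

```python
import os
import time


def sync_periodically(interval_seconds=1.0, iterations=None):
    """Call sync() every `interval_seconds`; loop forever if `iterations` is None."""
    count = 0
    while iterations is None or count < iterations:
        os.sync()  # ask the kernel to write all dirty filesystem buffers to disk
        count += 1
        time.sleep(interval_seconds)
    return count
```

Calling sync_periodically() with no arguments in a background shell would approximate the workaround; the iterations parameter exists only so the loop can be bounded when trying it out.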
> to have that logfile in it (instead of zero bytes). Is this what
> you're talking about when you say "*old* data"? I think so, and that
> seems to match your comment below about file tails moving around
> rapidly.
appending to logfiles will result in a lot of movement. with other, strictly block-based filesystems this occurs relatively frequent, and data will not usually move around. with reiserfs tail movement is frequent.

Right.
> Wouldn't it make more sense to commit metadata to disk -after- the
> data blocks are written?
The problem is that there is currently no easy way to achieve that.

Why not? (Ignore the disk-caching issue and concentrate on when the kernel asks for data to be written to the disk. I am -assuming- that the kernel either (a) writes the data in the order requested, or at least (b) once it decides to write anything, keeps sending it to the disk until its queue is completely empty.)
> file simply looks like the data was never added. If the metadata is
> written -first-, the file can scoop up random trash from elsewhere in
Also, this is not a matter of metadata first or last. Sometimes you need metadata first, sometimes you need it last. And in many cases, "metadata" does not need to change, while data still changes.

I'm using "metadata" here as a shorthand for "how the filesystem knows which byte on disk corresponds to which byte in the file", not just things like atime, ctime, etc.
> the filesystem. I contend that this is -much- worse, because it can
> render a previously-good file completely unparseable by tools that
> expect that -all- of the file is in a particular syntax.
It depends - with ext2 you frequently have garbled files, too. Basically, if you write to a file and turn off the power the outcome is unexpected, and will always be (unless you are ready to take the big speed hit).
> Unfortunately, this behavior meant that X -did- fall over, because my
> XF86Config file was trashed by being scrambled---I'd recently written
> out a new version, after all---and the trashed copy no longer made any
But the same thing can and does happen with ext2, depending on your editor and your timing. It is not a reiserfs thing.

Well, I've gotten several pieces of private mail from people complaining that it's happening a lot more with reiserfs. And I've never been bitten this way in years of ext2 usage.
> But if you write the metadata first, you foil this attempt to be safe,
> because you might have this sequence at the actual disk: [magnetic
> oxide updated w/rename][start updating magnetic oxide with tempfile
> data][power failure or reset]---ooops! original file gone, new file
> doesn't have its data yet, so sorry, thanks for playing.
Of course. If you want data to hit the disk, you have to use fsync. This does work with reiserfs and will ensure that the data hits the disk. If you don't do this then bad things might happen.

It's that I either want the data to hit the disk, or -not- to hit the disk, but not to partially-update files such that things are inconsistent even when the disk has been idle for 20 seconds and the system isn't doing anything else. It's even worse in that the filesystem -believes- itself to be accurate, even though the data it's actually storing is scrambled.
> By writing metadata first, it seems that reiserfs violates the
> idempotence of many filesystem operations, and does exactly the
> opposite of what "journalling" implies to anyone who understands
> databases, namely that either the operation completes entirely, or it
> is completely undone.
You are confusing databases with filesystems, however. Most journaling filesystems work that way. Some (like ext3) are nice enough to let you choose.

I am deliberately talking about databases, because the terminology and technology of journalling came from there. Using the term "journalling" and then behaving very differently from the way it's used in database design is misleading at best. Several people who've written to me have said they felt "cheated" to discover that reiserfs didn't actually journal the data or otherwise misbehaved in ways similar to my problem here.
- - - Separator between forwarded messages - - -
- Date: Mon, 1 Oct 2001 05:43:08 -0400
From: Chris Siebenmann <cks@utc> To: foner-reiserfs@med Subject: Re: [reiserfs-list] ReiserFS data corruption in very simple configuration X-Newsgroups: mail.linux.kernel Organization: Ziebmef home away from home You write: | But it doesn't explain -why- it works this way in the first place. | Wouldn't it make more sense to commit metadata to disk -after- the | data blocks are written? [...]
- A vaguely naive viewpoint: It depends on what you are maximizing, and it depends on what sort of
- Doing this probably creates some interesting ordering dependencies
file -- you must ensure that the data blocks are not written before the delete commits in the journal, so you can't just do 'write all related data blocks just before a journal commit'.
---
"I shall clasp my hands together and bow to the corners of the world."
- Number Ten Ox, "Bridge of Birds"
- - - Separator between forwarded messages - - -
- Date: Mon, 1 Oct 2001 12:30:17 +0100
From: "Stephen C. Tweedie" <sct@red> To: Lenny Foner <foner-reiserfs@med> Subject: Re: ReiserFS data corruption in very simple configuration Cc: sct@red, linux-kernel@vge, reiserfs-list@Nam References: <20010925142854.A5384@red> <200109290444.AAA19624@out-of-band.media.mit.edu> Hi, On Sat, Sep 29, 2001 at 12:44:59AM -0400, Lenny Foner wrote:
> Not true. ext2, ext3 in its "data=writeback" mode, and reiserfs can
> all demonstrate this behaviour. Reiserfs is being no worse than ext2
> (the timings may make the race more or less likely in reiserfs, but
> ext2 _is_ vulnerable.)
>
> ext2fs can write parts of file A to file B, and vice versa, and this
> isn't fixed by fsck?
No, we're not talking about incorrect writes, but incomplete writes, which is a totally different thing. An ext2 write of new data involves many steps: the inode needs to be written to mark the file's new size, the indirect mapping block[s] may have to be written to record where the data is, and the data blocks themselves need to be written.
Not only that, but a delete also requires multiple writes. If you delete a file and rapidly create a new one, then the image of the filesystem in cache remains totally consistent, but the copy on disk is updated incrementally and if you crash before the entire image is updated, you can end up seeing both bits of the old file that was in the process of being deleted, and the new file that was being created.
In addition, journaling prevents metadata inconsistencies from occurring due to incomplete writes, but on its own, metadata journaling doesn't mean that the data blocks are also in sync --- the disk blocks describing a new file might be on disk, but the data blocks that the file contains might not be. Reiserfs, and also ext3 in its fastest "writeback" mode, both behave like this (but ext3's other modes order data writes so that this situation never happens: data blocks are always flushed to disk before the metadata is committed.)
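The difference between committing the block mapping before the data and flushing the data first can be illustrated with a toy model. The Python sketch below is not how reiserfs or ext3 is implemented; the dictionaries, block numbers, file contents, and the function name crash_after are all invented for illustration. It only shows why a crash between the two writes exposes stale block contents in one ordering and not in the other.

```python
# Toy model: "blocks" is what the platters hold, "files" is the on-disk
# metadata mapping each file name to its list of block numbers.
DISK = {
    "blocks": {3: b"login1\n", 7: b"old log line\n"},  # block 7 is free but still holds stale data
    "files": {"wtmp": [3]},
}

# Appending "login2\n" to wtmp takes two writes: the data block and the block map.
METADATA_FIRST = [("meta", "wtmp", [3, 7]), ("data", 7, b"login2\n")]
DATA_FIRST = [("data", 7, b"login2\n"), ("meta", "wtmp", [3, 7])]


def crash_after(ops, steps, disk):
    """Apply only the first `steps` writes to a copy of `disk`, then read every file back."""
    blocks = dict(disk["blocks"])
    files = {name: list(blks) for name, blks in disk["files"].items()}
    for kind, target, value in ops[:steps]:
        if kind == "data":
            blocks[target] = value   # a data block reached the disk
        else:
            files[target] = value    # the file's block list reached the disk
    return {name: b"".join(blocks[b] for b in blks) for name, blks in files.items()}


# Reset after one of the two writes:
print(crash_after(METADATA_FIRST, 1, DISK)["wtmp"])  # wtmp now ends with the stale block
print(crash_after(DATA_FIRST, 1, DISK)["wtmp"])      # wtmp looks as if nothing was appended
```

In the data-first ordering the interrupted append simply vanishes, which is the behaviour attributed above to ext3's ordered mode.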
> e2fsck only restores metadata consistency on ext2 after a crash: it
> can't possibly guarantee that all the data blocks have been written.
>
> But what about written to the wrong files? See below.
See above. If all the metadata is intact, how can e2fsck possibly detect whether a data block contains the old or the new contents of the block?
> Let's take this scenario: Files A and B have had blocks written to
> them sometime in the recent past (30 to 60 seconds or so) and a sync
> has not happened yet. (I don't know how often reiserfs will be synced
> by default; 60 seconds? Longer? Presumably running "sync" will force
> it, but I don't know when else it will happen.) File A may have been
> completely rewritten or newly written (e.g., what Emacs does when it
> saves a file), whereas file B may have simply been appended to (e.g.,
> what happens when wtmp is updated).
>
> The CPU reset button is then pushed. [See P.P.S. at end of this message.]
>
> Now, we have the following possibilities for the outcome after the
> system comes back up and has finished checking its filesystem:
>
> (a) Metadata correctly written, file data correctly written.
> (b) Metadata correctly written, file data partially written.
> (E.g., one or both files might have been partially or completely
> updated.)
> (c) Metadata correctly written, file data completely unwritten.
> (Neither file got updated at all.)
> (d) Metadata correctly written, FILE DATA INTERCHANGED BETWEEN A AND B.
> (E.g., File A gets some of file B written somewhere within it,
> and file B gets some of file A written somewhere within it---this
> is the behavior I observed, at least twice, with reiserfs.)
> (e) Metadata corrupted in some fashion, file data undefined.
> ("Undefined" means could be any of (a) through (d) above; I don't care.)
>
> Now, which filesystems can show each outcome? I don't know. I
> contend that reiserfs does (d). Stephen Tweedie talks above about
> whether we can "guarantee that all the data blocks have been written",
> but may be missing the point I was making, namely that THE BLOCKS HAVE
> BEEN WRITTEN TO THE WRONG FILES.
For ext3, (d) will never happen in this case. You can only get "wrong" data blocks if one of the files is being deleted, and its blocks have been allocated to a new file, and the handover of those blocks is incomplete at the time of the crash.
ext3 will only give you (a) (both metadata and data correctly written) or (f) (neither have yet been written at all) if it is running in ordered or data-journaling mode. (b) and (c) are possible only if you are in writeback mode. (d) and (e) never happen if you're creating two files, although in writeback mode (d) is possible if, say, you are deleting A and writing B at the same time (the other ext3 modes prevent this scenario too.)
Cheers,
- Stephen
- - - Separator between forwarded messages - - -
- Date: Mon, 01 Oct 2001 19:27:31 +0400
From: Hans Reiser <reiser@nam> To: foner-reiserfs@med Subject: Re: ReiserFS data corruption in very simple configuration CC: linux-kernel@vge
This is the meaning of metadata journaling: that writes in progress at the time of the crash may write garbage, but you won't need to fsck. You can get this behaviour with other filesystems like FFS also. If you cannot accept those terms of service, you might use ext3 with data journaling on, but then your performance will be far worse. It is a tradeoff, not a bug.
Regarding where to email these types of reiserfs questions, you might email reiserfs-list@nam with such questions, or try www.namesys.com/support.html if you want paid support service on it.
Best, Hans
foner-reiserfs@med wrote:
> > [Please CC me on any replies; I'm not on linux-kernel.] > > The ReiserFS that comes with both Mandrake 7.2 and 8.0 has > demonstrated a serious data corruption problem, and I'd like > to know (a) if anyone else has seen this, (b) how to avoid it, > and (c) how to determine how badly I've been bitten. > > My configuration in each case has been an AMD CPU running ReiserFS > exactly as configured "out of the box" by running the Mandrake 7.2 or > 8.0 installation CD and opting to run ReiserFS instead of the default. > This is a uniprocessor machine with one IDE 80GB Maxtor disk---no RAID > or anything fancy like that. The hardware itself is rock solid and > has never demonstrated any faults at all. (MDK 8.0 appears to use > RSFS 3.6.25; I'm not longer running MDK 7.2, so I can't check that.) > The machine had barely been used before each corruption problem; I'm > not running some strange root-priv stuff, and each time, the FS hadn't > had more than a few minutes to a few hours of use since being created. > > In each case, I've gotten in trouble by editing my XF86Config-4 file, > guessing wrong on a modeline, hanging X (blank gray screen & no > response to anything), and being forced to hit the reset button > because nothing else worked. Under 7.2, I discovered that my > XF86Config-4 file suddenly had a block of nulls in it. That time, I > thought I must have been hallucinating, but I ran a background job to > sync the filesystem every second while continuing to debug the X > problems, and didn't see the corruption again. > > Now, I was just bitten by the -same- behavior under MDK 8.0. After > accidentally hanging X, I waited a few seconds just in case a sync was > pending, hit reset, and had all sorts of lossage: > (1) Parts of the XF86Conf-4 file had lines garbled, e.g., > sections of the file had apparently been rearranged. > (2) /var/log/XFree86.0.log was truncated, and maybe garbled. 
> (2) Logging in as root was fine, but then logging in as myself > I got "Last login: <4-5 lines of my XFree86.0.log file (!)>" > instead of a date! Logging in again gave me the proper > last-login time, but clearly wtmp or something else had > gotten stepped on in some weird way. > Obviously, the behavior I saw once under MDK 7.2 was no hallucination > or accidental yank in Emacs. > > I thought the whole point of a journalling file system was to > -prevent- corruption due to an unexpected failure! This seems to be > -far- worse than a normal filesystem---ext2fs would at least choke and > force fsck to be run, which might actually fix the problem, but this > is ridiculous---it just silently trashes random files. > > So I now have possibly-undetected filesystem damage. My -guess- is > that only files written within a few minutes of the reset are likely > to be affected, but I really don't know, and don't know of a good way > to find out. Must I reinstall the OS -again-, starting from a blank > partition, to be sure? Maybe I should just give up on ReiserFS completely. > > [If there is a more-appropriate place for me to send this---such as > a particular Mandrake list, or a particular ReiserFS list---please let > me know, particularly if I can get a quick answer -without- going > through the overhead of subscribing to the list, being flooded, and > unsubscribing---that's what archives are for. Some websearching > for "ReiserFS corruption" yielded -thousands- of hits---not a good > sign---and a very large proportion of them were on this list, so I > figure this is as good a place to ask as any. Thanks again.] > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vge > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/
- - - Separator between forwarded messages - - -
- Date: Wed, 3 Oct 2001 17:17:03 +0100
From: "Stephen C. Tweedie" <sct@red>
To: Hans Reiser <reiser@nam>
Subject: Re: ReiserFS data corruption in very simple configuration
Cc: foner-reiserfs@med, linux-kernel@vge, Stephen Tweedie <sct@red>
References: <200109221000.GAA11263@out-of-band.media.mit.edu> <3BB88B63.AEE6EF8E@nam>
Hi,
On Mon, Oct 01, 2001 at 07:27:31PM +0400, Hans Reiser wrote:
> This is the meaning of metadata journaling: that writes in progress at the time
> of the crash may write garbage, but you won't need to fsck. You can get this
> behaviour with other filesystems like FFS also. If you cannot accept those
> terms of service, you might use ext3 with data journaling on, but then your
> performance will be far worse.
ext3 with ordered data writes has performance nearly up to the level of the fast-and-loose writeback mode for most workloads, and still avoids ever exposing stale disk blocks after a crash.
Sure, it's a tradeoff, but there are positions between the two extremes (totally unordered data writes, and totally journaled data writes) which offer a good compromise here.
Cheers,
Stephen
- - - Separator between forwarded messages - - -
- Date: Wed, 03 Oct 2001 17:28:13 +0100
From: Toby Dickenson <tdickenson@dev>
To: pcg@goo
Subject: Re: [reiserfs-list] ReiserFS data corruption in very simple configuration
Cc: foner-reiserfs@med, sct@red, Nikita@nam, Mason@Sus, linux-kernel@vge, reiserfs-list@nam
Reply-To: tdickenson@gem
References: <20010929145229.C26231@sch> <200110010100.VAA07189@out-of-band.media.mit.edu> <20011001032627.A9991@sch>
>Of course. If you want data to hit the disk, you have to use fsync. This
>does work with reiserfs and will ensure that the data hits the disk. If
>you don't do this then bad things might happen.
This is probably a naive question, but this thread has already proved me wrong on one naive assumption..... If the sequence is:
- append some data to file A
- fsync(A)
- append some further data to A
- some writes to other files
- power loss
Is it guaranteed that all the data written in step 1 will still be intact? The potential problem I can see is that some data from step 1 may have been written in a tail, the tail moves during step 3, and then the original tail is overwritten before the new tail (including data from before the fsync) is safely on disk.
Thanks for your help,
Toby Dickenson tdickenson@gem
- - - End of forwarded messages - - -
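[Editorial sketch, not part of the original thread.] The write sequence in Toby Dickenson's question can be written out roughly as follows in Python. The function names `append_durably` and `append_unsynced` are hypothetical, chosen only for this illustration. `os.fsync` asks the kernel to flush the file to the storage device; whether data synced in step 2 survives the later unsynced append on a given ReiserFS version is exactly the open question in the message, and this sketch does not answer it.

```python
import os
import tempfile


def append_durably(path: str, data: bytes) -> None:
    """Steps 1 and 2: append data, then fsync before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push this file's data to the device.
        os.fsync(fd)
    finally:
        os.close(fd)


def append_unsynced(path: str, data: bytes) -> None:
    """Step 3: append more data without fsync; it may sit in cache."""
    with open(path, "ab") as f:
        f.write(data)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        file_a = os.path.join(tmp, "A")
        append_durably(file_a, b"data from step 1\n")   # steps 1-2
        append_unsynced(file_a, b"data from step 3\n")  # step 3
        # Steps 4-5 (writes to other files, power loss) cannot be
        # simulated from user space; this only shows the call order.
        with open(file_a, "rb") as f:
            print(f.read().decode())
```

After a crash, the most an application can hope for is that everything written before the `os.fsync` call is intact; anything appended afterwards carries no such expectation.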
=== An alternative view, by Keith Lofstrom
Data corruption during a power failure is an important issue. The dates in this collection of emails are all from 2001, and no version is mentioned for most of these problems. Reiserfs has gone through many versions and changes, and may or may not suffer from the same problems and bugs. A more up-to-date version of this discussion, and a summary of issues as they affect dirvish data files in a multiple-hard-linked rsync repository, would be helpful.
Personally, I would not run reiserfs as a random access main file system, for some of the reasons buried in that very long message. However, most issues do not apply to rsync-generated data files the way dirvish uses them. The only files likely to be corrupted during a power failure would be part of a failed image, and thus inconsequential. The important thing is protecting file system pointers and metadata, and if I read the above correctly those are properly preserved through failures even for the early (and unnamed) versions of reiserfs that are being castigated here.
There is the worry that reiserfs will stuff a fragment of a new file onto the tail of a much older file, and thus corrupt an existing backup, but I would guess that all those old tails are long since filled up with little directory snippets. An expire may leave a lot of holes, though.
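[Editorial sketch.] One way to test the worry above empirically, rather than guess, is to record a checksum for every file in an image when it is created and compare again later. The Python below is a minimal sketch under that assumption; `build_manifest` and `changed_or_missing` are hypothetical names and not part of dirvish or rsync. It can only detect changes made after the first manifest was recorded, and a read served from the page cache may not reflect what is actually on disk.

```python
import hashlib
import os
import tempfile


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file's contents, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root: str) -> dict:
    """Map each regular file's path (relative to root) to its digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and special files; hash regular files only.
            if os.path.isfile(full) and not os.path.islink(full):
                manifest[os.path.relpath(full, root)] = file_digest(full)
    return manifest


def changed_or_missing(old: dict, new: dict) -> list:
    """Paths whose digest differs from, or is absent in, the new manifest."""
    return sorted(p for p, digest in old.items() if new.get(p) != digest)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as image:
        with open(os.path.join(image, "a.txt"), "wb") as f:
            f.write(b"original contents")
        before = build_manifest(image)
        with open(os.path.join(image, "a.txt"), "wb") as f:
            f.write(b"silently altered")
        print(changed_or_missing(before, build_manifest(image)))
```

In a hard-linked repository the same inode is hashed once per image that links to it, so a real tool would probably cache digests by inode number; that refinement is omitted here.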
I use reiserfs-3 for my dirvish hard drives because it makes effective use of the disk space. Ext3 is horribly inefficient for that, most particularly because of the way it uses fixed-sized inode tables and wastes space for small files. With an rsync repository, the usual result is a target drive that fills up far too rapidly, maxing out at 70% usage when the inode table fills, or when all those tiny little directories and files that rsync creates chew up the available data space, one whole disk block to store a few dozen bytes. To compensate for that, many dirvish users must do frequent and deep expires, exposing the data structures on the disk to far more write activity than simple accumulative operation.
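[Editorial sketch.] The "one whole disk block to store a few dozen bytes" arithmetic can be made concrete with a short Python calculation. This is a simplified model, not a measurement: the 4096-byte block size is an assumption, `allocated_bytes` and `slack_report` are hypothetical names, and inode and directory overhead are ignored entirely.

```python
def allocated_bytes(size: int, block_size: int = 4096) -> int:
    """Space consumed when a file must occupy whole blocks (no tail packing)."""
    if size == 0:
        return 0  # assumption: an empty file uses no data blocks
    blocks = -(-size // block_size)  # ceiling division
    return blocks * block_size


def slack_report(sizes, block_size: int = 4096):
    """Return (bytes of data, bytes allocated, bytes wasted) for a set of files."""
    sizes = list(sizes)
    used = sum(sizes)
    allocated = sum(allocated_bytes(s, block_size) for s in sizes)
    return used, allocated, allocated - used


if __name__ == "__main__":
    # 1000 files of 48 bytes each, e.g. many tiny files in a backup image.
    used, allocated, wasted = slack_report([48] * 1000)
    print(f"data: {used} B, allocated: {allocated} B, wasted: {wasted} B")
```

Under these assumptions, 1000 files of 48 bytes hold 48,000 bytes of data but occupy 4,096,000 bytes, which is the kind of overhead that tail packing in reiserfs is meant to avoid.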
I typically back up around 100GB of data daily, and get around 150 non-expired images on a 250GB target drive before I retire it. I do not need to do expires. To mitigate the chances of disk failure (it has happened once) I do a rotating swap of 3 drives, so even if I lose the drive in the machine I still have recent backups on two other drives. This is affordable with reiserfs because I can get many more images on a drive. When I used ext3 for target drives, before I switched to reiserfs, I got much less usage out of the target drives.
I also would not use reiserfs on really ancient hard drives. Disk drives have cache buffers, which need to be completely written out to disk, and the heads parked, immediately after a power failure. When newer drives sense a power failure, they use the energy stored in the turning spindle to power the drive long enough to write out the cache buffers and park the heads. Older drives do not do this, and partially written sectors can result, or sectors partly written in an unexpected order. Ext3 is more robust, so it is more likely to tolerate this kind of abuse. Reiser is more likely to corrupt data in these circumstances. So don't use reiserfs on old drives! I would guess that any drive with a capacity greater than 100GB will write out its entire cache properly before halting.
So it is a tradeoff: both ext3 and reiserfs have problems, and I find the problems with reiserfs less troublesome than those with ext3. Other dirvish users will disagree, and so the best thing for everyone is to report your empirical experience with file systems, used as dirvish banks, to the mailing list and to the wiki.
Ideally, someone will invent a file system that allocates disk space efficiently like reiserfs, without inode limitations and wasted space, and that also treats data more securely during power failures like ext3. I hope some of you are on the lookout for this.
Keith Lofstrom 2006 Sept 8
