I read Wolf’s I/O error treatise this morning (as well as Alastair’s response), and thought I’d write a bit about how SuperDuper! actually handles I/O errors, and why. (In fact, this is an expansion and reworking of some email I dropped to Jonathan after reading the article.)
Although Wolf says otherwise, ditto isn’t our underlying engine. We use a variety of APIs in Cocoa and Carbon, augmented with much additional metadata copying. However, when we get a failure with those (such as an I/O error), we retry twice more: once with copyfile and once—just in case—with ditto, verifying after each one.
We do this because we’ve seen the rare case where one API fails but others do not. Weird, I know, but it happens.
If all three attempts fail, we stop. This is done for safety: an I/O error could mean the drive is failing, and since you’re dealing with a live backup, it’s important to understand what’s going on. If a significant failure is occurring, steps should be taken to concentrate on recovering your user files, rather than trying to copy the whole drive. We don’t want a user (or SuperDuper!) to continue past the failure: we want them to stop, diagnose and, if necessary, get help. And since most users won’t know what to do (unlike Wolf or Alastair, clearly), we make it really easy to contact support.
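To make the shape of that fallback chain concrete, here is a minimal sketch in Python. It is purely illustrative and is not SuperDuper!'s code: the names `copy_with_fallbacks`, `CopyFailed` and `DEFAULT_STRATEGIES` are made up for this example, Python's `shutil` functions stand in for the Cocoa/Carbon and copyfile paths, and "verifying" is assumed here to mean a byte-for-byte comparison.

```python
import filecmp
import shutil
import subprocess


class CopyFailed(Exception):
    """Raised when every copy strategy has failed for a file (hypothetical name)."""

    def __init__(self, src, failures):
        super().__init__(f"all copy attempts failed for {src}")
        self.src = src
        self.failures = failures  # list of (strategy name, reason)


def copy_with_ditto(src, dst):
    # Shells out to the macOS ditto tool; raises if it is missing or fails.
    subprocess.run(["ditto", str(src), str(dst)], check=True)


# Stand-ins for "primary API, then copyfile, then ditto".
DEFAULT_STRATEGIES = [
    ("primary", shutil.copy2),
    ("copyfile", shutil.copyfile),
    ("ditto", copy_with_ditto),
]


def copy_with_fallbacks(src, dst, strategies=DEFAULT_STRATEGIES):
    """Try each strategy in order, verifying after each one.

    Returns the name of the first strategy whose copy verifies.
    Raises CopyFailed if none do, so the caller stops rather than continuing.
    """
    failures = []
    for name, copy_fn in strategies:
        try:
            copy_fn(src, dst)
            # Assumed verification step: compare contents, not just metadata.
            if filecmp.cmp(src, dst, shallow=False):
                return name
            failures.append((name, "verification mismatch"))
        except (OSError, subprocess.CalledProcessError) as exc:
            failures.append((name, str(exc)))
    raise CopyFailed(src, failures)
```

The point of raising at the end, rather than returning a status, is that the caller can't accidentally carry on past a file that no method could copy.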
Our User’s Guide has a Troubleshooting section that helps a user determine whether the error is on the source or destination (I don’t explain there how to use the system log and a System Profiler report to locate the source, because it’s pretty obscure stuff—the amount of detail in our log is confusing enough for most), as well as general steps for recovery. But in most situations, 4K of 0s will pretty much be a fatal problem for the file. (I’m shocked, frankly, that Wolf’s Parallels disk was OK given the damage: he was very lucky.)
Most of the time, the problem is actually an iSight camera, iPod, or other bus-powered device misbehaving on the FireWire bus. On occasion, the problem is with the source.
Errors on the source are problematic. As Alastair mentions, modern disk controllers transparently relocate sectors when errors occur. Real problems happen when the drive’s out of spares, or when the on-disk error correction can’t handle the failure. And at this point, the drive has probably been silently failing for a while.
In many cases, SMART status will flag a drive that’s failing badly—SMART Reporter, a nice bit of freeware from Julian Mayer, can give you an obvious warning when this occurs, or even run a program (like SuperDuper!) to do a quick backup of critical files. But, often, it won’t, and experienced guidance and advice is necessary to help people understand what’s going on.
Anyway, as Wolf’s article indicates, and Alastair agrees, it’s very difficult to continue in a way that ensures data is preserved as much as possible. It’s hard to know what really happened without being there, and an automated fix isn’t guaranteed. So, we’re super conservative. And while it’s obviously labor intensive, we think injecting a human into the process at this point provides the user with the best outcome.
20 Jun 2006 at 01:06 am | #
Good post and thanks for the SMART freeware link, looks useful.
20 Jun 2006 at 08:26 am | #
I wouldn’t be shocked that a Parallels disk image would be ok (for some definition of “ok”) after losing 4K. It likely landed on a media file (movie, picture, sound) where it would likely cause only a garbled image or sound. It may also have landed on the swap file, or on one of the many backup files Windows keeps, or on another file that’s infrequently used (like help).
20 Jun 2006 at 08:34 am | #
I was basing that comment on the bad luck I’ve had with Virtual PC images and similar types of corruption: about 99.9% of the time, it would totally destroy the disk, which wouldn’t boot or mount.
Parallels’ storage mechanism might be more robust, or, as you said, it could have just hit a non-critical area…
20 Jun 2006 at 11:58 am | #
Yeah, it’s a tricky situation. I agree that the right answer really is to stop and suggest in the UI that the drive be taken care of by professionals—perhaps SD tech support. (It’s not clear to me whether SD actually makes that specific suggestion to the user or not when it stops. If not, it should.)
Of course, the problem is that there are so few actual professionals in this field, and so many amateurs that like to pretend they are something more. Jon Rentzsch is very smart and was able to figure it out, but I know this isn’t exactly his field, so let’s call him pro-am. Call me a cynic, but I suspect most “Geniuses” at the Apple Store would’ve failed. I don’t know if anyone would have really made it worse, but it would’ve stymied a lot of people and a lot of software too.
There are _real_ professionals like DriveSavers, but they’re supposed to be on the order of $2000 per drive rescued. If you’re not going to pony up that kind of dough, you’ve fallen into a real gray area. Something like Data Rescue is probably your best bet at that point.
20 Jun 2006 at 03:01 pm | #
Interesting background info...thanks for posting. What do I have to do to get Apple to support SMART reporting on my FireWire boot drive, though? That’s always been a pet peeve.
That’s a rhetorical question, by the way, unless anyone has a good answer : )
21 Jun 2006 at 01:10 pm | #
Hey, Drew. The UI basically indicates the error, allows you to investigate the log (which is consciously a bit unfriendly: we put a lot of data in there to provide good information for us, but also to intimidate beginners a little so they don’t get in over their heads), and then offers a prominent “Send to Shirt Pocket” button.
The User’s Guide goes through some relatively standard troubleshooting steps, mostly to determine whether the problem is on the source or destination volumes. But we don’t detail how to recover if the file on the source is corrupt, assuming that a user who has a damaged critical file will talk to us or seek professional help.
The best solution, of course, is to have sufficient redundancy, especially for critical files, that a hit like this loses only a few hours of work. Hence: multiple, rotated backups… and care.
21 Jun 2006 at 01:12 pm | #
Jamie—yeah, as far as I know, the FireWire drive interface keeps SMART information on the other side of the transaction, and never transmits status. I don’t know why, since I’ve never investigated the low-level workings of FireWire disk interaction, but it’s not just OS X.
27 Jun 2006 at 09:39 pm | #
Thanks for the explanation, but I still think Rentzsch is correct. I recently had the “opportunity” to back up a failing disk with (a registered copy of) SuperDuper!. My process wound up looking like:
1. Run backup
2. wait 30-40 minutes for error
3. remove/replace troublesome file
4. repeat
I had to iterate through this 3 times, I think, before the disk was completely backed up. I would much rather SD had just kept going and informed me of the problematic files at the end of the process.
Incidentally, no personal files were lost. The errors were in random system files like the Dutch localization for iPhoto and whatnot.
27 Jun 2006 at 10:28 pm | #
Well, it certainly shouldn’t take 30-40 minutes for an error: Smart Update should resume quite quickly. Were you erasing each time?
18 Nov 2006 at 03:17 pm | #
Dave directed me to this thread after I suggested, like others, that SuperDuper log and report errors but continue operation rather than stop.
It’s an interesting issue: Dave essentially wishes to force users to deal with the problem, an approach that certainly has some merit. And SuperDuper provides an easy mechanism for users to request help, which is commendable. (The help came quickly, even late at night and on a weekend.)
But I run my backup overnight, and first it erases the target drive (which may have contained an older backup). When I come back in the morning, I don’t have a backup and I have a problem that I have to address on my drive. Personally, I’d rather have a big red warning (“Not all your data is backed up!!! Click here for help.”) and still have as many of my other files on the backup as possible.
Since this is really a philosophy issue rather than a technical requirement, why not make it a user preference? “Safer: Stop on Error to get Help” vs. “Convenient: Log and Report Errors but Keep on going.”
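As a rough sketch of what such a preference could look like in code (Python, purely hypothetical; `ErrorPolicy`, `run_backup` and `copy_one` are names invented for this example and have nothing to do with how SuperDuper! is actually built):

```python
from enum import Enum


class ErrorPolicy(Enum):
    STOP = "Safer: Stop on Error to get Help"
    CONTINUE = "Convenient: Log and Report Errors but Keep on going"


def run_backup(files, copy_one, policy):
    """Copy each file with copy_one(path), applying the chosen error policy.

    Returns (copied, failed). Under STOP the first I/O error is re-raised,
    so nothing after the failing file is attempted. Under CONTINUE the
    failure is recorded and the run carries on to the next file.
    """
    copied = []
    failed = []  # list of (path, error) to report at the end
    for path in files:
        try:
            copy_one(path)
            copied.append(path)
        except OSError as exc:
            if policy is ErrorPolicy.STOP:
                raise
            failed.append((path, exc))
    return copied, failed
```

With `ErrorPolicy.CONTINUE`, a non-empty `failed` list at the end is what would drive the big red warning; with `ErrorPolicy.STOP`, the behavior matches what Dave describes today.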