[Tfug] Distributed storage / how parity in RAID works.
Bowie J. Poag
bpoag at comcast.net
Tue Aug 22 20:34:23 MST 2006
Hi Chris,
*sigh*.. Class time, kids. Pull up a chair, and grab a decent
calculator. Here's how parity works in RAID. :)
Before I begin, it's important to know that not all RAID types use
parity--Also, I assure you, no magical psychic unicorns were harmed in
the making of this example. For the sake of keeping things simple, we're
going to talk about RAID 3.
Parity relies on a Boolean operator called XOR, or "Exclusive OR". XOR
behaves in a "one or the other, but not both" fashion, as seen below:
0 XOR 0 = 0 (False)
0 XOR 1 = 1 (True)
1 XOR 0 = 1 (True)
1 XOR 1 = 0 (False)
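In code, that truth table falls right out of the bitwise XOR operator. A quick sketch in Python (any language with a `^` operator works the same way):

```python
# Print the XOR truth table using Python's bitwise XOR operator (^).
for a in (0, 1):
    for b in (0, 1):
        print(f"{a} XOR {b} = {a ^ b}")
```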
For the sake of keeping it simple, let's suppose we had a ridiculously
tiny, tiiiny RAID 3 set made up of 6 drives, each capable of holding a
whopping 2 bytes of data. Broken down by purpose, that's four drives to
store our raw data, a fifth drive to store our parity data (after we
calculate it), and a sixth drive for use as a "hot spare" i.e. a drive
that sits around waiting to be called into action should one of the
other drives fail.
Let's say we have a program that saves the string "C7EF011FAB07717B" to
disk. Striped across the RAID, it would look like so.. Two bytes per drive:
Drive 1 contains "C7EF"
Drive 2 contains "011F"
Drive 3 contains "AB07"
Drive 4 contains "717B"
Drive 5 contains nothing so far.
Drive 6 is the hot spare, ready to be used if another drive dies.
Now keep in mind, when any data is written, parity for that data has to
be calculated and stored as well.. It's calculated using XOR across the
width of the RAID set. Whip out a calculator and calculate parity for
the stripe by hand:
C7EF XOR 011F XOR AB07 XOR 717B (equals)
Congratulations, you just calculated parity for a RAID stripe. The
result will be "1C8C". This gets stored on Drive 5, the parity
drive. Now, the RAID set looks like so:
Drive 1 contains "C7EF"
Drive 2 contains "011F"
Drive 3 contains "AB07"
Drive 4 contains "717B"
Drive 5 contains "1C8C"
Drive 6 is the hot spare, ready to be used if another drive dies.
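If you'd rather let the computer whip out the calculator for you, the parity calculation above is just a chain of XORs across the stripe. A minimal sketch, treating each drive's 2 bytes as a 16-bit number:

```python
# Compute the parity stripe for our tiny RAID 3 set.
# Each "drive" holds 2 bytes, written here as a 16-bit hex value.
drives = [0xC7EF, 0x011F, 0xAB07, 0x717B]  # Drives 1 through 4

parity = 0
for d in drives:
    parity ^= d  # XOR each drive's data into the running parity

print(f"{parity:04X}")  # prints 1C8C -- what goes on Drive 5
```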
Now, let's suppose a week goes by, and Drive 3 dies in a horrible
accident involving ham, and a bicycle. Without knowing the contents of
Drive 3, it's possible to reconstruct it using the parity data we
calculated earlier when the data was originally written.
To find out what was stored on Drive 3, just take the parity information
you calculated, and substitute it in for the drive that failed:
C7EF XOR 011F XOR 1C8C XOR 717B (equals)
The result will be...drum roll........ AB07! Tah-dah! The contents of
Drive 3! Cool, huh?
Magic? Nope.
Psychic unicorns involved? No sir.
Just a clever use of a vastly under-appreciated Boolean operator.
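The reconstruction step is the same chain of XORs, just with the parity swapped in for the dead drive. A sketch:

```python
# Reconstruct the dead drive (Drive 3) by XORing the survivors
# together with the stored parity.
survivors = [0xC7EF, 0x011F, 0x717B]  # Drives 1, 2, and 4
parity = 0x1C8C                       # Drive 5, the parity drive

rebuilt = parity
for d in survivors:
    rebuilt ^= d

print(f"{rebuilt:04X}")  # prints AB07 -- Drive 3 is back
```

This works because XOR is its own inverse: XORing the same value in twice cancels it out, so everything except the missing drive's data drops away.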
Since one of our drives is dead, covered in ham and bicycle parts, we're
going to need a replacement drive. That's where the hot spare comes in.
Bring in the hot spare, write the result (AB07) to it, and declare the
hot spare as an official member of the RAID set. Voila--Your RAID has
been repaired.
What you've just done is exactly the same process that happens in "real
world" RAID.. It's just that in the real world, the stripe size is far
bigger, and the RAID set may involve many, many more drives.
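To show it scales, here's a hedged sketch of the same trick over stripes of arbitrary width and any number of drives (the function name and byte-string layout are my own invention for illustration, not anything a real controller exposes):

```python
from functools import reduce
from operator import xor

def parity_of(stripes: list[bytes]) -> bytes:
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(xor, column) for column in zip(*stripes))

# Works the same whether the stripe is 2 bytes or 2 megabytes:
data = [b"C7EF", b"011F", b"AB07", b"717B"]  # raw ASCII, just for show
parity = parity_of(data)

# Lose any one stripe, and XORing the rest with the parity gets it back:
rebuilt = parity_of([data[0], data[1], data[3], parity])
assert rebuilt == data[2]
```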
While it's an impressive trick, parity will only protect you from
>single drive< failures. If you're missing more than one member of that
RAID set, you don't have enough information for the XOR calculation to
yield the missing data. That's the point I was getting at--If these guys
are using something like distributed parity across 11 nodes, and one
node goes down, every single one of those 10 remaining nodes will have
to provide its data at some point in order to rebuild the
contents of the node that's been lost. With nodes constantly coming in
and out of existence, the overhead involved would be insane, bordering
on sheer---dare I say it--asshattery. Enough to totally make my monocle
pop out!
*Harumph!*
Cheers,
Bowie
Chris Niswander wrote:
>At 10:46 PM 8/21/2006 -0700, you wrote:
>
>
>>Two things come to mind here.
>>
>>If they're using something like distributed parity to fill in the gaps
>>for when one or more of the nodes goes offline, heh, good luck with that
>>one, chief. Unless the client comes bundled with a magical psychic
>>unicorn that can come up with the missing data out of it's butt, you're
>>going to have one hell of a time rebuilding a stripe because the
>>
>>
>
>Splitting up a secret into n pieces of information, so that:
>1. with any x (x<n) of those pieces the original secret can be regenerated
>2. fewer than x of those pieces are inadequate to find out anything
> about the secret.
>is a pretty well established area in cryptography by now.
>
>Try keyphrase: secret sharing
>
>In short, this is a feasible thing to do with backups.
>
>In digression: Bowie, are you trolling? ;-)
>
>Chris
>
>
>
>_______________________________________________
>Tucson Free Unix Group - tfug at tfug.org
>Subscription Options:
>http://www.tfug.org/mailman/listinfo/tfug_tfug.org