[Tfug] Check file system and restore array
Timothy D. Lenz
tlenz at vorgon.com
Sat Feb 9 13:19:22 MST 2013
Looks like I might have some time in the next few days to work on this,
the reason I found and joined this list. Sorry for the long post, a lot
of info to explain where it's at. I'm sure there is detail not needed
and stuff needed that I left out. Always seems to be.
I have a Linux computer running an older version of Debian. Its main
function is a DVR using vdr, but it's also used for some storage, a
Teamspeak server, and a basic web server for file exchange when
something needs to go to more than one person. It has four 500 GB
Seagate drives set up as 2 RAID 1 pairs.
The first pair, sda/sdb has 3 partitions.
md0 is the boot partition with os and other programs.
md1 is a small swap partition
md2 is for data storage, recordings, etc.
sdc/sdd has a single partition (md3) for more storage space.
On the 24th I had recorded a couple of shows, but when I went to watch
the second, its playback was bad, lots of screen freezing and problems
that can either be bad signal, or, sometimes vdr or xine or xorg gets a
memory leak and the computer needs to be rebooted. Just restarting vdr
isn't always enough. So I rebooted. It had been 200+ days since the last
file system check, so it forced a check, but that started giving lots of
errors wanting to move sectors and stuff. It was far enough along in the
boot that it was logged. It got past about 50% and went fine from there.
When I looked at kern.log, there were entries for the 20th about SATA
problems. I don't know why the drive wasn't failed then and email sent
like in the past.
I built the system, but working with the drives/array/file system after
having spent so much time getting it going, stresses me out. Too easy to
mess up and lose it all and I have to go back through notes and ask
lots of questions just about every time I work on some part of the system
because I can't remember from one time to the next. So I thought I would
just take it into a shop and let them sort it out. I saved some of the
logs to a stick and provided that with the computer.
kern.log: http://pastebin.com/Tb7f3jS5
dmesg: http://pastebin.com/rP5hTRHX
I have had drives fail at least 4 times in the past. I've always had
problems with seagate drives, so I assumed it was a seagate thing. Most
failures happened after a power down. But when the shop started on the
computer, they said only the fans would power up. The power supply had
gone bad. They replaced that and then found a sata cable was flaky for
sda. I thought they knew linux, but turns out they didn't know that
much. The tech said he disabled the floppy in cmos because it was giving
errors during boot. Well, yea, grub lets you know if the floppy has no
disk before finishing booting from the hard drives. They didn't find any
other problems, but when I got it back and did a cat /proc/mdstat, I
found 3 arrays were degraded. Also started getting emails confirming it.
(Also found they had turned off AMD Cool & Quiet and the fan temp
control for both case and cpu fans in CMOS, turned the boot logo
screen back on, and turned off memory ECC. Maybe did a CMOS reset.)
-------------------------------------------------------------------
The /proc/mdstat file currently contains the following:
Personalities : [raid1] [raid6] [raid5] [raid4] [multipath]
md1 : active raid1 sdb2[1]
      4891712 blocks [2/1] [_U]
md2 : active raid1 sdb3[1]
      459073344 blocks [2/1] [_U]
md3 : active raid1 sdd1[1] sdc1[0]
      488383936 blocks [2/2] [UU]
md0 : active raid1 sdb1[1]
      24418688 blocks [2/1] [_U]
unused devices: <none>
-------------------------------------------------------------------
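For what it's worth, the degraded arrays can be picked out of that
output mechanically. A minimal sketch, assuming the usual mdstat layout
where the last field of each "blocks" line is the member map and "_"
marks a missing member. degraded_arrays is a made-up helper name, not
an mdadm command:

```shell
#!/bin/sh
# degraded_arrays [FILE]: print the name of every md array whose member
# map (e.g. [_U]) contains "_", meaning at least one member is missing.
# FILE defaults to /proc/mdstat; the helper name is hypothetical.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/         { name = $1 }     # remember the current array
        /blocks/ && $NF ~ /_/ { print name }    # member map has an empty slot
    ' "${1:-/proc/mdstat}"
}
```

Called with no argument it reads /proc/mdstat directly and only reads,
so it should be safe to run as a normal user.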
But there may be more :(. I was looking at the logs and noticed that
even though I rebooted and turned ECC back on, the logs still seemed to
show that ECC wasn't supported in CMOS, and that while the time stamp on
kern.log and others was updating, nothing new was being added. I access
the computer through WinSCP and PuTTY and I know dmesg and others often
don't show the latest entry, as they seem to get cached in RAM for
a while before being written to the log. But the other logs always
seemed to be getting updates right away in the past, which has me
wondering if there are other problems now besides degraded arrays.
So I need a way to fairly safely check to make sure it is working
correctly and then need to figure out again adding the drive back in.
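One fairly safe starting point is to stick to commands that only read
state. A rough sketch, assuming mdadm and smartctl (from smartmontools)
are installed. plan_checks is a hypothetical helper that prints the
commands for review instead of running them:

```shell
#!/bin/sh
# plan_checks DEV...: print one read-only diagnostic command per device.
# Nothing is executed here; the helper name is hypothetical.
plan_checks() {
    for dev in "$@"; do
        case "$dev" in
            /dev/md*) echo "mdadm --detail $dev" ;;        # array state and member list
            /dev/sd*) echo "smartctl -H -l error $dev" ;;  # SMART health verdict and error log
            *)        echo "# skipped: $dev" ;;            # not an md array or sd disk
        esac
    done
}
```

For this box that would be something like
plan_checks /dev/md0 /dev/md1 /dev/md2 /dev/sda /dev/sdb, and once the
list looks right it can be piped to "sudo sh" to actually run it.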
For the array part, from my notes I have this:
https://wiki.ubuntu.com/Grub2
https://help.ubuntu.com/community/Grub2
# fail the disk (if it already shows (F) you may skip this step
# for an already degraded array)
sudo mdadm --manage /dev/md0 --fail /dev/sda1
sudo mdadm --manage /dev/md1 --fail /dev/sda2
sudo mdadm --manage /dev/md2 --fail /dev/sda3
# remove failed disk (must always be done)
sudo mdadm --manage /dev/md0 --remove /dev/sda1
sudo mdadm --manage /dev/md1 --remove /dev/sda2
sudo mdadm --manage /dev/md2 --remove /dev/sda3
There is no "(F)" showing in mdstat, so I'm guessing even though it has
degraded the array, it hasn't failed sda? Or removed it from the array yet?
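As far as I know, mdstat shows a member that has been failed but not
yet removed with "(F)" after its name, and a partition that does not
appear on the array's line at all is no longer part of the running
array. A small sketch to check one partition either way. member_state
is a hypothetical helper name, and it only reads the file:

```shell
#!/bin/sh
# member_state DEV [FILE]: report how DEV (e.g. sda1) appears in mdstat.
# FILE defaults to /proc/mdstat; the helper name is hypothetical.
member_state() {
    awk -v dev="$1" '
        /^md[0-9]+ :/ {
            for (i = 3; i <= NF; i++) {
                name = $i
                sub(/\[.*/, "", name)        # "sda1[0](F)" -> "sda1"
                if (name == dev) {
                    state = ($i ~ /\(F\)/) ? "failed" : "active"
                    print dev, state, "in", $1
                    found = 1
                }
            }
        }
        END { if (!found) print dev, "not listed in any array" }
    ' "${2:-/proc/mdstat}"
}
```

If member_state sda1 says "not listed in any array", the --fail and
--remove steps presumably have nothing left to act on for that
partition.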
I also have a note from a web page about needing to zero the md
superblock on sda before it can be added back in:
sudo mdadm --zero-superblock /dev/sda3
Assuming I need to do this, does it need to be done for sda1, sda2, and
sda3?
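Before zeroing anything, it should be possible to look at what is on
each partition first. A sketch, assuming "mdadm --examine" exits
non-zero when it finds no md superblock. superblock_report and the
MDADM override variable are made up for illustration:

```shell
#!/bin/sh
# superblock_report PART...: say which partitions still carry an md
# superblock. Read-only. Set MDADM to substitute another command for
# mdadm (hypothetical knob, mainly useful for a dry run).
superblock_report() {
    for part in "$@"; do
        if ${MDADM:-mdadm} --examine "$part" >/dev/null 2>&1; then
            echo "$part: md superblock present"
        else
            echo "$part: no md superblock found"
        fi
    done
}
```

This needs to run as root, otherwise mdadm cannot open the partitions
and every one of them will be reported as having no superblock.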
I think this note is from the last time I replaced a drive since I log
in as user, not root and then often have to use sudo:
# add disk to raid array
sudo mdadm --manage /dev/md0 --add /dev/sda1
sudo mdadm --manage /dev/md1 --add /dev/sda2
sudo mdadm --manage /dev/md2 --add /dev/sda3
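After the --add, the rebuild shows up in /proc/mdstat as a "recovery"
line with a percentage. A minimal sketch for pulling out just that
number, assuming the usual mdstat layout. rebuild_progress is a
hypothetical helper name:

```shell
#!/bin/sh
# rebuild_progress [FILE]: print "ARRAY PERCENT" for every array that is
# currently recovering or resyncing. FILE defaults to /proc/mdstat.
rebuild_progress() {
    awk '
        /^md[0-9]+ :/ { name = $1 }           # remember the current array
        /recovery|resync/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) {
                    pct = $i
                    sub(/^=/, "", pct)        # strip "=" if it is glued to the number
                    print name, pct
                }
        }
    ' "${1:-/proc/mdstat}"
}
```

It prints nothing when no array is rebuilding, so an empty result
together with [UU] in mdstat would suggest the rebuild has finished.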
Then I have this note from an IRC chat for getting it back to bootable:
[14:04] <Jordan_U> Vorg: Ok, then run "grub-install /dev/sda &&
grub-install /dev/sdb" (where sda and sdb are the members of the array)
Think I have to use "su" first to switch to root user though.
---------------------------------------------------------------
Then on a side note, when this happened, it was recommended in an IRC
chat to disable something called NCQ. From googling it, it has something
to do with ide to sata or sata to ide, not sure, and that it can cause
drives to drop from an array and slow down raid. I've had it set up this
way for a few years, but that doesn't mean it's set up right. I used
to rebuild the kernel and update linux every so often, but the new stuff
to the kernel got more and more complex and harder to figure out what
was needed and what wasn't and what shouldn't be used. At one point I
was told that a certain module wasn't needed any more and now the dvd
doesn't work because it was. I have the raid stuff built into the kernel
btw, so that I don't have to mess with a ram disk and init whatever,
booting from a non-raided partition and switching over to raid. I just
boot from the raid 1 partition.
[22:26] <@sj> i read you should disable ncq
[22:41] <@sj> most of what i read in the last few mins all said either
you have a bad/partially connected sata cable, or you need to disable
ncq.. although a failing drive is always a possibility
[22:45] <@sj> http://ubuntuforums.org/showthread.php?t=1640909&page=2
<-- last post, the guy fixed that problem by updating the microcode
(note, only applies to Intel, I have AMD)
[22:46] <@sj> http://lists.debian.org/debian-user/2009/07/msg02209.html
<-- how to disable ncq
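For reference, NCQ is Native Command Queuing, a SATA feature that lets
the drive reorder queued commands. One common way to turn it off on a
running system, which is not necessarily what the linked post
describes, is to set the disk's queue depth to 1 through sysfs. A
sketch with hypothetical helper names. The optional second argument
only exists so the functions can be pointed at a test directory instead
of /sys:

```shell
#!/bin/sh
# ncq_depth DISK [SYSROOT]: print the current queue depth for DISK (e.g. sda).
ncq_depth() {
    cat "${2:-/sys}/block/$1/device/queue_depth"
}

# disable_ncq DISK [SYSROOT]: set the queue depth to 1, which effectively
# turns NCQ off until the next reboot. Needs root on a real system.
disable_ncq() {
    echo 1 > "${2:-/sys}/block/$1/device/queue_depth"
}
```

Without the helpers, the same thing as root is
sudo sh -c 'echo 1 > /sys/block/sda/device/queue_depth'. The setting
does not survive a reboot. The libata.force=noncq kernel parameter is
the persistent route, though that applies to every SATA port unless it
is narrowed down.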