EXT4 corrupt, repair fails too. Ideas?

STurtle · February 24, 2010, 12:15am

Hello, both root and home partition have suddenly become corrupt, and the repair tool from the installation disk just loops (Do you want to repair? Yes! Do you want to repair? Yes! Do you…).

Any ideas what this error message at boot means:

[   54.209685] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
[   54.209758] ata1.00: irq_stat 0x40000008
[   54.209828] ata1.00: cmd 60/00:00:1f:94:88/01:00:06:00:00/40 tag 0 ncq 131072 in
[   54.209829]          res 41/40:00:1b:95:88/87:01:06:00:00/00 Emask 0x409 (media error) <F>
[   54.209923] ata1.00: status: { DRDY ERR }
[   54.209970] ata1.00: error: { UNC }

This block is repeated several times with different numbers, and then eventually:


[   57.264842] end_request: I/O error, dev sda, sector 109614363
[   57.264922] JBD: Failed to read block at offset 1541
[   57.265053] EXT4-fs (sda5): error loading journal
mount: wrong fs type, bad option, bad superblock on /dev/sda5,
       missing codepage or helper program, or other error

$

The $ prompt is pretty useless, I cannot get to any sensible system.
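
A side note on reading the log above: as far as I understand the libata message format, the "res" line carries the failing address in its taskfile bytes (status/error:count:lbal:lbam:lbah/hob_feature:hob_count:hob_lbal:hob_lbam:hob_lbah/device). Putting the six LBA bytes together gives the same sector that the end_request line reports. A minimal sketch in POSIX shell, assuming that field layout; the function name ata_tf_lba is made up for illustration:

```shell
#!/bin/sh
# ata_tf_lba TASKFILE
# Print the 48-bit LBA encoded in a libata "cmd"/"res" taskfile string.
# Assumed field layout (hypothetical helper, not part of any tool):
#   1 status  2 error/feature  3 count  4 lbal  5 lbam  6 lbah
#   7 hob_feature  8 hob_count  9 hob_lbal  10 hob_lbam  11 hob_lbah  12 device
ata_tf_lba() (
    # Subshell body, so the IFS and set -f changes stay local to this call.
    IFS='/:'
    set -f                      # no globbing while splitting
    set -- $1                   # split the taskfile into its 12 hex bytes
    [ "$#" -eq 12 ] || { echo "unexpected taskfile format" >&2; exit 1; }
    # High-order bytes first: hob_lbah hob_lbam hob_lbal lbah lbam lbal
    echo $(( 0x${11}${10}${9}${6}${5}${4} ))
)

# Usage:
#   ata_tf_lba 41/40:00:1b:95:88/87:01:06:00:00/00   # prints 109614363
```

That is the sector named in the I/O error, so the journal read on sda5 failed on exactly the block the drive reported as uncorrectable (UNC).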

I can still boot and run Windows Vista on /dev/sda3, but this is only a small 20GB partition that I do not use for anything. Is this of any use?

How do I read the S.M.A.R.T. values now? From Vista?

EDIT: I do have a comprehensive daily backup of my home partition, but I wonder how to determine whether it is worth trying to reinstall or whether the harddisk is just dead.

ken_yap · February 24, 2010, 12:41am

Those sorts of hardware messages are severe and point to hardware failure. It could also be the I/O channel, so if you have another machine, try moving the disk to that. Alternatively, put a known good disk on the channel and see if you can run with that. That’ll tell you if the disk is a brick now.

STurtle · February 24, 2010, 12:45am

Thanks for the advice! Yes, I have a colleague with a nearly identical laptop, whom I might convince to let me swap the drives.

I was just wondering whether there is a generic or easy way to access the disk’s SMART data, to see what is logged there.

It also appears weird to me that the Windows partition is working fine, but that only the EXT4 partitions are affected?

gogalthorp · February 24, 2010, 12:46am

To know if it is the drive you need to run a low-level scan. I use Spinrite, but it is not a free program. You can normally get a scan app from your drive manufacturer’s site.

You can try to repair the file system with the fsck program. Note that this only works on unmounted partitions, so it needs to be done from a DVD/CD boot. The repair function on the install disk will only repair small problems. If you need to repair manually, chances are good that you have lost some files even if the repair succeeds. Read up on fsck (i.e. man fsck) before you try to use it. If there is damage on root, just plan on a new install. Any pieces of files and directories recovered during the fix will end up in lost+found.
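
As a rough sketch of the "unmounted only" rule: the helper below looks the device up in the mount table before starting a forced, read-only check. The function names is_mounted and check_fs are made up for illustration, and I am assuming an ext2/3/4 partition, so e2fsck is used directly with -f (force) and -n (answer "no" to everything, change nothing). Note that the mount table may list the root file system under another name (for example /dev/root), so treat the lookup as a convenience, not a guarantee.

```shell
#!/bin/sh
# is_mounted DEVICE [MOUNTS_FILE]
# Succeeds if DEVICE is listed as a mount source. MOUNTS_FILE defaults to
# /proc/mounts; the optional argument exists only to make this testable.
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}

# check_fs DEVICE
# Forced read-only check of an unmounted ext2/3/4 partition (hypothetical helper).
check_fs() {
    if is_mounted "$1"; then
        echo "$1 is mounted; unmount it or boot from a live CD first" >&2
        return 1
    fi
    e2fsck -f -n "$1"       # -f: check even if marked clean, -n: no changes
}

# Usage (as root, from a live CD):
#   check_fs /dev/sda5
```

Once the read-only pass looks sane, the same command without -n will actually attempt repairs.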

ken_yap · February 24, 2010, 12:51am

You could run smartctl off a Live CD. I suspect if you do a search you will find some kind of SMART utility for Windows, since it’s implemented in the disk firmware and the program just sends the commands to the disk.

It also appears weird to me that the Windows partition is working fine, but that only the EXT4 partitions are affected?

Not strange at all. It could just as easily have been the Windows partition that lost crucial blocks. It looks like certain areas of the disk are bad, making it more likely that it’s the disk and not the channel.

STurtle · February 24, 2010, 5:05pm

I just reinstalled 11.2 with no problems, without changing anything.

It mounted my home partition without a hiccup, a feat that the repair option from the same 11.2 DVD did not accomplish. All data seems intact too, with no difference from my backup.

The SMART extended offline test just completed. No problems. Good Health.

I am puzzled about what happened that caused the above. Could it be because I used WRITEBACK journalling instead of ORDERED?

gogalthorp · February 24, 2010, 5:10pm

Power failure? Bouncing a notebook while operating??? Act of god??

STurtle · February 24, 2010, 10:10pm

How can I force a full check of my harddisk?

sudo shutdown -rF now

does not really seem to trigger an extensive test for EXT4. And

sudo smartctl -t long /dev/sda

also finished without anything.

Sure, but I would expect that those could be repaired by fsck.

Apparently, the Suse reinstall could mount the home partition without trouble, and smartctl shows no defective sectors either, which I would expect after a bump (which there wasn’t):

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   108   099   006    Pre-fail  Always       -       15434359 
  3 Spin_Up_Time            0x0003   098   097   085    Pre-fail  Always       -       0        
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1542     
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0        
  7 Seek_Error_Rate         0x000f   066   060   030    Pre-fail  Always       -       77382667519                                                                                                
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3712      
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       10        
 12 Power_Cycle_Count       0x0032   099   037   020    Old_age   Always       -       1542      
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0         
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       103       
188 Unknown_Attribute       0x0032   100   097   000    Old_age   Always       -       21        
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0         
190 Airflow_Temperature_Cel 0x0022   054   032   045    Old_age   Always   In_the_past 46 (0 6 46 29)                                                                                             
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0         
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       996       
193 Load_Cycle_Count        0x0032   079   079   000    Old_age   Always       -       43857     
194 Temperature_Celsius     0x0022   046   068   000    Old_age   Always       -       46 (0 13 0 0)                                                                                              
195 Hardware_ECC_Recovered  0x001a   035   032   000    Old_age   Always       -       15434359  
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0         
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0         
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0         
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       124463857278634                                                                                            
241 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       3509712395
242 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       2668931560
254 Unknown_Attribute       0x0032   001   001   000    Old_age   Always       -       21

So apparently the HDD overheated, which might have happened when the laptop did not shut down properly while in its bag. So is there no permanent damage, then?
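
The table above can also be skimmed mechanically. Here is a small sketch that reads smartctl -A output on stdin and prints every attribute that has ever crossed its threshold (the WHEN_FAILED column), plus a few counters that are commonly watched for bad sectors. Which IDs to watch (5, 187, 197, 198) is my own assumption, raw values are vendor-specific, and the function name smart_summary is made up:

```shell
#!/bin/sh
# smart_summary
# Reads the attribute table printed by `smartctl -A` on stdin.
# Columns: 1 ID, 2 NAME, 3 FLAG, 4 VALUE, 5 WORST, 6 THRESH,
#          7 TYPE, 8 UPDATED, 9 WHEN_FAILED, 10 RAW_VALUE
smart_summary() {
    awk '
        $1 !~ /^[0-9]+$/ { next }      # skip the header and any non-table lines
        $9 != "-" {                    # attribute failed now or in the past
            print "failed " $9 ": " $1 " " $2
        }
        ($1 == 5 || $1 == 187 || $1 == 197 || $1 == 198) && $10 + 0 > 0 {
            print "nonzero raw: " $1 " " $2 " = " $10
        }
    '
}

# Usage (as root):
#   smartctl -A /dev/sda | smart_summary
```

For my table this prints the 187 Reported_Uncorrect counter (raw value 103) and the 190 Airflow_Temperature_Cel entry marked In_the_past, while 5, 197 and 198 stay quiet because their raw values are 0.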

gogalthorp · February 24, 2010, 10:16pm

The only way to tell is to do a low level scan of the disk to check for damaged sectors.

The only way to thoroughly check the file system is to run fsck on the partition. The partition must not be mounted, so you must do it from a bootable CD; try GParted.

Type
man fsck
at the command line to see the manual.

gropiuskalle · February 25, 2010, 10:43am

How can I force a full check of my harddisk?

Like this:

touch /forcefsck

…then reboot. The ‘shutdown -rF now’-command should work too, but I suppose this is one of the many cases where ‘sudo’ is not working - let me take the opportunity to point out that ‘sudo’ is often not adequate to “switch to root”, its intention is way different from ‘su’ and also needs proper configuration of /etc/sudoers to work as intended. SuSE != Ubuntu.

STurtle · February 25, 2010, 11:25am

Well, it does have an effect, I can see that fsck is executed. However for EXT4 the check only takes a few seconds instead of half an hour as with EXT3. So how thorough can this check be?

Thanks, I will try that and see whether fsck can be convinced to do something more laborious then.

However, I thought that during boot fsck is called before mounting the partition already? (Provided it is not the root partition.)

I know. I usually open a root konsole and very rarely use sudo.
Ever since I got a lecture by oldcpu in this forum that being root is a bad no no, I just write “sudo” instead of “as root” or similar when I post here.
(I think oldcpu thought I was logged in as root when I copy & pasted something from a root konsole.)

gropiuskalle · February 25, 2010, 11:29am

Well, it does have an effect, I can see that fsck is executed. However for EXT4 the check only takes a few seconds instead of half an hour as with EXT3. So how thorough can this check be?

It’s not a bug, it’s a feature. One of the advantages of ext4 is that checks run much faster and are still reliable.

STurtle · February 25, 2010, 11:44am

Yes, I understand that. This is certainly ok under normal conditions, when there is a hard shutdown, etc.

However, I fear that the overheating of the disk might have damaged some parts of my data that were not in use at that moment, so the journal would not know about those errors. Hence I would like the CRC of all data to be checked.
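
As far as I know, fsck checks the file system metadata rather than the contents of the files, so one way to verify the data itself is to compare checksums against the backup. A minimal sketch using the POSIX cksum tool (a CRC plus the file size per file); the function names tree_cksums and compare_trees and the example paths are made up, and files that legitimately changed since the last backup will of course show up as differences:

```shell
#!/bin/sh
# tree_cksums DIR
# Print "CRC SIZE ./path" for every regular file below DIR, sorted by path.
tree_cksums() (
    # Subshell body, so the cd does not affect the caller.
    cd "$1" || exit 1
    find . -type f -exec cksum {} + | sort -k 3
)

# compare_trees DIR_A DIR_B
# Returns 0 if both trees hold the same files with the same CRCs and sizes,
# 1 if they differ, 2 if one of the directories cannot be read.
compare_trees() {
    a=$(tree_cksums "$1") || return 2
    b=$(tree_cksums "$2") || return 2
    [ "$a" = "$b" ]
}

# Usage (paths are examples only):
#   compare_trees /home/me /mnt/backup/me && echo "no differences"
```

To see which files differ, write both listings to files with tree_cksums and run diff on them.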

STurtle · February 27, 2010, 5:56pm

Hm, so I did
e2fsck -c -c -D -f -p -v
on both (unmounted) partitions (took more than 6 hours) and
smartctl -t long /dev/sda.
Neither reported any errors nor bad blocks.

So I guess it is safe to continue using my harddisk then, despite this glitch, or is there another check I can perform?
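
One more check that comes to mind is a plain read of the whole disk: it writes nothing, but forces the drive to read every sector, so an unreadable one should surface as an I/O error. A minimal sketch, assuming dd is available; the function name read_check is made up, and as written dd stops at the first read error:

```shell
#!/bin/sh
# read_check DEVICE_OR_FILE
# Read everything once and discard the data. Nothing is written to the source.
# Returns dd's exit status: 0 if every block could be read, nonzero otherwise.
read_check() {
    dd if="$1" of=/dev/null bs=1048576 2>/dev/null
}

# Usage (as root, this reads the entire disk and can take hours):
#   read_check /dev/sda && echo "all sectors readable"
# Afterwards, the kernel log and the drive's own error log are worth a look:
#   dmesg | tail
#   smartctl -l error /dev/sda
```

If this completes cleanly on top of the e2fsck -c run and the long self-test, I would take it as a third independent confirmation that no sector is currently unreadable.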