SEGFault on Kernel(?) ext4(?) or something like that

I have an openSUSE 11.4 setup with the latest patches installed and all packages updated. It is my home server, with 3 HDDs, NFS and SMB. Until the problem started, I had no issues.

However, for the last couple of weeks I have been seeing a really frustrating issue that is giving me a headache and making the system completely unstable.
The system becomes unresponsive, the HTTP service stops working properly and, most importantly, the file systems misbehave (file copies fail or do not complete for no apparent reason). Although fsck reports no errors, the following keeps filling /var/log/messages:
Message from syslogd@localhost at Apr 24 22:35:43 …
kernel:[83634.354593] init[1]: segfault at 0 ip 0804c7f6 sp bfb94db0 error 4 in init[8048000+9000]

Message from syslogd@localhost at Apr 24 22:36:13 …
kernel:[83664.385153] init[1]: segfault at 0 ip 0804c7f6 sp bfb94db0 error 4 in init[8048000+9000]

Message from syslogd@localhost at Apr 24 22:36:43 …
kernel:[83694.415720] init[1]: segfault at 0 ip 0804c7f6 sp bfb94db0 error 4 in init[8048000+9000]

Message from syslogd@localhost at Apr 24 22:37:13 …
kernel:[83724.446631] init[1]: segfault at 0 ip 0804c7f6 sp bfb94db0 error 4 in init[8048000+9000]

Any ideas?
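
For readers trying to interpret log lines like the ones above, here is a minimal sketch that decodes one of them. It assumes the usual x86 layout of the kernel's segfault message, where the `error` field is the hexadecimal page-fault error code (bit 0: the page was present, bit 1: the access was a write, bit 2: the fault came from user mode). The function name `decode_segfault` and the dictionary keys are made up for illustration.

```python
import re

# Matches the user-space segfault message format quoted in this thread.
SEGFAULT_RE = re.compile(
    r"(?P<comm>[^\s\[\]]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<obj>[^\s\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def decode_segfault(line):
    """Return a dict describing the fault, or None if the line does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)  # assumption: error code is printed in hex
    info = {
        "process": m.group("comm"),
        "pid": int(m.group("pid")),
        "fault_address": int(m.group("addr"), 16),
        "page_present": bool(err & 1),               # bit 0
        "access": "write" if err & 2 else "read",    # bit 1
        "mode": "user" if err & 4 else "kernel",     # bit 2
    }
    if m.group("base"):
        # Offset of the faulting instruction inside the mapped object.
        info["offset_in_object"] = int(m.group("ip"), 16) - int(m.group("base"), 16)
    return info


if __name__ == "__main__":
    sample = ("kernel:[83634.354593] init[1]: segfault at 0 ip 0804c7f6 "
              "sp bfb94db0 error 4 in init[8048000+9000]")
    print(decode_segfault(sample))
```

For the lines quoted above this reports a user-mode read of a non-present page at address 0, i.e. a NULL pointer dereference inside init, at offset 0x47f6 into the binary. It says nothing about why init did that; faulty hardware can produce exactly this kind of fault.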

On 2012-04-24 21:46, tpe wrote:
> kernel:[83634.354593] init[1]: segfault at 0 ip 0804c7f6 sp bfb94db0
> error 4 in init[8048000+9000]

A segfault in init? Wow. Serious.

What repos do you have?

Have you modified inittab?

Fax service?


Cheers / Saludos,

Carlos E. R.
(from 11.4 x86_64 “Celadon” at Telcontar)

You might want to tell us more about your computer hardware: brand, make, memory, CPU, video, age and so forth. Overheating can cause odd problems. When was the last time you cleaned out the CPU and video heat sinks? Memory can also cause such problems: it may be running too fast (i.e. overclocked), getting too hot, or simply going bad. Many problems that make little sense can be traced to overheating and memory issues.

Thank You,

The hardware is an old Athlon 3000+ running at a much lower speed (1200 MHz instead of the stock 2 GHz), simply because I don’t need it to run faster and I want to keep it cool. sensors does not report any serious heating issues:
asb100-i2c-0-2d
Adapter: SMBus Via Pro adapter at e800
VCore 1: +1.71 V (min = +1.31 V, max = +1.97 V)
+3.3V: +3.15 V (min = +2.96 V, max = +3.63 V)
+5V: +4.81 V (min = +4.49 V, max = +5.51 V)
+12V: +12.10 V (min = +9.55 V, max = +14.41 V)
-12V (reserved):-12.64 V (min = -0.00 V, max = -0.00 V)
-5V (reserved): -5.30 V (min = -0.00 V, max = -0.00 V)
CPU Fan (?): 1339 RPM (min = 664 RPM, div = 8) ALARM
Chassis Fan: 0 RPM (min = -1 RPM, div = 2) ALARM
Power Fan: 0 RPM (min = -1 RPM, div = 2)
M/B Temp: +33.0°C (high = +80.0°C, hyst = +75.0°C)
CPU Temp (AMD): +46.0°C (high = +60.0°C, hyst = +70.0°C)
cpu0_vid: +1.700 V
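
Readings like the ones above can also be checked mechanically. A rough sketch, assuming the plain-text layout printed by sensors exactly as pasted here; the function name `summarize_sensors` is hypothetical, and the parsing is only as good as that assumption:

```python
import re

# Temperature lines look like: "CPU Temp (AMD): +46.0°C (high = +60.0°C, ..."
TEMP_RE = re.compile(
    r"^(?P<label>[^:]+):\s*(?P<temp>[+-]?\d+(?:\.\d+)?)°C\s*"
    r"\(high = (?P<high>[+-]?\d+(?:\.\d+)?)°C"
)


def summarize_sensors(text):
    """Return (headroom per temperature label in °C, labels flagged ALARM)."""
    headroom = {}
    alarms = []
    for line in text.splitlines():
        line = line.strip()
        m = TEMP_RE.match(line)
        if m:
            # Distance between the current reading and the "high" limit.
            headroom[m.group("label")] = float(m.group("high")) - float(m.group("temp"))
        if line.endswith("ALARM"):
            alarms.append(line.split(":", 1)[0])
    return headroom, alarms
```

Fed the output above, this gives 47 °C of headroom for the motherboard and 14 °C for the CPU, and lists the CPU fan and chassis fan as ALARM. A chassis fan alarm at 0 RPM may simply mean nothing is plugged into that header.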

But it could be either an HDD or (even worse) a controller issue:



ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   111   099   006    Pre-fail  Always       -       33572630
  3 Spin_Up_Time            0x0003   098   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       230
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   072   060   030    Pre-fail  Always       -       19513094
  9 Power_On_Hours          0x0032   081   081   000    Old_age   Always       -       17238
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       115
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   097   000    Old_age   Always       -       34360459275
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   060   054   045    Old_age   Always       -       40 (Min/Max 38/41)
194 Temperature_Celsius     0x0022   040   046   000    Old_age   Always       -       40 (0 16 0 0)
195 Hardware_ECC_Recovered  0x001a   048   023   000    Old_age   Always       -       33572630
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       2959232484437
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       1638681722
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       3957646812

Do you see anything strange?
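
One way to answer that question mechanically is to compare the normalized VALUE and WORST columns against THRESH, which is the comparison behind the WHEN_FAILED column. A minimal sketch, assuming the whitespace-separated table layout pasted above; the function name `failing_smart_attributes` is made up for illustration:

```python
def failing_smart_attributes(table_text):
    """Return (name, value, worst, thresh) for attributes at or below threshold."""
    flagged = []
    for line in table_text.splitlines():
        fields = line.split()
        # Data rows start with a numeric attribute ID and have at least 10 columns;
        # this skips the header row and any blank lines.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name = fields[1]
        value, worst, thresh = (int(f) for f in fields[3:6])
        # A threshold of 0 means the attribute can never be reported as failed.
        if thresh > 0 and min(value, worst) <= thresh:
            flagged.append((name, value, worst, thresh))
    return flagged
```

For the table above this returns an empty list, which agrees with every WHEN_FAILED entry being "-". The large RAW_VALUE numbers are vendor-specific and are not interpreted here.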

Have you run a serious memory test? There is one on the install media; I recommend running it overnight.

I will do that tonight. Right now, I am waiting for the badblocks results.

On 2012-04-25 05:06, tpe wrote:

> Do you see anything strange?

No…


Cheers / Saludos,

Carlos E. R.
(from 11.4 x86_64 “Celadon” at Telcontar)

OK, I found it.
Bad memory DIMM… (Although, I cannot understand how a completely passive element such as RAM can fail after some time).

Thanks for the tip guys.

On 2012-04-29 08:06, tpe wrote:
>
> OK, I found it.
> Bad memory DIMM… (Although, I cannot understand how a completely
> passive element such as RAM can fail after some time).

RAM in electronics parlance is considered active. Very active. A resistor
would be passive.


Cheers / Saludos,

Carlos E. R.
(from 11.4 x86_64 “Celadon” at Telcontar)

On Sun, 29 Apr 2012 17:23:07 +0530, Carlos E. R.
<robin_listas@no-mx.forums.opensuse.org> wrote:

> On 2012-04-29 08:06, tpe wrote:
>>
>> OK, I found it.
>> Bad memory DIMM… (Although, I cannot understand how a completely
>> passive element such as RAM can fail after some time).
>
> RAM in electronics parlance is considered active. Very active. A resistor
> would be passive.
>

nothing in this world lasts forever, including RAM. that said, they used
to give lifetime warranty on RAM, but (at least here) they don’t do that
anymore. perhaps they started to engineer RAM to last a specific time
only, like light bulbs and pretty much everything else :(


phani.

Happy to hear you found your problem. Memory problems, or errors to be exact, can be hard to pin down, depending on how they manifest themselves. When memory is outright bad, the computer does not boot and may not even power up. Those cases can be easier to find than the ones that let your PC boot and load the OS but cause “some” software to work incorrectly. I have even had many cases where bad memory passed all memory diagnostics with flying colors. One thing is for sure: hardware works just fine until it decides to fail, no matter how long it has been in use, at a time and place of its own choosing. At least now you can get on with using your PC and openSUSE.

Good luck and let us know if we can be of further help to you.

Thank You,