64-bit: Questions specific to 64-bit hardware (software questions should be posted in the appropriate software forums)
All,

I have a machine that pauses at random. In top, the CPU stats show a high I/O wait state, often 100% for several seconds. Splitting the display to show individual CPUs, it seems that one or more CPUs at random sit at or near 100% wait for seconds at a time.

top is NOT reporting any application with high utilisation when this happens, and I have determined via iotop that the disks are NOT under any load either. Last night I ran the memtest off the install CD; it completed 8 scans without error in 11 1/2 hours.

My question is: does the CPU also have to stop and wait for network I/O, or is there some sort of "buffer" between the two? This machine is accessed via ssh to run X applications on other machines, so network traffic would be constantly high.

Specifications:

openSUSE 11.1 64-bit
Kernel (from uname): 2.6.27-29-0.1-default #1 SMP x86-64
CPU (from /proc/cpuinfo): Intel Core 2 Quad CPU Q8400 @ 2.66 GHz. It sees 4 CPUs.
Memory: 8 GB
Motherboard: Intel P5QL Pro, chipset P43
Disk controller: ICH10 southbridge
Disks: 4 SATA 1 TB disks. The / partition is on an md RAID mirrored over 2 disks. The other 2 disks are mounted under /mnt. All partitions are formatted ext3.
Swap: on one of the mirrored disks
/boot: on the other disk of the mirror set
Network: a PCIe Gigabit LAN controller connected to a 100 Mbit switch

Cheers
Jim
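
For anyone wanting to log which CPUs are stuck in I/O wait during a pause, rather than watching top, one option is to diff two snapshots of /proc/stat. This is only a sketch: it assumes the usual per-CPU column order in /proc/stat (user, nice, system, idle, iowait, irq, softirq, steal), and the function name `iowait_delta` and the snapshot paths are made up for illustration.

```shell
# Print the per-CPU iowait percentage between two /proc/stat snapshots.
# $1 = file holding the earlier snapshot, $2 = file holding the later one.
# (iowait_delta is a hypothetical helper name, not a standard tool.)
iowait_delta() {
    awk '
        # Per-CPU lines only (cpu0, cpu1, ...); the aggregate "cpu" line is skipped.
        /^cpu[0-9]+ / {
            total = 0
            for (i = 2; i <= 9; i++) total += $i    # user .. steal
            if (NR == FNR) {                        # first file: store the baseline
                t0[$1] = total
                w0[$1] = $6                         # column 6 is iowait
                next
            }
            dt = total - t0[$1]
            dw = $6 - w0[$1]
            if (dt > 0) printf "%s %.1f\n", $1, 100 * dw / dt
        }
    ' "$1" "$2"
}

# Example on the affected machine:
#   cat /proc/stat > /tmp/stat.a; sleep 5; cat /proc/stat > /tmp/stat.b
#   iowait_delta /tmp/stat.a /tmp/stat.b
```

Running this in a loop and appending the output to a file gives a record that can be lined up against the times the machine froze.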
|
||||
|
That's a relatively high-powered system, so you'd think it would fly. Having 8 Gig of RAM makes a big difference, too.

Try latencytop; it's in the Build repositories. I've never tried posting a "ymp" link here; see if this works: http://software.opensuse.org/ymp/dev...latencytop.ymp. If not, go to software.opensuse.org, click the "search" item on the left, and enter "latencytop" in the search box.

See if latencytop will give you some idea of what's happening and post back here. I'm intrigued.
|
|||
|
Quote:
Quote:
Quote:
I assume that this is in milliseconds, as that is the unit shown in the next window when I click on each of these entries. 3.5 seconds is a long time in computer terms.

Whoops, this machine just paused again, and this time md0_raid1 (specifically "Raid resync kernel thread") was up at 17700 ms.

I wonder if it is indeed a disk issue. I might check the SMART stats and see if one of the disks is having issues.

Thanks again for the heads up on the latencytop tool.

Cheers
Jim
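
Since latencytop singles out the md0_raid1 thread, one quick thing to rule in or out is whether the array is actually resyncing at the time. A minimal sketch, assuming the usual /proc/mdstat layout (an `mdN :` header line followed, during a sync, by a progress line such as `resync = 12.6%`); `md_sync_status` is a hypothetical helper name.

```shell
# Read /proc/mdstat-style text on stdin and print one line per array
# that is currently syncing, e.g. "md0: resync = 12.6%".
# Prints nothing when no array reports sync progress.
md_sync_status() {
    awk '
        /^md[0-9]+ *:/ { dev = $1 }     # remember which array we are in
        match($0, /(resync|recovery|check|reshape) *= *[0-9.]+%/) {
            print dev ": " substr($0, RSTART, RLENGTH)
        }
    '
}

# Example on the affected machine:
#   md_sync_status < /proc/mdstat
```

Empty output during a freeze would suggest the latency is not caused by a resync pass, although it would not rule out the disks themselves.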
|
|||
|
Well, this has me puzzled.

I ran some tests on the disks using the smartctl utility. All disks pass and no errors are reported.

I then used hdparm -T to test the cached read performance. This also seems fine, with each disk reporting throughput of approx. 1.6 GB/s. Even /dev/md0 reports 1.5 GB/s, and this was during a period of high wait states. I am guessing that, as its output paused, hdparm is also freezing and so does not count the stalled time. I had top running and the wait states went to around 80% or so on 3 of the 4 processors.

When I get a chance later tonight I am going to have a closer look at the BIOS settings. Reading the manual, I notice there is a series of overclocking settings, and I wonder if the machine builder got a bit too keen on some of them.

Jim

Test results to follow (too big to fit in this post) ...
|
|||
|
Output from tests (Part 1):
# hdparm -T /dev/sda
/dev/sda:
Timing cached reads: 3202 MB in 2.00 seconds = 1600.75 MB/sec
/dev/sda:
Timing cached reads: 3238 MB in 2.00 seconds = 1619.27 MB/sec
/dev/sda:
Timing cached reads: 3286 MB in 2.00 seconds = 1642.61 MB/sec
/dev/sda:
Timing cached reads: 3162 MB in 2.00 seconds = 1581.46 MB/sec
/dev/sda:
Timing cached reads: 3118 MB in 2.00 seconds = 1558.72 MB/sec
/dev/sda:
Timing cached reads: 3234 MB in 2.00 seconds = 1617.50 MB/sec
# hdparm -T /dev/sdb
/dev/sdb:
Timing cached reads: 3248 MB in 2.00 seconds = 1624.35 MB/sec
/dev/sdb:
Timing cached reads: 3250 MB in 2.00 seconds = 1624.96 MB/sec
/dev/sdb:
Timing cached reads: 3240 MB in 2.00 seconds = 1619.67 MB/sec
# hdparm -T /dev/sdc
/dev/sdc:
Timing cached reads: 3238 MB in 2.00 seconds = 1618.84 MB/sec
/dev/sdc:
Timing cached reads: 3268 MB in 2.00 seconds = 1633.55 MB/sec
/dev/sdc:
Timing cached reads: 3276 MB in 2.00 seconds = 1638.22 MB/sec
/dev/sdc:
Timing cached reads: 3258 MB in 2.00 seconds = 1629.39 MB/sec
# hdparm -T /dev/sdd
/dev/sdd:
Timing cached reads: 3224 MB in 2.00 seconds = 1611.91 MB/sec
/dev/sdd:
Timing cached reads: 3222 MB in 2.00 seconds = 1610.95 MB/sec
/dev/sdd:
Timing cached reads: 3226 MB in 2.00 seconds = 1612.82 MB/sec
/dev/sdd:
Timing cached reads: 3258 MB in 2.00 seconds = 1629.37 MB/sec
# hdparm -T /dev/md0
/dev/md0:
Timing cached reads: 3200 MB in 2.00 seconds = 1599.65 MB/sec
/dev/md0:
Timing cached reads: 3264 MB in 2.00 seconds = 1631.64 MB/sec
/dev/md0:
Timing cached reads: 3162 MB in 2.00 seconds = 1580.80 MB/sec
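
To compare repeated runs like these at a glance, the timings can be averaged per device. A sketch only, assuming hdparm's output format as shown above; `hdparm_avg` is a made-up helper name. One caveat from the hdparm man page: `-T` times reads from the Linux buffer cache without disk access, so these figures mostly reflect processor, cache and memory throughput. The `-t` option times buffered reads from the device itself and may say more about the drives.

```shell
# Read hdparm -T (or -t) output on stdin and print, per device:
#   <device> <average MB/sec> <number of runs>
# (hdparm_avg is a hypothetical helper name.)
hdparm_avg() {
    awk '
        /^\/dev\// { dev = $1; sub(/:$/, "", dev) }          # "/dev/sda:" -> "/dev/sda"
        /Timing .* reads/ { sum[dev] += $(NF - 1); n[dev]++ } # value just before "MB/sec"
        END { for (d in sum) printf "%s %.2f %d\n", d, sum[d] / n[d], n[d] }
    ' | sort
}

# Example (needs root):
#   for i in 1 2 3; do hdparm -T /dev/sda; done | hdparm_avg
```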
|
|||
|
Output of tests (Part 2):
Code:
# smartctl -a /dev/sda
smartctl 5.39 2008-10-24 22:33 [x86_64-suse-linux-gnu] (openSUSE RPM)
Copyright (C) 2002-8 by Bruce Allen, smartmontools Home Page (last updated $Date: 2009-09-14 01:43:11 +0200 (Mon, 14 Sep 2009) $)
=== START OF INFORMATION SECTION ===
Device Model: WDC WD10EADS-00M2B0
Serial Number: WD-WCAV51010226
Firmware Version: 01.00A01
User Capacity: 1,000,204,886,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Sun Oct 4 11:50:50 2009 NZDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (20400) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 235) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x303f) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 115 112 021 Pre-fail Always - 7241
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 16
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 903
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 14
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 11
193 Load_Cycle_Count 0x0032 198 198 000 Old_age Always - 7205
194 Temperature_Celsius 0x0022 118 114 000 Old_age Always - 29
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
# smartctl -a /dev/sdb
smartctl 5.39 2008-10-24 22:33 [x86_64-suse-linux-gnu] (openSUSE RPM)
Copyright (C) 2002-8 by Bruce Allen, smartmontools Home Page (last updated $Date: 2009-09-14 01:43:11 +0200 (Mon, 14 Sep 2009) $)
=== START OF INFORMATION SECTION ===
Device Model: WDC WD10EADS-00M2B0
Serial Number: WD-WCAV51028064
Firmware Version: 01.00A01
User Capacity: 1,000,204,886,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Sun Oct 4 11:52:54 2009 NZDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (21600) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 248) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x303f) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 109 109 021 Pre-fail Always - 7508
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 16
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 903
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 14
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 11
193 Load_Cycle_Count 0x0032 198 198 000 Old_age Always - 7091
194 Temperature_Celsius 0x0022 121 112 000 Old_age Always - 26
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
|
|||
|
Output of tests (Part 3)
Code:
# smartctl -a /dev/sdc
smartctl 5.39 2008-10-24 22:33 [x86_64-suse-linux-gnu] (openSUSE RPM)
Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Device Model: WDC WD10EADS-00M2B0
Serial Number: WD-WCAV51010266
Firmware Version: 01.00A01
User Capacity: 1,000,204,886,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Sun Oct 4 11:52:59 2009 NZDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (21600) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 248) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x303f) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 117 117 021 Pre-fail Always - 7133
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 16
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 902
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 14
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 12
193 Load_Cycle_Count 0x0032 196 196 000 Old_age Always - 14632
194 Temperature_Celsius 0x0022 121 116 000 Old_age Always - 26
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
# smartctl -a /dev/sdd
smartctl 5.39 2008-10-24 22:33 [x86_64-suse-linux-gnu] (openSUSE RPM)
Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Device Model: WDC WD10EADS-00M2B0
Serial Number: WD-WCAV51025602
Firmware Version: 01.00A01
User Capacity: 1,000,204,886,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Sun Oct 4 11:53:04 2009 NZDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (19200) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 221) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x303f) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 107 107 021 Pre-fail Always - 7608
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 16
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 897
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 14
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 12
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 2208
194 Temperature_Celsius 0x0022 121 116 000 Old_age Always - 26
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
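
Comparing four near-identical dumps by eye is tedious, so it can help to pull out a single attribute's raw value per disk. A small sketch, assuming the attribute table layout shown in these listings (attribute name in the second column, raw value in the tenth); `smart_raw` is a hypothetical helper name.

```shell
# Read smartctl attribute output on stdin and print the RAW_VALUE
# of the attribute named in $1 (e.g. Load_Cycle_Count).
# (smart_raw is a hypothetical helper name.)
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Example (needs root): one line per disk for easy comparison.
#   for d in sda sdb sdc sdd; do
#       printf '%s ' "$d"
#       smartctl -A "/dev/$d" | smart_raw Load_Cycle_Count
#   done
```

Run against the four listings in this thread, this puts values such as Load_Cycle_Count or Power_On_Hours side by side, which makes any disk that differs from its siblings easier to spot.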
|
|
|||
|
Checked the BIOS settings, and all the overclocking options were set to their default of Auto, so nothing suspicious there.

One thing the BIOS does have is a Hardware Monitor screen showing things like fan speeds and CPU temps. The CPU is running at approx. 36 C, so not too hot.

I would not worry too much about the CPU wait states being high except that the whole machine freezes for seconds at a time, which makes it almost unusable when trying to do real work.

Jim
|
|||
|
After the following changes I still cannot find what is going on.

Running latencytop over several days and watching it during hangs, I noticed that it was showing high latency during disk "stuff": fsync(), Writing page to disk, etc. I also noticed it mentions Page Fault, which I know as a memory operation to do with swap, but the machine has 8 GB of memory and the swap is empty, so I might be wrong there.

Anyway, I have tried the following, with results included:

Found a mailing list post at Linux-Kernel Archive: "Re: Finding what is stuck..." which mentions using:
echo noop > /sys/block/sda/queue/scheduler
which changes the scheduler to noop instead of CFQ. Did this on all disks.
No difference. Changed back to CFQ.
=//=
Mounted the disks as ext2.
No difference. Mounted back as ext3.
=//=
Reran the memory test: 7 passes in 9 hours or so.
No errors.
=//=
Started the machine using the Fail Safe option.
No difference. Rebooted back to normal.
=//=
Updated the BIOS from v1001 to v1004.
No difference.
=//=
Updated the CPU microcode, downloaded from the Intel website.
No difference.
=//=

Now I am running out of ideas...

Jim
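
Switching the scheduler on every disk by hand is easy to get wrong, and a misspelled sysfs path fails without changing anything. The sketch below shows one way to check and set it for all sd* disks. It assumes the sysfs layout /sys/block/<disk>/queue/scheduler, where the active scheduler is the bracketed entry; `show_schedulers` and `set_scheduler` are made-up helper names.

```shell
# Print "<disk> <active scheduler>" for every sd* disk.
# $1 = sysfs block directory (defaults to /sys/block).
show_schedulers() {
    root=${1:-/sys/block}
    for f in "$root"/sd*/queue/scheduler; do
        [ -r "$f" ] || continue
        dev=${f#"$root"/}       # strip the leading directory
        dev=${dev%%/*}          # keep only the disk name, e.g. sda
        # The active scheduler is the bracketed entry,
        # e.g. "noop anticipatory deadline [cfq]" -> cfq
        active=$(sed -n 's/.*\[\(.*\)\].*/\1/p' "$f")
        printf '%s %s\n' "$dev" "$active"
    done
}

# Select scheduler $1 on every sd* disk (needs root on a real system).
# $2 = sysfs block directory (defaults to /sys/block).
set_scheduler() {
    root=${2:-/sys/block}
    for f in "$root"/sd*/queue/scheduler; do
        [ -w "$f" ] || continue
        echo "$1" > "$f"
    done
}

# Example (as root):
#   set_scheduler noop && show_schedulers
```

Running `show_schedulers` after each change confirms that the setting actually took effect on all four disks before drawing a "no difference" conclusion.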