Crazy idle load average

Greetings Suse community. I’m looking for some advice finding the problem with a work server. I’m new to Linux sys admin work so I don’t know all the commands and ins and outs. I believe this is a hardware issue so I’m posting it here. I was given an HP ProLiant server that was previously taken out of use due to unreliable performance. While working, the load averages would spike and it would lag out. I pulled each HD and formatted it. I also ran memtest for about 9 passes and it didn’t find any issues.

I installed a newer release of openSUSE (compared to what the server had) and installed the call center software (ViciDial) on it. While sitting idle the load averages are insane. The example below was from last night. I’ve watched it go up to a load of 40, where I no longer had any control, or down to 0.5 (the lowest I’ve seen). The server was idle and not in use. Can someone please give me ideas on where to look for the trouble? Thanks

ViCiCluster:~ # top
top - 01:24:17 up 16:37, 6 user, load average: 16.84, 20.66, 14.78
Tasks: 163 total, 1 running, 162 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0%us, 0.3%sy, 0.0%ni, 0.0%id, 99.7%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.3%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 0.0%us, 0.0%sy, 0.0%ni, 0.0%id, 100.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 0.3%us, 0.0%sy, 0.0%ni, 0.0%id, 99.7%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 0.3%us, 0.0%sy, 0.0%ni, 0.0%id, 99.7%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 0.3%us, 0.6%sy, 0.0%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 0.0%us, 0.3%sy, 0.0%ni, 0.0%id, 99.3%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 12038M total, 3473M used, 8564M free, 117M buffers
Swap: 4101M total, 0M used, 4101M free, 2807M cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20466 root 20 0 8860 1168 860 R 1 0.0 1:45.46 top
27520 mysql 20 0 3933m 284m 6528 S 1 2.4 169:53.82 mysqld
1 root 20 0 36972 4172 2000 S 0 0.0 0:03.71 systemd
2 root 20 0 0 0 0 S 0 0.0 0:00.00 kthreadd
3 root 20 0 0 0 0 S 0 0.0 0:00.14 ksoftirqd/0
5 root 20 0 0 0 0 S 0 0.0 0:00.00 kworker/u:0
6 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/0
7 root RT 0 0 0 0 S 0 0.0 0:00.11 watchdog/0
8 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/1
10 root 20 0 0 0 0 S 0 0.0 0:00.22 ksoftirqd/1

Hello and welcome here.

As a newcomer you could not have known this, but to avoid the broken layout in computer text, you need to copy/paste it between CODE tags. To get CODE tags, click the # button in the tool bar of the post editor.

Thank you for looking into this.

> Greetings Suse community.

-=WELCOME=- new poster

> I’m looking for some advice finding the problem with a work server.

by “work server” do you mean a server at your work place, or a server
you intend to put to work (at home?)

> I’m new to linux sys admin work so I don’t know all the commands and in/outs.

are you an Oracle, Solaris or something else admin?? Windows maybe?
did you go to school for that?

> I believe this is a hardware issue so
> I’m posting it here. I was given an hp proliant server

is there a model number on that server

> that was
> previously taken out of use due to unreliable performance. While working
> the load averages would spike it would lag out.

i’ve never heard that expression before, please explain what you mean
by “lag out”? did the machine throw an error, what was it? did it
freeze? LED flash? what is “lag out”?

and, “while working” what? serving a video stream, crunching the wave
propagation of a nuclear explosion, serving http, what?

what was the memory usage when it would “lag out”? did top show a few
processes using all the CPU? were there any hints in the logs?

> I pulled each HD and formatted it

how many harddrives? formatted to what file system…

> I also ran memtest for about 9 passes and it didn’t find
> any issues.

how much RAM do you have? if you have just 1 GB you need to run
memtest overnight, at least…9 passes is nothing.

> I installed the newer release of openSUSE (compared to what the server
> had)

ok, can you tell us what ran before and which version you have
installed now?

from DVD or Live CD?

and, did you install a desktop environment?

> and installed the call center software on it (ViciDial). While
> sitting idle the load averages are insane.

ok, all you copied and pasted in is unusable…look at how it looks
in your posting–impossible to read (but i know you didn’t know it
would turn out looking like that), so do it again but this time put
the data inside “code tags” by following the directions here:
http://goo.gl/i3wnr

and, please tell us more specifics on your hardware–is it in
warranty? in what year was the set first put to use? how about the
power supply units, have they been tested and are they giving steady
power of sufficient quantity? (i see you have 8 CPUs and i wonder how
many hard disks and i wonder if power is clean and steady)

if the machine was taken out of service because it was unreliable was
there any indication it was a software problem, at all? is it doing
now as it did with the earlier software? if so why not send it to a
repair shop first?

well, have you looked to see if the logs (in /var/log/messages) give
any hints, like errors or warnings?

since it seems broken on two different sets of software…there seems
to be no sense in wasting time if the hardware is broken–and
deciding what might be from here would be nothing more than a wild
guess:

  • ram
  • CPU(s)
  • loose ground(s)
  • leaking capacitors
  • bad cables
  • cracked circuit board
  • power supply unit
  • network adapter

but, a competent repair shop should . . .


dd
openSUSE®, the “German Engineered Automobile” of operating systems!
http://tinyurl.com/DD-Caveat

On Sat, 16 Feb 2013 20:36:02 +0000, blacknexus wrote:

> The example below was from last night. I’ve watched it go up to a load
> of 40 where I no longer had any control or down to .5 (lowest I’ve
> seen). The server was idle and not in use. Can someone please give me
> ideas on where to look for the trouble?

Please post the output of commands in code tags (advanced editor->#
button) to help with readability.

Load averages have to do with I/O blocking processes from running,
generally. High load averages are symptomatic of a disk I/O bottleneck,
and that’s not something top is going to help identify because top
reports on CPU-bound processes.

iostat can help identify the device that is having the performance
bottleneck. Run that command and note the %iowait value in the avg-cpu
summary; when the loadavg is high, you /should/ see a high %iowait there,
and the per-device lines below it should show which device is busy.
That’ll help identify what’s going on.

You might also run it with the -x parameter to get some extended
statistics. If the %iowait value is low, perhaps a high await value
(shown with -x) might indicate something else is going on that needs to
be diagnosed.

The good news is that it looks like it isn’t swap that’s causing the
problem - looks like about 12 GB of memory in the system and no swap in
use, so that’s not causing a bottleneck.
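A side note for readers wondering how the load can be this high while the CPUs sit idle: on Linux the load average counts tasks that are runnable plus tasks in uninterruptible sleep (state D, typically waiting on disk I/O). The following is only a minimal Python sketch of how one might check that on a live system. The helper names (parse_loadavg, parse_stat, blocked_tasks) are made up for illustration, and it only looks at the main thread of each process, not every thread under /proc/<pid>/task.

```python
import os

def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    fields = text.split()
    return tuple(float(x) for x in fields[:3])

def parse_stat(text):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command sits in parentheses and may itself contain spaces or
    # parentheses, so split on the first "(" and the last ")".
    start = text.index("(")
    end = text.rindex(")")
    pid = int(text[:start])
    command = text[start + 1:end]
    state = text[end + 1:].split()[0]
    return pid, command, state

def blocked_tasks(proc="/proc"):
    """List (pid, command) for processes currently in uninterruptible sleep."""
    blocked = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as handle:
                pid, command, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were looking
        if state == "D":
            blocked.append((pid, command))
    return blocked

if __name__ == "__main__":
    with open("/proc/loadavg") as handle:
        print("load averages:", parse_loadavg(handle.read()))
    for pid, command in blocked_tasks():
        print("uninterruptible (D):", pid, command)
```

If the load is high and the list of D-state tasks is long while CPU usage stays near zero, that points at something the tasks are waiting on (usually storage) rather than at CPU demand.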

Jim


Jim Henderson
openSUSE Forums Administrator
Forum Use Terms & Conditions at http://tinyurl.com/openSUSE-T-C

Thank you for the replies.

Answer to hcvv: much appreciated. In the future I will use the CODE tag.

Answer to dd:

  1. It’s a server at work.
  2. I come from a Windows-based software background.
  3. Hardware stats:
    Xeon E5345 x 2
    12 GB RAM
    750 GB hot swap HD x 4 (currently only using one I believe, testing purposes, no RAID)
    Dual NIC, dual power supply
  4. Lag out meaning that while connected over SSH I couldn’t do anything. I couldn’t CTRL + C out of top. Had to wait for the load average to drop.
  5. I formatted them as NTFS, then let Linux format them when installing SUSE.
  6. Current version is openSUSE 12.1, downloaded from the site, burned to disc, installed from disc. No desktop environment. Unknown previous version. Previous employee was found stealing and was fired. I was brought in later.
  7. From my top info it doesn’t show CPU use by any process.
  8. Expired warranty

A lot of questions but I believe I answered most of them.

Answer to hendersj:

I used top to watch the CPUs. From some searching I did last night I decided to use iotop. The only thing that shows up consistently with usage is below. It may be 20%, it may be 99% (usually 99). It only shows up for a second and is gone, but it returns a second later. Since it’s not constant I’m not sure it’s a concern.

 TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
  338 be/3 root        0.00 B/s    0.00 B/s  0.00 % 23.48 % [jbd2/cciss!c0d0]

Thank you for the suggestions hendersj, I’ll work on those today.

On Sun, 17 Feb 2013 00:36:01 +0000, blacknexus wrote:

> Thank you for the suggestions hendersj, I’ll work on those today.

Glad to help out. iotop was a tool I was thinking of but couldn’t
remember the name of. :-)

Jim



> last night I decided to use iotop

i’m pretty much stymied, but since you are not seeing any hints in
top or iotop it is probably gonna be beneficial to have a look at the
logs…i know those are daunting but i guess there must be some
bread crumbs to problems in there…

you can look yourself easily with this command in a user terminal (if
already in root terminal, just leave out ‘sudo’)


sudo tail -n100 /var/log/messages | less

that will ask you for the root password and show the last 100 lines
(wanna see 200? go for it; 10 lines is the default if you omit the
-nXXX) of the file named messages, and ‘less’ lets you scroll up and
down…

press q to Quit the scroll and return to command prompt.

i kinda expect you to see hundreds (or thousands) of identical (or
nearly so) lines of complaints/errors/warnings from or about
something going on and . . .

capture some representatives of those complaints and post them
between code tags in this thread (anytime in these forums if you have
more than a reasonable amount of stuff (or images) to post, use the
facility at http://susepaste.org/ and return the URL to the thread)…
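For readers who would rather not page through the log by hand, here is a rough Python sketch of the "look for hundreds of near-identical lines" idea. It assumes the traditional syslog layout ("Feb 17 10:30:33 host tag: message"), which may not match every setup, and the function names (normalise, most_repeated) are invented for this example.

```python
import re
from collections import Counter

def normalise(line):
    """Drop the timestamp and host, and blur numbers so similar lines group."""
    # Assumed layout: month, day, time, host, then the rest of the message.
    parts = line.strip().split(None, 4)
    message = parts[4] if len(parts) == 5 else line.strip()
    message = re.sub(r"0x[0-9a-fA-F]+", "HEX", message)
    return re.sub(r"\d+", "N", message)

def most_repeated(lines, top=10):
    """Return the `top` most frequent normalised messages with their counts."""
    counts = Counter(normalise(line) for line in lines if line.strip())
    return counts.most_common(top)

if __name__ == "__main__":
    # Reading /var/log/messages normally needs root.
    with open("/var/log/messages", errors="replace") as handle:
        for message, count in most_repeated(handle):
            print(f"{count:6d}  {message}")
```

Anything that shows up with a count in the hundreds or thousands is a good candidate to paste (a few representative lines, in code tags) into the thread.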

maybe someone can figure out what is going on…otherwise if there is
nothing in the logs i’m pretty sure you are gonna need a mechanic
with some test tools to find the eFlaw…

oh! maybe atop will show something…it adds a view into networking,
maybe those two NICs are at war and using up all the ‘load’…


dd

I left it idle all Saturday and just watched top. Load average was between 0.5 and 1.5 at all times. Higher than I believe it should be, but not terribly. This morning (Sunday) I had it back up the SQL DB. This is what I recorded with top, iotop, and iostat.

To me it almost appears to be doing something else, such as buffering the data, while the load average increases, and then when it actually starts some I/O action the load begins to decrease. Hope this information is useful.

As for the log comment from dd, I’ll be looking into that next.

Top Information

ViCiCluster:~ # top c
top - 10:30:33 up 1 day,  7:59,  1 user,  load average: 77.04, 52.31, 25.79
Tasks: 312 total,   1 running, 311 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,100.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,100.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.7%sy,  0.0%ni,  0.0%id, 99.3%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,100.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.7%us,  1.6%sy,  0.0%ni,  0.0%id, 97.7%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:     12038M total,     4873M used,     7165M free,      167M buffers
Swap:     4101M total,        0M used,     4101M free,     3594M cached


  PID USER      PR  NI  VIRT  RES  SHR S   %CPU %MEM    TIME+  COMMAND
16378 root      20   0  8992 1428  960 R      0  0.0   0:02.46 top c
27520 mysql     20   0 3933m 642m 6644 S      0  5.3  70:23.76 /usr/sbin/mysqld
    1 root      20   0 36972 4172 2000 S      0  0.0   0:05.89 /sbin/init showop
    2 root      20   0     0    0    0 S      0  0.0   0:00.01 [kthreadd]
    3 root      20   0     0    0    0 S      0  0.0   0:00.43 [ksoftirqd/0]
    5 root      20   0     0    0    0 S      0  0.0   0:00.00 [kworker/u:0]
    6 root      RT   0     0    0    0 S      0  0.0   0:00.00 [migration/0]
    7 root      RT   0     0    0    0 S      0  0.0   0:00.34 [watchdog/0]
    8 root      RT   0     0    0    0 S      0  0.0   0:00.00 [migration/1]
   10 root      20   0     0    0    0 S      0  0.0   0:00.51 [ksoftirqd/1]

Backup Failed, Second attempt


ViCiCluster:~ # top c
top - 10:39:47 up 1 day,  8:09,  3 users,  load average: 24.47, 21.29, 20.13
Tasks: 198 total,   1 running, 197 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,100.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,100.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,100.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,100.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,100.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,100.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  3.0%us,  0.4%sy,  0.0%ni, 96.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.3%us,  0.3%sy,  0.0%ni,  0.0%id, 99.3%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:     12038M total,     4632M used,     7405M free,      167M buffers
Swap:     4101M total,        0M used,     4101M free,     3463M cached


  PID USER      PR  NI  VIRT  RES  SHR S   %CPU %MEM    TIME+  COMMAND
18157 root      20   0 61176  10m 3416 S      2  0.1   0:11.63 /usr/bin/python /
   69 root      20   0     0    0    0 S      0  0.0   0:03.89 [kworker/1:2]
16378 root      20   0  8992 1428  960 R      0  0.0   0:03.60 top c
27520 mysql     20   0 3933m 648m 6644 S      0  5.4  71:13.38 /usr/sbin/mysqld
    1 root      20   0 36972 4172 2000 S      0  0.0   0:05.91 /sbin/init showop
    2 root      20   0     0    0    0 S      0  0.0   0:00.01 [kthreadd]
    3 root      20   0     0    0    0 S      0  0.0   0:00.43 [ksoftirqd/0]
    5 root      20   0     0    0    0 S      0  0.0   0:00.00 [kworker/u:0]
    6 root      RT   0     0    0    0 S      0  0.0   0:00.00 [migration/0]
    7 root      RT   0     0    0    0 S      0  0.0   0:00.34 [watchdog/0]
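A note on reading the two top captures above: almost every CPU is parked in I/O wait (the %wa field) rather than doing work. As a hedged illustration only, this Python sketch pulls the %wa value out of per-CPU lines in the older top layout shown in this thread ("Cpu0 : 0.0%us, ..."). Newer top versions print these lines differently, and the function names (iowait_by_cpu, stalled_cpus) and the 90% threshold are assumptions made for the example.

```python
import re
import sys

# Matches lines such as "Cpu3  :  0.0%us,  0.7%sy, ... 99.3%wa, ..."
CPU_LINE = re.compile(r"^Cpu(\d+)\s*:\s*(.*)$")
FIELD = re.compile(r"([\d.]+)%(\w+)")

def iowait_by_cpu(top_text):
    """Map CPU number to its %wa value for every per-CPU line found."""
    result = {}
    for line in top_text.splitlines():
        match = CPU_LINE.match(line.strip())
        if match:
            fields = {name: float(value)
                      for value, name in FIELD.findall(match.group(2))}
            result[int(match.group(1))] = fields.get("wa", 0.0)
    return result

def stalled_cpus(top_text, threshold=90.0):
    """Return the CPU numbers whose %wa is at or above the threshold."""
    waits = iowait_by_cpu(top_text)
    return sorted(cpu for cpu, wa in waits.items() if wa >= threshold)

if __name__ == "__main__":
    # Paste or pipe a saved top capture into standard input.
    print("CPUs stuck in I/O wait:", stalled_cpus(sys.stdin.read()))
```

Fed the second capture, this would list every CPU except Cpu6, which matches what the eye sees: the machine is waiting on storage, not short of CPU.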

iotop Information

While Load Average is increasing

ViCiCluster:~ # iotop
Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
    1 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % init showopts
    2 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kthreadd]
    3 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/0]
    5 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kworker/u:0]
    6 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/0]
    7 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/0]
    8 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/1]
   10 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/1]
   11 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kworker/0:1]
   12 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/1]
   13 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/2]
   15 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/2]
   16 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/2]
   17 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/3]
   19 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/3]
   20 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/3]
   21 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/4]
   23 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/4]
   24 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/4]
   25 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/5]
   27 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/5]
   28 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/5]

While Load Average is decreasing

ViCiCluster:~ # iotop
Total DISK READ: 0.00 B/s | Total DISK WRITE: 94.09 K/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
  338 be/3 root        0.00 B/s   54.89 K/s  0.00 % 72.70 % [jbd2/cciss!c0d0]
  763 be/4 root        0.00 B/s    3.92 K/s  0.00 %  0.00 % rsyslogd ~yslog.conf
    1 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % init showopts
    2 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kthreadd]
    3 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/0]
    5 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kworker/u:0]
    6 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/0]
    7 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/0]
    8 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/1]
   10 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/1]
   11 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kworker/0:1]
   12 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/1]
   13 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/2]
   15 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/2]
   16 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/2]
   17 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/3]
   19 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/3]
   20 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/3]
   21 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/4]
   23 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/4]
   24 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/4]
   25 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/5]

iostat Information

ViCiCluster:~ # iostat -x
Linux 3.1.10-1.16-default (ViCiCluster)         02/17/13        _x86_64_        (8 CPU)


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.76    0.00    0.13    5.44    0.00   93.67


Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
cciss/c0d0        0.04     2.86    0.19    2.05    20.91    65.06    76.60    10.63 4010.17  195.10 4370.00 153.84  34.53
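Worth pointing out in the iostat output above: await is reported in milliseconds, so requests to cciss/c0d0 were taking about 4 seconds on average, at a very low transfer rate. Also, a plain `iostat -x` reports averages since boot; giving it an interval (for example `iostat -x 5`) shows what is happening right now. The Python sketch below is a minimal, assumption-laden way to pick slow devices out of such output. The function names and the 100 ms threshold are illustrative, not anything from iostat itself.

```python
SAMPLE = """\
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
cciss/c0d0        0.04     2.86    0.19    2.05    20.91    65.06    76.60    10.63 4010.17  195.10 4370.00 153.84  34.53
"""

def parse_iostat_devices(text):
    """Parse the device table of `iostat -x` output into {device: {column: value}}."""
    devices = {}
    columns = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("Device"):
            columns = fields[1:]  # column names follow the "Device:" label
        elif columns and len(fields) == len(columns) + 1:
            try:
                values = [float(x) for x in fields[1:]]
            except ValueError:
                continue  # not a data row
            devices[fields[0]] = dict(zip(columns, values))
    return devices

def slow_devices(devices, await_ms=100.0):
    """Return device names whose average wait exceeds await_ms (an arbitrary cutoff)."""
    return sorted(name for name, stats in devices.items()
                  if stats.get("await", 0.0) > await_ms)

if __name__ == "__main__":
    parsed = parse_iostat_devices(SAMPLE)
    for name in slow_devices(parsed):
        print(name, "await:", parsed[name]["await"], "ms")
```

The sample text is the device table posted above, so the sketch flags cciss/c0d0, the Smart Array logical drive that the jbd2 journal thread in the iotop output was also writing to.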

Just an update. From my previous deduction of the delay being between the command and the I/O execution of the command I went back and looked at the ram again. I did find a stick was slowing everything down but didn’t throw an error. I removed it and the server is operating beautifully. Thanks for all the help guys! I’m still learning the sys admin ropes.

On 2013-02-19 21:26, blacknexus wrote:
>
> Just an update. From my previous deduction of the delay being between
> the command and the I/O execution of the command I went back and looked
> at the ram again. I did find a stick was slowing everything down but
> didn’t throw an error.

How did you notice that?


Cheers / Saludos,

Carlos E. R.
(from 12.1 x86_64 “Asparagus” at Telcontar)

On Tue, 19 Feb 2013 20:26:01 +0000, blacknexus wrote:

> Just an update. From my previous deduction of the delay being between
> the command and the I/O execution of the command I went back and looked
> at the ram again. I did find a stick was slowing everything down but
> didn’t throw an error. I removed it and the server is operating
> beautifully. Thanks for all the help guys! I’m still learning the sys
> admin ropes.

Fantastic, glad to hear you found the cause. :-)

Jim



Use iotop to watch the disk read/write speeds.

On 2013-03-04 18:06, blacknexus wrote:
>
> robin_listas;2528629 Wrote:
>> On 2013-02-19 21:26, blacknexus wrote:
>>>
>>> Just an update. From my previous deduction of the delay being between
>>> the command and the I/O execution of the command I went back and looked
>>> at the ram again. I did find a stick was slowing everything down but
>>> didn’t throw an error.
>>
>> How did you notice that?
>
> Use iotop to watch the disk read/write speeds.

Yes, I know you used iotop, but I don’t see how you related that to a
bad ram stick. Just curious. :-?


Cheers / Saludos,

Carlos E. R.
(from 11.4, with Evergreen, x86_64 “Celadon” (Minas Tirith))