Hello,
We have a four-socket AMD machine running Barcelona processors, with 64 GB of RAM.
The system runs fine for extended periods as long as memory use stays at or below the 64 GB of physical RAM. A typical load has short periods where the machine uses heavy amounts of swap space (30+ GB). We have a swap partition of around 96 GB. When we push the machine into heavy swapping, it fails within 24 hours.
Has anyone experienced this problem and is there a solution other than buying more physical memory?
Or am I wrong and maybe the physical memory is the issue? I thought it might be the memory itself, but after stripping the memory down I get the same problem: failure upon heavy swapping.
very interesting…i wonder if you are running openSUSE…though i
reckon it is a generic Linux kernel issue (and not SUSE specific)…
or maybe not an issue at all…i mean, if you have (say) a bucket
which holds exactly 10 liters, i can predict it will “fail” right
about the time you begin pouring the eleventh liter of water in…and,
there will be water on the floor.
unless you have someone dipping in to take some out…and storing it
in a spare bucket (swap?)…
given that, you can pour 11, 12, 25 who knows how many liters, as long
as that someone dips fast enough to stay ahead of the one pouring…
but, with some confidence i can predict there will be water on the
floor if your dipper falls behind…(primary [RAM] and secondary
[swap] bucket failure)
alternatives?
-bigger bucket (buy RAM)
-dip out faster (hire more dippers [CPUs?] and/or tune kernel?)
-pour more slowly (don’t push into heavy swapping)
-employ magic (hmmmmmm, good luck)
did you custom compile the kernel?
did you see the same symptom with an earlier kernel?
have you asked in a place where a (real) kernel hacker might read and
respond? (i kinda doubt you find one here…certainly i have no idea
what i’m talking about…but, hang around and eventually someone who
knows might speak up…)
–
palladium
Ubuntu is an African word meaning “I can’t set up Debian.”
What applications are you running? Does anything have a serious memory leak that
is consuming all available RAM and swap space? Running out of memory can cause
the system to fail, even though the kernel has routines (the OOM killer) that
kill processes in an attempt to keep going.
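If you want a quick look at which processes are holding the most memory before setting up anything heavier, something like the following might help. It is only a rough sketch, not part of any package: it assumes a Linux /proc filesystem, and the function names (parse_status, top_processes) are made up for illustration. It reads the Name and VmRSS fields from /proc/<pid>/status and lists the largest resident processes; a process whose number keeps climbing between runs is a leak suspect.

```python
import os

def parse_status(text):
    """Return (name, rss_kb) from the text of a /proc/<pid>/status file.

    Kernel threads have no VmRSS line, so their resident size is reported as 0.
    """
    name, rss_kb = "?", 0
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key == "Name":
            name = value.strip()
        elif key == "VmRSS":
            rss_kb = int(value.split()[0])  # value looks like "  2048 kB"
    return name, rss_kb

def top_processes(limit=5, proc_root="/proc"):
    """Return up to `limit` (rss_kb, pid, name) tuples, largest resident size first."""
    if not os.path.isdir(proc_root):
        return []  # not a Linux-style system; nothing to report
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                name, rss_kb = parse_status(fh.read())
        except OSError:
            continue  # the process exited while we were scanning
        rows.append((rss_kb, int(entry), name))
    rows.sort(reverse=True)
    return rows[:limit]

if __name__ == "__main__":
    for rss_kb, pid, name in top_processes():
        print(f"{pid:>7} {rss_kb / 1024:>10.1f} MiB  {name}")
```

Run it every few minutes while your job is going and compare the output over time.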
You should install the “sysstat” package, which allows you to gather a large amount of system information: memory usage, swap usage, page faults, interrupts, kernel activity, CPU utilization, I/O stats, etc.
You start gathering information while running the job that (eventually) causes your system to fail; following the failure you will have a binary data file that you load into the graphical analysis tool “sysstat-isag” or “ksar”. This will provide you with an excellent way to peer inside the state of your system in the run-up to its failure. I use sysstat professionally to analyze the behaviour of very large clustered Linux systems (100-500 nodes) and we have found it to be an extremely useful tool.
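To give an idea of the kind of numbers worth recording, here is a small stand-in sketch. It is not sysstat and does not produce sysstat's file format; it just reads /proc/meminfo and /proc/vmstat directly, assuming a Linux kernel that exposes the SwapTotal, SwapFree, pswpin and pswpout fields. The function names and the log_swap helper are made up for illustration.

```python
import time

def parse_kv(text):
    """Parse 'Key:  123 kB' (meminfo) or 'key 123' (vmstat) lines into a dict of ints."""
    out = {}
    for line in text.splitlines():
        parts = line.replace(":", " ").split()
        if len(parts) >= 2 and parts[1].isdigit():
            out[parts[0]] = int(parts[1])
    return out

def swap_used_kb(meminfo):
    """Swap currently in use, in kB, from parsed /proc/meminfo."""
    return meminfo["SwapTotal"] - meminfo["SwapFree"]

def swap_rate(prev, cur, seconds):
    """Pages swapped (in, out) per second between two parsed /proc/vmstat samples."""
    return ((cur["pswpin"] - prev["pswpin"]) / seconds,
            (cur["pswpout"] - prev["pswpout"]) / seconds)

def read_proc(path):
    """Read one /proc file and parse it."""
    with open(path) as fh:
        return parse_kv(fh.read())

def log_swap(interval=10, count=6):
    """Print swap usage and swap-in/out rates `count` times, `interval` seconds apart."""
    prev = read_proc("/proc/vmstat")
    for _ in range(count):
        time.sleep(interval)
        cur = read_proc("/proc/vmstat")
        used_gb = swap_used_kb(read_proc("/proc/meminfo")) / (1024 * 1024)
        rate_in, rate_out = swap_rate(prev, cur, interval)
        print(f"{time.strftime('%H:%M:%S')} swap_used={used_gb:.1f}GiB "
              f"pswpin/s={rate_in:.0f} pswpout/s={rate_out:.0f}")
        prev = cur
```

Calling log_swap(interval=10, count=6) prints one line every ten seconds for a minute. Redirect the output to a file on a disk that survives the crash, so the last lines show what swap was doing just before the failure.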
Let me know if you need assistance – glad to provide more info if needed.