AGGH! This is almost unusable - it can take 10 seconds for a webpage to decide if it's going to respond to a click or other action!!
Thanks for the confidence! I am glad to hear they should work. I will have to remove the 128GB add-on to speed this system up to usability. But then I will follow some of your leads. I do have a call into Supermicro, but I likely won’t get more than general advice since I have Kingston ECC Server memory (they “don’t support” that brand).
0.000000] initial memory mapped : 0 - 20000000
0.000000] Base memory trampoline at [ffff880000099000] 99000 size 20480
0.000000] init_memory_mapping: 0000000000000000-00000000b7e60000
0.000000] init_memory_mapping: 0000000100000000-0000004048000000
0.000000] PM: Registered nosave memory: 000000000009e000 - 000000000009f000
0.000000] PM: Registered nosave memory: 000000000009f000 - 00000000000a0000
0.000000] PM: Registered nosave memory: 00000000000a0000 - 00000000000e8000
0.000000] PM: Registered nosave memory: 00000000000e8000 - 0000000000100000
0.000000] PM: Registered nosave memory: 00000000b7e60000 - 00000000b7e6e000
0.000000] PM: Registered nosave memory: 00000000b7e6e000 - 00000000b7e70000
0.000000] PM: Registered nosave memory: 00000000b7e70000 - 00000000b7e94000
0.000000] PM: Registered nosave memory: 00000000b7e94000 - 00000000b7ec0000
0.000000] PM: Registered nosave memory: 00000000b7ec0000 - 00000000b7ee0000
0.000000] PM: Registered nosave memory: 00000000b7ee0000 - 00000000b7eed000
0.000000] PM: Registered nosave memory: 00000000b7eed000 - 00000000b8000000
0.000000] PM: Registered nosave memory: 00000000b8000000 - 00000000e0000000
0.000000] PM: Registered nosave memory: 00000000e0000000 - 00000000f0000000
0.000000] PM: Registered nosave memory: 00000000f0000000 - 00000000ffe00000
0.000000] PM: Registered nosave memory: 00000000ffe00000 - 0000000100000000
0.000000] Your BIOS doesn't leave a aperture memory hole
0.000000] PM: Registered nosave memory: 00000000ac000000 - 00000000b0000000
0.000000] Memory: 264594056k/269615104k available (5809k kernel code, 1181768k absent, 3839280k reserved, 7759k data, 940k init)
0.000000] please try 'cgroup_disable=memory' option if you don't want memory cgroups
0.219721] Initializing cgroup subsys memory
5.522198] Freeing initrd memory: 12124k freed
6.203513] Non-volatile memory driver v1.3
7.154457] Freeing unused kernel memory: 940k freed
7.156217] Freeing unused kernel memory: 316k freed
7.162309] Freeing unused kernel memory: 2012k freed
8.957049] [drm] nouveau 0000:03:00.0: 0: memory 135MHz core 135MHz shader 270MHz voltage 850mV timing 0
8.957053] [drm] nouveau 0000:03:00.0: 1: memory 405MHz core 405MHz shader 810MHz voltage 900mV timing 1
8.957056] [drm] nouveau 0000:03:00.0: 3: memory 600MHz core 589MHz shader 1402MHz voltage 1000mV timing 3
8.957077] [drm] nouveau 0000:03:00.0: c: memory 405MHz core 405MHz shader 810MHz voltage 900mV
8.958423] [TTM] Zone kernel: Available graphics memory: 132304724 kiB.
8.958425] [TTM] Zone dma32: Available graphics memory: 2097152 kiB.
9.658981] PM: Basic memory bitmaps created
10.444891] PM: Basic memory bitmaps freed
patti@OS121-TY3:~/ModelE/modelE_AR5_v2_branch_04-30-2013/decks>
Hi Patti,
As for the mpi issue, do the last passages of point 15 of this post seem to apply?
(Though the page is for tuning IB for MPI, perhaps this applies to your situation as well?)
From your output, your locked-memory ulimit (ulimit -l) is set to the default of 64 kB. This could be causing mpi issues, and you can change it easily enough with:
ulimit -l unlimited
Particularly since you saw this with both your Fortran and GCM (C I assume) code - both with mpi - I think this may be the issue.
I believe you are running mpi on this one 48 core node? If so you should only have to set this once in your shell prior to running your job. (Note that if you run this on multiple nodes, you will have to run this on each node - perhaps using your job scheduler, etc?)
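To check whether the change actually took effect before launching the job, something like the following could be run in the same shell. This is only a minimal sketch: memlock_status is a hypothetical helper name made up for illustration, and it assumes a bash-like shell whose ulimit builtin accepts -l, -S and -H.

```shell
#!/bin/sh
# memlock_status is a hypothetical helper (not a standard command):
# it prints the soft and hard locked-memory limits of the current shell,
# in kB or as "unlimited".
memlock_status() {
    soft=$(ulimit -S -l)
    hard=$(ulimit -H -l)
    echo "soft=$soft hard=$hard"
}

# Limits before the change
memlock_status

# Try to raise the limit; for a non-root user this fails when the
# hard limit is lower than the requested value.
if ulimit -l unlimited 2>/dev/null; then
    echo "memlock limit raised for this shell"
else
    echo "could not raise memlock limit; hard limit is $(ulimit -H -l)"
fi

# Limits after the change
memlock_status
```

The new limit only applies to this shell and the processes started from it, so the mpi job has to be launched from the same session.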
Cheers,
LewsTherin
On 05/06/2013 01:16 AM, PattiMichelle wrote:
> I likely won’t get more than general advice since I
> have Kingston ECC Server memory (they “don’t support” that brand).
maybe for good reason!!
Kingston might be the best known big brand around, but afair it may
not be the most supported or trusted brand on the planet…
all RAM is not created equal.
–
dd
On 05/06/2013 01:06 AM, PattiMichelle wrote:
> Now I have added the 128GB
> back in (for 256GB total all ECC registered DDR3) and restarted
try instead to replace the original 128 with the new 128 (in the same
sockets you took the original out of)–how does the new 128 run when
compared to the old 128?
if they run the same then put the old 128 into the empty socket–if
it slows down again the problem is not RAM, but probably flaky
sockets (or wrongly setup Linux)
[along the way, if you ever touched a shiny RAM contact pin, clean
them all…a soft pencil eraser will do…or clean with Blue Stuff
or similar–then don’t touch’em again with bare fingers.]
–
dd
I just completed 2 days of intense CFD calculations (with 44 CPUs) - the memory had no problems
On 05/06/2013 08:26 PM, PattiMichelle wrote:
> I just completed 2 days of intense CFD calculations (with 44 CPUs) -
> the memory had no problems
so you are now running 256 with full speed?
and, what was done to make that happen?
–
dd
No, in order to test, I accepted sloooowwww KDE. I guess now I’ll try interchanging the two batches of 128GB memory, but I don’t think that will reveal anything. It has to be some sort of setting in either the kernel or KDE that does this. Maybe a cache size or something. The speed of KDE is not affected by number of processors in use. It’s just as fast with 0, 4, or 44 maxed-out OpenMPI threads running. So it’s not an Opteron/hypertransport bottleneck problem.
Is it possible that one batch of 128GB memory is inducing wait-states? Still, that shouldn’t affect screen FPS when using a modern video card. And this may be related to that weird dbus timeout in KDE.
BTW: Thanks for the interest.
It’s a long shot, but the next time you’re powered down it would be worth
pulling one new module and one old. Compare them very carefully to see if the
chips are all identical… exactly identical. Check any other little component
parts on the module too. If you bought the two lots at different times they
may have the same part # but, having been built at different times, not have
identical components. It’s not that common with a firm like Kingston but very
common with smaller module makers.
In the unlikely event you find a difference then DenverD’s suggestion would
indicate whether or not you have an incompatibility problem with the new RAM.
Sorry for the slow reply! (it’s a server after all ;)) but I just swapped out the original batch of 128GB Kingston ECC registered with the new batch (in the same slots) - and I can verify that KDE is back up to speed. So it appears that it’s the size of the memory, not which batch of 128GB. So, again, the memory seems fine. I’m running my big 44-CPU OpenMPI CFD simulation and no memory errors are being reported.
I also looked closely at both batches - they look identical.
So it’s either a BIOS thing or a KDE/kernel thing. I’m not really skilled enough to build my own kernel. I could try upgrading to 12.3 or is there some simple way to just upgrade the kernel?
Patti