Hi.

The environment is a two-node OpenSUSE 11.3 cluster, plus the cluster bits, running ocfs2 1.6.3-0.13.9 and dlm 3.00.01-7.34 with cfs enabled.

This cluster is running fine, but I did have an issue in a customer environment running SUSE 11 SP1 with clustering, where a locking issue between cluster members was keeping access to a directory locked out.

In the course of that analysis I noted a process holding a PR lock on the same lock name, but the pid it was attached to was non-existent. That same non-existent pid was found to be associated with over 70k locks after doing a dlm_tool dump over every lock structure on the system.

I decided to check this much newer box to see if it showed the same thing, and it did.

This cluster has had its members removed and re-added for patching, reboots and so on over time, but the cluster itself has been up for a very long time.

So what do you think: is it intended behavior for a non-existent pid to pick up so many locks, or is something else brewing?

Thanks for looking!




falcon:~ # ls -la /sys/kernel/debug/dlm/ | egrep -v '_|clv' | awk '{print $9}' | grep -v '\.' | grep -v '^$' | while read x; do dlm_tool lockdump $x >> /var/tmp/dump_locks ; done

falcon:~ # wc -l /var/tmp/dump_locks
151287 /var/tmp/dump_locks
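
The lockspace enumeration above parses `ls -la` output, which is a little fragile. A minimal sketch of the same filter using a shell glob follows. It assumes the same rule as the pipeline (skip names containing "_", "clv" or "."), and `list_lockspaces` is a hypothetical helper name, not part of dlm_tool.

```shell
# List candidate lockspace names from the DLM debugfs directory.
# Assumption: mirrors the filter used in the pipeline above, i.e. skip
# any entry whose name contains "_", "clv" or ".".
# list_lockspaces is a hypothetical helper name.
list_lockspaces() {
    dir=${1:-/sys/kernel/debug/dlm}
    for f in "$dir"/*; do
        [ -e "$f" ] || continue      # directory empty or glob did not match
        name=${f##*/}                # strip the directory part
        case $name in
            *_*|*clv*|*.*) continue ;;
        esac
        printf '%s\n' "$name"
    done
}

# Possible usage, appending every dump to one file as in the post:
# list_lockspaces | while read -r x; do
#     dlm_tool lockdump "$x" >> /var/tmp/dump_locks
# done
```

This should give the same names as the original one-liner on a typical debugfs layout, but it has only been reasoned through, not run against a live cluster.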

falcon:~ # awk '{print $8}' /var/tmp/dump_locks | sort | uniq -c | sort -g | tail -10
4 7551
4 7553
4 7554
4 7586
4 7587
7 7548
26 4244
28 7544
2567 5536
148595 6083 <<< ???

falcon:~ # ps -ef | grep 6083
root 31223 3735 0 14:23 pts/0 00:00:00 grep 6083
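
The two manual steps above (tally the owner pids, then grep `ps` for one of them) could be combined so that every pid in the dump is checked at once. This is only a sketch under stated assumptions: field 8 is taken to be the owner pid, as in the awk command above, a pid is treated as live if `/proc/<pid>` exists on the local node, and `tally_lock_owners` is a hypothetical helper name.

```shell
# Print "count pid state" for every owner pid found in a lock dump.
# Assumptions: field 8 holds the owner pid (as in the awk command above),
# and a pid counts as "live" if /proc/<pid> exists on this node.
# tally_lock_owners is a hypothetical helper name.
tally_lock_owners() {
    dump=${1:-/var/tmp/dump_locks}
    # Keep only lines whose 8th field is numeric, then count per pid.
    awk '$8 ~ /^[0-9]+$/ { print $8 }' "$dump" | sort | uniq -c | sort -n |
    while read -r count pid; do
        if [ -d "/proc/$pid" ]; then
            state=live
        else
            state=missing
        fi
        printf '%s %s %s\n' "$count" "$pid" "$state"
    done
}

# Possible usage:
# tally_lock_owners /var/tmp/dump_locks | tail -10
```

The check only looks at the local process table, so a pid reported as missing here says nothing about processes on the other node.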

Node 2:

harrier:~ # ls -la /sys/kernel/debug/dlm/ | egrep -v '_|clv' | awk '{print $9}' | grep -v '\.' | grep -v '^$' | while read x; do dlm_tool lockdump $x >> /var/tmp/dump_locks ; done

harrier:~ # wc -l /var/tmp/dump_locks
1268 /var/tmp/dump_locks

harrier:~ # awk '{print $8}' /var/tmp/dump_locks | sort | uniq -c | sort -g | tail -10
45 4847
48 8782
51 4792
54 10612
66 19437
76 4846
93 10617
123 10619
156 4817
265 10616

harrier:~ # ps -ef | grep 10616
daemon 10616 4832 0 06:29 ? 00:00:00 /usr/local/apache/bin/httpd -DSTATUS -f /usr/local/apache/conf/httpd.conf
root 13778 3580 0 14:27 pts/0 00:00:00 grep 10616


Shutting down the cluster to see how things look after the cluster is recreated.

harrier:~ # rcopenais stop
Stopping OpenAIS/Corosync daemon (corosync): ...........Stopping SBD - OK

falcon:~ # rcopenais stop
Stopping OpenAIS/Corosync daemon (corosync): ............Stopping SBD - OK

harrier:~ # rcopenais start
Starting OpenAIS/Corosync daemon (corosync): Starting SBD - starting... OK

falcon:~ # rcopenais start
Starting OpenAIS/Corosync daemon (corosync): Starting SBD - starting... OK


After restarting the cluster and removing the previous lock dumps:


falcon:~ # ls -la /sys/kernel/debug/dlm/ | egrep -v '_|clv' | awk '{print $9}' | grep -v '\.' | grep -v '^$' | while read x; do dlm_tool lockdump $x >> /var/tmp/dump_locks ; done

falcon:~ # wc -l /var/tmp/dump_locks
31 /var/tmp/dump_locks

falcon:~ # awk '{print $8}' /var/tmp/dump_locks | sort | uniq -c | sort -g | tail -10
2 1860
2 1862
2 1878
4 2174
21 1605

harrier:~ # ls -la /sys/kernel/debug/dlm/ | egrep -v '_|clv' | awk '{print $9}' | grep -v '\.' | grep -v '^$' | while read x; do dlm_tool lockdump $x >> /var/tmp/dump_locks ; done

harrier:~ # wc -l /var/tmp/dump_locks
311 /var/tmp/dump_locks

harrier:~ # awk '{print $8}' /var/tmp/dump_locks | sort | uniq -c | sort -g | tail -10
3 15899
4 15959
5 15949
6 15890
6 16244
13 16043
15 15847
22 15387
62 15917
160 16237