DLM service cannot start (without Pacemaker)

Hello there,

I am trying to start the dlm service (a storage cluster with Corosync and GFS2), but it gets stuck in the "activating" state and never becomes "active". The service is configured with Type=notify by default, so systemd waits for a READY=1 message, but it never arrives. If I change the type to "simple" on all of the nodes, the services all seem to work well.
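
(For reference, roughly how I changed it on each node, via a drop-in so the packaged unit file stays untouched; treat this as a sketch:)

systemctl edit dlm.service

# this opens /etc/systemd/system/dlm.service.d/override.conf; the override contains:
[Service]
Type=simple

systemctl daemon-reload
systemctl restart dlm.service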

So I don't know why it won't start normally with the default settings.

Here is the output of the service (when it times out):

Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 node_config 3
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 found /dev/misc/dlm-control minor 122
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 found /dev/misc/dlm-monitor minor 121
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 found /dev/misc/dlm_plock minor 120
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 set log_debug 1
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 set mark 0
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 set protocol 1
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 set /proc/sys/net/core/rmem_default 4194304
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 set /proc/sys/net/core/rmem_max 4194304
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 set recover_callbacks 1
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 cmap totem.cluster_name = 'gitlab_storage'
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 set cluster_name gitlab_storage
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 /dev/misc/dlm-monitor fd 13
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 cluster quorum 1 seq 508 nodes 3
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 cluster node 1 added seq 508
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 set_configfs_node 1 10.51.38.66 local 0 mark 0
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 cluster node 2 added seq 508
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 set_configfs_node 2 10.51.38.69 local 1 mark 0
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 cluster node 3 added seq 508
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 set_configfs_node 3 10.51.38.92 local 0 mark 0
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 cpg_join dlm:controld ...
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 setup_cpg_daemon 15
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 dlm:controld conf 3 1 0 memb 1 2 3 join 2 left 0
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 daemon joined 1
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 daemon joined 2
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 daemon joined 3
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 dlm:controld ring 1:508 3 memb 1 2 3
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 receive_protocol 1 max 3.1.1.0 run 3.1.1.0
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 daemon node 1 prot max 0.0.0.0 run 0.0.0.0
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 daemon node 1 save max 3.1.1.0 run 3.1.1.0
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 run protocol from nodeid 1
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 daemon run 3.1.1 max 3.1.1 kernel run 1.1.1 max 1.1.1
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 plocks 16
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 receive_fence_clear from 1 for 2 result 0 flags 6
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 clear_startup_nodes 3
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 fence_in_progress_unknown 0 recv
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 receive_protocol 3 max 3.1.1.0 run 3.1.1.0
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 daemon node 3 prot max 0.0.0.0 run 0.0.0.0
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 daemon node 3 save max 3.1.1.0 run 3.1.1.0
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 receive_protocol 2 max 3.1.1.0 run 0.0.0.0
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 daemon node 2 prot max 0.0.0.0 run 0.0.0.0
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 daemon node 2 save max 3.1.1.0 run 0.0.0.0
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 receive_protocol 2 max 3.1.1.0 run 3.1.1.0
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 daemon node 2 prot max 3.1.1.0 run 0.0.0.0
Feb 12 14:29:39 hun25-10v dlm_controld[20194]: 2284 daemon node 2 save max 3.1.1.0 run 3.1.1.0
Feb 12 14:31:09 hun25-10v systemd[1]: dlm.service: start operation timed out. Terminating.
Feb 12 14:31:09 hun25-10v dlm_controld[20194]: 2374 helper pid 20195 term signal 15
Feb 12 14:31:09 hun25-10v dlm_controld[20194]: 2374 helper pid 20195 term signal 15
Feb 12 14:31:09 hun25-10v dlm_controld[20194]: 2374 shutdown
Feb 12 14:31:09 hun25-10v dlm_controld[20194]: 2374 cpg_leave dlm:controld ...
Feb 12 14:31:09 hun25-10v dlm_controld[20194]: 2374 clear_configfs_nodes rmdir "/sys/kernel/config/dlm/cluster/comms/3"
Feb 12 14:31:09 hun25-10v dlm_controld[20194]: 2374 clear_configfs_nodes rmdir "/sys/kernel/config/dlm/cluster/comms/2"
Feb 12 14:31:09 hun25-10v dlm_controld[20194]: 2374 clear_configfs_nodes rmdir "/sys/kernel/config/dlm/cluster/comms/1"
Feb 12 14:31:09 hun25-10v systemd[1]: dlm.service: Failed with result 'timeout'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ The unit dlm.service has entered the 'failed' state with result 'timeout'.

corosync.conf


totem {

        version: 2

        # Set name of the cluster
        cluster_name: gitlab_storage
        # crypto_cipher and crypto_hash: Used for mutual node authentication.
        # If you choose to enable this, then do remember to create a shared
        # secret with "corosync-keygen".
        # enabling crypto_cipher, requires also enabling of crypto_hash.
        # crypto works only with knet transport
        crypto_cipher: none
        crypto_hash: none
}



logging {

        # Log the source file and line where messages are being
        # generated. When in doubt, leave off. Potentially useful for
        # debugging.
        fileline: off
        # Log to standard error. When in doubt, set to yes. Useful when
        # running in the foreground (when invoking "corosync -f")
        to_stderr: yes
        # Log to a log file. When set to "no", the "logfile" option
        # must not be set.
        to_logfile: yes
        logfile: /var/log/cluster/corosync.log
        # Log to the system log daemon. When in doubt, set to yes.
        to_syslog: yes
        # Log debug messages (very verbose). When in doubt, leave off.
        debug: off
        # Log messages with time stamps. When in doubt, set to hires (or on)
        #timestamp: hires
        logger_subsys {

                subsys: QUORUM
                debug: off
        }
}

quorum {

        # Enable and configure quorum subsystem (default: off)
        # see also corosync.conf.5 and votequorum.5
        provider: corosync_votequorum

}

nodelist {

        # Change/uncomment/add node sections to match cluster configuration
        node {

                # Hostname of the node
                name: node1
                # Cluster membership node identifier
                nodeid: 1
                # Address of first link
                ring0_addr: 10.51.38.66
                # When knet transport is used it's possible to define up to 8 links
                #ring1_addr: 192.168.1.1
        }

        node {
#               # Hostname of the node
                name: node2
#               # Cluster membership node identifier
                nodeid: 2
#               # Address of first link
                ring0_addr: 10.51.38.69
#               # When knet transport is used it's possible to define up to 8 links
#               #ring1_addr: 192.168.1.2
        }
 
        node {

#               # Hostname of the node
                name: node3
#               # Cluster membership node identifier
                nodeid: 3
#               # Address of first link
                ring0_addr: 10.51.38.92
#               # When knet transport is used it's possible to define up to 8 links
#               #ring1_addr: 192.168.1.2
        }
        # ...
}

dlm.conf

log_debug=1
daemon_debug=1
protocol=tcp
# Disable fencing (for now)
enable_fencing=0

dlm_controld sends READY=1 when it completes cluster initialization, so this indicates that initialization never completes.
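
(For context: with Type=notify, systemd keeps the unit in "activating" until the daemon sends READY=1 over the notification socket (sd_notify(3)), and terminates it when TimeoutStartSec expires, which matches the journal above. A minimal, hypothetical unit that shows only the mechanism, not the real dlm.service:)

[Service]
Type=notify
NotifyAccess=all
# stays in "activating" for 5 seconds, then reports readiness and becomes "active"
ExecStart=/bin/sh -c 'sleep 5; systemd-notify --ready; sleep infinity'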

@arvidjaar

So is it a problem with the Corosync config?

Here is my Corosync log:

[1297] hun25-04v corosyncnotice  [MAIN  ] Corosync Cluster Engine ('2.4.6'): started and ready to provide service.
[1297] hun25-04v corosyncinfo    [MAIN  ] Corosync built-in features: testagents systemd qdevices qnetd pie relro bindnow
[1297] hun25-04v corosyncnotice  [TOTEM ] Initializing transport (UDP/IP Multicast).
[1297] hun25-04v corosyncnotice  [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none
[1297] hun25-04v corosyncnotice  [TOTEM ] The network interface [10.51.38.66] is now up.
[1297] hun25-04v corosyncnotice  [SERV  ] Service engine loaded: corosync configuration map access [0]
[1297] hun25-04v corosyncinfo    [QB    ] server name: cmap
[1297] hun25-04v corosyncnotice  [SERV  ] Service engine loaded: corosync configuration service [1]
[1297] hun25-04v corosyncinfo    [QB    ] server name: cfg
[1297] hun25-04v corosyncnotice  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
[1297] hun25-04v corosyncinfo    [QB    ] server name: cpg
[1297] hun25-04v corosyncnotice  [SERV  ] Service engine loaded: corosync profile loading service [4]
[1297] hun25-04v corosyncnotice  [QUORUM] Using quorum provider corosync_votequorum
[1297] hun25-04v corosyncnotice  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
[1297] hun25-04v corosyncinfo    [QB    ] server name: votequorum
[1297] hun25-04v corosyncnotice  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
[1297] hun25-04v corosyncinfo    [QB    ] server name: quorum
[1297] hun25-04v corosyncnotice  [TOTEM ] A new membership (10.51.38.66:526) was formed. Members joined: 1
[1297] hun25-04v corosyncwarning [TOTEM ] Discarding JOIN message during flush, nodeid=3
[1297] hun25-04v corosyncnotice  [QUORUM] Members[1]: 1
[1297] hun25-04v corosyncnotice  [MAIN  ] Completed service synchronization, ready to provide service.
[1297] hun25-04v corosyncnotice  [TOTEM ] A new membership (10.51.38.66:530) was formed. Members joined: 3
[1297] hun25-04v corosyncnotice  [QUORUM] This node is within the primary component and will provide service.
[1297] hun25-04v corosyncnotice  [QUORUM] Members[2]: 1 3
[1297] hun25-04v corosyncnotice  [MAIN  ] Completed service synchronization, ready to provide service.
[1297] hun25-04v corosyncnotice  [TOTEM ] A new membership (10.51.38.66:534) was formed. Members joined: 2
[1297] hun25-04v corosyncnotice  [QUORUM] Members[3]: 1 2 3
[1297] hun25-04v corosyncnotice  [MAIN  ] Completed service synchronization, ready to provide service.

@arvidjaar

What do you think, is it safe to run it with Type=simple if I don't install Pacemaker?
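
(If it helps, I can also post this from the nodes while they are running in simple mode:)

dlm_tool status
dlm_tool ls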