HelloI am newbie to HA and DRBD, but was confronted with a serious problem without any chance of trying to reproduce it.DRBD is used here to mirror /var/data.Looking at the logs of the primary server, it shutdown for what ever reason, and notified the secondary server to take overwhich started to correctly promote the secondary to the primary node followed by the application services.Then the resource is released. I am not sure if the error code 239 (see logs) from cryptosetup (called from Crypto script) provoked this release which must have stopped the services, but ‘CRIT: Giving up resources due to failure’ seems to be a strong beedidate.Supposedly, 239 states that the resource had been loaded.What’s confusing to me is the order of events. Can anyone shed some light on this ? What’s odd to me is that this error occured only 3 seconds after the primary server shutdown completely. Probably just a coincidence.OS 10.3 (2.6.22)Heartbeat 2.0.7DRBD 8.0.6
Jun 25 11:43:13 sms1 shutdown[6718]: shutting down for system haltJun 25 11:43:14 sms1 init: Switching to runlevel: 0....Jun 25 11:43:24 sms1 ResourceManager[6862]: debug: Starting /etc/ha.d/resource.d/Filesystem /dev/mapper/cr_drbd0 /var/data ext3 acl,user_xattr stopJun 25 11:43:24 sms1 Filesystem[7119]: INFO: Running stop for /dev/mapper/cr_drbd0 on /var/dataJun 25 11:43:24 sms1 Filesystem[7119]: INFO: Trying to unmount /var/dataJun 25 11:43:24 sms1 Filesystem[7119]: INFO: unmounted /var/data successfullyJun 25 11:43:24 sms1 Filesystem[7116]: INFO: SuccessJun 25 11:43:24 sms1 ResourceManager[6862]: debug: /etc/ha.d/resource.d/Filesystem /dev/mapper/cr_drbd0 /var/data ext3 acl,user_xattr stop done. RC=0Jun 25 11:43:24 sms1 ResourceManager[6862]: info: Running /etc/ha.d/resource.d/Crypto cr_drbd0 /dev/drbd0 /etc/key.cr_drbd0 noauto stopJun 25 11:43:24 sms1 ResourceManager[6862]: debug: Starting /etc/ha.d/resource.d/Crypto cr_drbd0 /dev/drbd0 /etc/key.cr_drbd0 noauto stopJun 25 11:43:24 sms1 ResourceManager[6862]: debug: /etc/ha.d/resource.d/Crypto cr_drbd0 /dev/drbd0 /etc/key.cr_drbd0 noauto stop done. RC=0Jun 25 11:43:24 sms1 ResourceManager[6862]: info: Running /etc/ha.d/resource.d/drbddisk drbd0 stopJun 25 11:43:24 sms1 ResourceManager[6862]: debug: Starting /etc/ha.d/resource.d/drbddisk drbd0 stopJun 25 11:43:24 sms1 kernel: drbd0: role( Primary -> Secondary ) ....Jun 25 11:43:26 sms1 heartbeat: [3831]: info: sms1 Heartbeat shutdown complete.Jun 25 11:43:26 sms1 logd: [7262]: debug: Stopping ha_logd with pid 3772Jun 25 11:43:26 sms1 logd: [7262]: info: Waiting for pid=3772 to exitJun 25 11:43:26 sms1 logd: [3772]: debug: logd_term_action: received SIGTERMJun 25 11:43:26 sms1 logd: [3772]: debug: logd_term_action: waiting for 0 messages to be read by write processJun 25 11:43:26 sms1 logd: [3772]: debug: logd_term_action: sending SIGTERM to write processJun 25 11:43:26 sms1 logd: [3787]: info: logd_term_write_action: received SIGTERMJun 25 11:43:26 sms1 logd: [3787]: debug: Writing out 0 messages then quittingJun 25 11:43:26 sms1 logd: [3787]: info: ha_logd: Exiting write processJun 25 11:43:27 sms1 logd: [7262]: info: Pid 3772 exitedJun 25 11:43:27 sms1 kernel: drbd0: peer( Primary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown ) Jun 25 11:43:27 sms1 kernel: drbd0: short read receiving data: read 1680 expected 4096Jun 25 11:43:27 sms1 kernel: drbd0: error receiving Data, l: 4120!Jun 25 11:43:27 sms1 kernel: drbd0: asender terminatedJun 25 11:43:27 sms1 kernel: drbd0: tl_clear()Jun 25 11:43:27 sms1 kernel: drbd0: Connection closedJun 25 11:43:27 sms1 kernel: drbd0: Writing meta data super block now.Jun 25 11:43:27 sms1 kernel: drbd0: conn( Disconnecting -> StandAlone ) Jun 25 11:43:27 sms1 kernel: drbd0: receiver terminatedJun 25 11:43:27 sms1 kernel: drbd0: disk( UpToDate -> Diskless ) Jun 25 11:43:27 sms1 kernel: drbd0: drbd_bm_resize called with capacity == 0Jun 25 11:43:27 sms1 kernel: drbd0: worker terminatedJun 25 11:43:27 sms1 kernel: drbd: module cleanup done.Jun 25 11:43:27 sms1 sshd[3600]: Received signal 15; terminating.....
Jun 25 11:43:24 sms2 kernel: drbd0: peer( Primary -> Secondary ) Jun 25 11:43:24 sms2 kernel: klogd 1.4.1, ---------- state change ---------- Jun 25 11:43:24 sms2 heartbeat: [3855]: info: Received shutdown notice from 'sms1'.Jun 25 11:43:24 sms2 heartbeat: [3855]: info: Resources being acquired from sms1.Jun 25 11:43:24 sms2 heartbeat: [3855]: debug: StartNextRemoteRscReq(): child count 1Jun 25 11:43:24 sms2 heartbeat: [27727]: info: acquire all HA resources (standby).Jun 25 11:43:24 sms2 heartbeat: [27728]: info: No local resources [/usr/lib/heartbeat/ResourceManager listkeys sms2] to acquire.Jun 25 11:43:24 sms2 heartbeat: [3855]: debug: StartNextRemoteRscReq(): child count 1Jun 25 11:43:24 sms2 ResourceManager[27747]: info: Acquiring resource group: sms1 IPaddr::xx.xx.xx.xx drbddisk::drbd0 Crypto::cr_drbd0::/dev/drbd0::/etc/key.cr_drbd0::noauto Filesystem::/dev/mapper/cr_drbd0::/var/data::ext3::acl,user_xattr sms_haJun 25 11:43:24 sms2 IPaddr[27771]: INFO: Resource is stoppedJun 25 11:43:24 sms2 ResourceManager[27747]: info: Running /etc/ha.d/resource.d/IPaddr xx.xx.xx.xx startJun 25 11:43:24 sms2 ResourceManager[27747]: debug: Starting /etc/ha.d/resource.d/IPaddr xx.xx.xx.xx start...Jun 25 11:43:25 sms2 IPaddr[27814]: INFO: SuccessJun 25 11:43:25 sms2 ResourceManager[27747]: debug: /etc/ha.d/resource.d/IPaddr xx.xx.xx.xx start done. RC=0Jun 25 11:43:25 sms2 ResourceManager[27747]: info: Running /etc/ha.d/resource.d/drbddisk drbd0 startJun 25 11:43:25 sms2 ResourceManager[27747]: debug: Starting /etc/ha.d/resource.d/drbddisk drbd0 startJun 25 11:43:25 sms2 kernel: drbd0: role( Secondary -> Primary ) Jun 25 11:43:25 sms2 kernel: drbd0: Writing meta data super block now.Jun 25 11:43:25 sms2 ResourceManager[27747]: debug: /etc/ha.d/resource.d/drbddisk drbd0 start done. RC=0Jun 25 11:43:25 sms2 ResourceManager[27747]: info: Running /etc/ha.d/resource.d/Crypto cr_drbd0 /dev/drbd0 /etc/key.cr_drbd0 noauto startJun 25 11:43:25 sms2 ResourceManager[27747]: debug: Starting /etc/ha.d/resource.d/Crypto cr_drbd0 /dev/drbd0 /etc/key.cr_drbd0 noauto startJun 25 11:43:26 sms2 ResourceManager[27747]: debug: /etc/ha.d/resource.d/Crypto cr_drbd0 /dev/drbd0 /etc/key.cr_drbd0 noauto start done. RC=0Jun 25 11:43:26 sms2 Filesystem[28000]: INFO: Resource is stoppedJun 25 11:43:26 sms2 ResourceManager[27747]: info: Running /etc/ha.d/resource.d/Filesystem /dev/mapper/cr_drbd0 /var/data ext3 acl,user_xattr startJun 25 11:43:26 sms2 ResourceManager[27747]: debug: Starting /etc/ha.d/resource.d/Filesystem /dev/mapper/cr_drbd0 /var/data ext3 acl,user_xattr startJun 25 11:43:26 sms2 Filesystem[28041]: INFO: Running start for /dev/mapper/cr_drbd0 on /var/dataJun 25 11:43:26 sms2 kernel: kjournald starting. Commit interval 5 secondsJun 25 11:43:26 sms2 kernel: EXT3-fs warning: checktime reached, running e2fsck is recommendedJun 25 11:43:26 sms2 kernel: EXT3 FS on dm-0, internal journalJun 25 11:43:26 sms2 kernel: EXT3-fs: mounted filesystem with ordered data mode.Jun 25 11:43:26 sms2 Filesystem[28038]: INFO: SuccessJun 25 11:43:26 sms2 ResourceManager[27747]: debug: /etc/ha.d/resource.d/Filesystem /dev/mapper/cr_drbd0 /var/data ext3 acl,user_xattr start done. RC=0Jun 25 11:43:26 sms2 logger: /etc/ha.d/resource.d/sms_ha status was calledJun 25 11:43:26 sms2 ResourceManager[27747]: info: Running /etc/ha.d/resource.d/sms_ha startJun 25 11:43:26 sms2 ResourceManager[27747]: debug: Starting /etc/ha.d/resource.d/sms_ha startJun 25 11:43:26 sms2 logger: /etc/ha.d/resource.d/sms_ha start was calledJun 25 11:43:26 sms2 su: (to lsvadb_3.0-2) root on noneJun 25 11:43:26 sms2 su: (to lsvadb_3.0-2) root on noneJun 25 11:43:27 sms2 kernel: drbd0: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown ) Jun 25 11:43:27 sms2 kernel: drbd0: Creating new current UUIDJun 25 11:43:27 sms2 kernel: drbd0: Writing meta data super block now.Jun 25 11:43:27 sms2 kernel: drbd0: meta connection shut down by peer.Jun 25 11:43:27 sms2 kernel: drbd0: asender terminatedJun 25 11:43:27 sms2 kernel: drbd0: tl_clear()Jun 25 11:43:27 sms2 kernel: drbd0: Connection closedJun 25 11:43:27 sms2 kernel: drbd0: conn( TearDown -> Unconnected ) Jun 25 11:43:27 sms2 kernel: drbd0: receiver terminatedJun 25 11:43:27 sms2 kernel: drbd0: receiver (re)startedJun 25 11:43:27 sms2 kernel: drbd0: conn( Unconnected -> WFConnection ) Jun 25 11:43:27 sms2 kernel: JBD: barrier-based sync failed on dm-0 - disabling barriersJun 25 11:43:29 sms2 su: (to lsva) root on none...Jun 25 11:43:29 sms2 etnetclient[28225]: Login successfully doneJun 25 11:43:29 sms2 etnetclient[28225]: DBConn::Connect ...Jun 25 11:43:30 sms2 ResourceManager[27747]: debug: /etc/ha.d/resource.d/sms_ha start done. RC=0Jun 25 11:43:30 sms2 heartbeat: [27727]: info: all HA resource acquisition completed (standby).Jun 25 11:43:30 sms2 heartbeat: [3855]: info: Standby resource acquisition done [all].Jun 25 11:43:30 sms2 heartbeat: [28288]: debug: notify_world: setting SIGCHLD Handler to SIG_DFLJun 25 11:43:30 sms2 harc[28288]: info: Running /etc/ha.d/rc.d/status statusJun 25 11:43:30 sms2 mach_down[28307]: info: Taking over resource group IPaddr::xx.xx.xx.xxJun 25 11:43:30 sms2 ResourceManager[28330]: info: Acquiring resource group: sms1 IPaddr::xx.xx.xx.xx drbddisk::drbd0 Crypto::cr_drbd0::/dev/drbd0::/etc/key.cr_drbd0::noauto Filesystem::/dev/mapper/cr_drbd0::/var/data::ext3::acl,user_xattr sms_haJun 25 11:43:30 sms2 IPaddr[28354]: INFO: Running OKJun 25 11:43:30 sms2 ResourceManager[28330]: info: Running /etc/ha.d/resource.d/Crypto cr_drbd0 /dev/drbd0 /etc/key.cr_drbd0 noauto startJun 25 11:43:30 sms2 ResourceManager[28330]: debug: Starting /etc/ha.d/resource.d/Crypto cr_drbd0 /dev/drbd0 /etc/key.cr_drbd0 noauto startJun 25 11:43:30 sms2 ResourceManager[28330]: debug: /etc/ha.d/resource.d/Crypto cr_drbd0 /dev/drbd0 /etc/key.cr_drbd0 noauto start done. RC=239Jun 25 11:43:30 sms2 ResourceManager[28330]: ERROR: Return code 239 from /etc/ha.d/resource.d/CryptoJun 25 11:43:30 sms2 ResourceManager[28330]: CRIT: Giving up resources due to failure of Crypto::cr_drbd0::/dev/drbd0::/etc/key.cr_drbd0::noautoJun 25 11:43:30 sms2 ResourceManager[28330]: info: Releasing resource group: sms1 IPaddr::xx.xx.xx.xx drbddisk::drbd0 Crypto::cr_drbd0::/dev/drbd0::/etc/key.cr_drbd0::noauto Filesystem::/dev/mapper/cr_drbd0::/var/data::ext3::acl,user_xattr sms_haJun 25 11:43:30 sms2 ResourceManager[28330]: info: Running /etc/ha.d/resource.d/sms_ha stopJun 25 11:43:30 sms2 ResourceManager[28330]: debug: Starting /etc/ha.d/resource.d/sms_ha stopJun 25 11:43:31 sms2 logger: /etc/ha.d/resource.d/sms_ha stop was called....Jun 25 11:43:33 sms2 su: (to lsvadb_3.0-2) root on noneJun 25 11:43:34 sms2 etnetclient[28225]: LOG WARNING: TEXT NOT INSERTED: Jun 25 11:43:34 sms2 etnetclient[28225]: LOG TEXT : Security Module Server Shut Down.Jun 25 11:43:34 sms2 etnetclient[28225]: land.cpp 7: security::Exception: Send Notification Mail: Fehler in Datenbank : Interner Fehler in Datenbank.(SqlState=YE000)Jun 25 11:43:34 sms2 etnetclient[28225]: land.cpp 7: SM-Server ist beendet.Jun 25 11:43:39 sms2 ResourceManager[28330]: debug: /etc/ha.d/resource.d/sms_ha stop done. RC=0Jun 25 11:43:39 sms2 ResourceManager[28330]: info: Running /etc/ha.d/resource.d/Filesystem /dev/mapper/cr_drbd0 /var/data ext3 acl,user_xattr stopJun 25 11:43:39 sms2 ResourceManager[28330]: debug: Starting /etc/ha.d/resource.d/Filesystem /dev/mapper/cr_drbd0 /var/data ext3 acl,user_xattr stopJun 25 11:43:39 sms2 Filesystem[28606]: INFO: Running stop for /dev/mapper/cr_drbd0 on /var/dataJun 25 11:43:39 sms2 Filesystem[28606]: INFO: Trying to unmount /var/dataJun 25 11:43:39 sms2 Filesystem[28606]: INFO: unmounted /var/data successfullyJun 25 11:43:39 sms2 Filesystem[28603]: INFO: SuccessJun 25 11:43:39 sms2 ResourceManager[28330]: debug: /etc/ha.d/resource.d/Filesystem /dev/mapper/cr_drbd0 /var/data ext3 acl,user_xattr stop done. RC=0Jun 25 11:43:39 sms2 ResourceManager[28330]: info: Running /etc/ha.d/resource.d/Crypto cr_drbd0 /dev/drbd0 /etc/key.cr_drbd0 noauto stopJun 25 11:43:39 sms2 ResourceManager[28330]: debug: Starting /etc/ha.d/resource.d/Crypto cr_drbd0 /dev/drbd0 /etc/key.cr_drbd0 noauto stopJun 25 11:43:39 sms2 ResourceManager[28330]: debug: /etc/ha.d/resource.d/Crypto cr_drbd0 /dev/drbd0 /etc/key.cr_drbd0 noauto stop done. RC=0Jun 25 11:43:39 sms2 ResourceManager[28330]: info: Running /etc/ha.d/resource.d/drbddisk drbd0 stopJun 25 11:43:39 sms2 ResourceManager[28330]: debug: Starting /etc/ha.d/resource.d/drbddisk drbd0 stopJun 25 11:43:39 sms2 kernel: drbd0: role( Primary -> Secondary ) Jun 25 11:43:39 sms2 kernel: drbd0: Writing meta data super block now.Jun 25 11:43:39 sms2 ResourceManager[28330]: debug: /etc/ha.d/resource.d/drbddisk drbd0 stop done. RC=0Jun 25 11:43:39 sms2 ResourceManager[28330]: info: Running /etc/ha.d/resource.d/IPaddr xx.xx.xx.xx stopJun 25 11:43:39 sms2 ResourceManager[28330]: debug: Starting /etc/ha.d/resource.d/IPaddr xx.xx.xx.xx stopJun 25 11:43:39 sms2 avahi-daemon[3555]: Withdrawing address record for xx.xx.xx.xx on eth0.Jun 25 11:43:39 sms2 IPaddr[28720]: INFO: SuccessJun 25 11:43:39 sms2 ResourceManager[28330]: debug: /etc/ha.d/resource.d/IPaddr xx.xx.xx.xx stop done. RC=0Jun 25 11:43:39 sms2 mach_down[28307]: info: /usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquiredJun 25 11:43:39 sms2 mach_down[28307]: info: mach_down takeover complete for node sms1.Jun 25 11:43:39 sms2 heartbeat: [3855]: info: mach_down takeover complete.Jun 25 11:43:56 sms2 heartbeat: [3855]: WARN: node sms1: is deadJun 25 11:43:56 sms2 heartbeat: [3855]: info: Dead node sms1 gave up resources.Jun 25 11:43:56 sms2 heartbeat: [3855]: info: Link sms1:eth0 dead.Jun 25 11:44:01 sms2 /usr/sbin/cron[28765]: (lsva) CMD (${HOME}/scripts/lsvascpcron)Jun 25 11:44:09 sms2 hb_standby[28777]: Going standby [foreign].Jun 25 11:44:09 sms2 heartbeat: [3855]: info: sms2 wants to go standby [foreign]Jun 25 11:44:19 sms2 heartbeat: [3855]: WARN: No reply to standby request. Standby request cancelled.