rsync freezes whole computer without error messages

Dear openSUSE community,

I am currently facing a problem with rsync (or maybe ssh, I don’t know yet). I have daily backups, run from cron, that use rsync to copy from my computer (the client) to a NAS. Every time one of these backups runs, my computer freezes completely (no remote login possible, nothing), so I have to do a hard reboot. As you might imagine, I cannot do this every day.

In /var/log/messages there is just a line like this:
Feb 26 17:23:53 eggplant rsyslogd: -- MARK --

right before the hang. The rsync worked well before, and I don’t really know what changed in my system configuration.

In /var/log/rsyncd.log I sometimes (but not every time) get:

2011/02/26 16:39:19 [13229] rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(543) [Receiver=3.0.7]

I already updated the default rsync in openSUSE 11.3 to 3.0.7.7, but it does not help.

How can I at least find out what is going wrong? I am trying to track down the problem but don’t know where to start.

Finally, here is the client-side log (with ssh in verbose mode) of this command:

rsync -av --delete --timeout=180 -e "ssh -v -p xxxxxx -i /root/.ssh/rsync-key_eggplant" /mds root@xxxxxxxx:/volume1/autoBackup/eggplant

OpenSSH_5.4p1, OpenSSL 1.0.0 29 Mar 2010
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Applying options for *
debug1: Connecting to xxxxxxxxxxxxx.
debug1: Connection established.
debug1: permanently_set_uid: 0/0
debug1: identity file /root/.ssh/rsync-key_eggplant type 2
debug1: identity file /root/.ssh/rsync-key_eggplant-cert type -1
debug1: Remote protocol version 2.0, remote software version OpenSSH_4.2
debug1: match: OpenSSH_4.2 pat OpenSSH_4*
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_5.4
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: server->client aes128-ctr hmac-md5 none
debug1: kex: client->server aes128-ctr hmac-md5 none
debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
debug1: Host ‘xxxxxxxxx’ is known and matches the RSA host key.
debug1: Found key in /root/.ssh/known_hosts:6
debug1: ssh_rsa_verify: signature correct
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: Roaming not allowed by server
debug1: SSH2_MSG_SERVICE_REQUEST sent
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey,password,keyboard-interactive
debug1: Next authentication method: publickey
debug1: Offering public key: /root/.ssh/rsync-key_eggplant
debug1: Server accepts key: pkalg ssh-dss blen 434
debug1: read PEM private key done: type DSA
debug1: Authentication succeeded (publickey).
debug1: channel 0: new [client-session]
debug1: Entering interactive session.
debug1: Sending environment.
debug1: Sending env LANG = en_US.UTF-8
debug1: Sending command: rsync --server -vlogDtpre.iLsf --timeout=180 --delete . /volume1/autoBackup/eggplant/
sending incremental file list

I saw that a few other people had similar problems with long file lists or huge files to sync (which is my case), but I couldn’t find a solution. Another odd thing is that I also rsync from my computer to another hard drive in the same computer, and there is no problem there.

Does anyone have an idea how to tackle this problem?

Thanks in advance.

Peter

On 2011-02-26 18:06, pschmidtke wrote:
>
> Dear opensuse community.
>
> I am currently facing a problem with rsync (or maybe ssh, I don’t know
> yet). I have daily backups that are done via cron using rsync from my
> computer (client) to a NAS. And systematically during these backups my
> computer completely freezes (no remote login possible, nothing), so I
> have to hard reboot. As you might imagine, I cannot do this every day.

I suspect a disk hardware error. First run the SMART long test on your disks,
then do a dd if=entire_disk of=/dev/null

Also suspect the memory: run the memory test from the DVD.
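A plain dd stops at the first read error unless told otherwise (conv=noerror). As a rough alternative, here is a minimal Python sketch that reads a whole disk in chunks, like the dd command above, but keeps going and records the offsets where reads failed. It assumes Linux and root access; the function name and the /dev/sdX device are placeholders, not anything from this thread.

```python
import os


def scan_for_read_errors(path, chunk_size=1024 * 1024):
    """Read `path` from start to end, like `dd if=path of=/dev/null`,
    but keep going after a failed read and remember where it failed.

    Returns (bytes_read, bad_offsets).
    """
    bytes_read = 0
    bad_offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size of a regular file or a block device.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            length = min(chunk_size, size - offset)
            try:
                data = os.pread(fd, length, offset)
            except OSError:
                # Typically EIO on an unreadable sector: note it and skip the chunk.
                bad_offsets.append(offset)
                offset += length
                continue
            if not data:
                break  # the device returned less data than its reported size
            bytes_read += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return bytes_read, bad_offsets
```

Calling scan_for_read_errors("/dev/sdX") returns the number of bytes read and the list of failing offsets; an empty list is what you hope to see. It complements the SMART long test rather than replacing it.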


Cheers / Saludos,

Carlos E. R.
(from 11.2 x86_64 “Emerald” at Telcontar)

> I am currently facing a problem with rsync (or maybe ssh, I don’t know yet). I have daily backups that are done via cron using rsync from my computer (client) to a NAS. And systematically during these backups my computer completely freezes (no remote login possible, nothing), so I have to hard reboot.

This happened to me a couple of days ago. The output to the log file stopped mid-word, and there was nothing in the syslog.

The disk drive is new, the memory is new and is ECC. The backup is local disk to local disk (using rsync, v3.0.7 protocol version 30, to backup the home directory). All drives are formatted ext4.

Really, the only recent change to the system is an upgrade to linux kernel v2.6.34.7-0.7-desktop x86_64. I’ve had other problems with the system becoming unstable after 2 - 5 days of uptime, all since the upgrade; this was the first time it totally froze, though.

jimoe666 wrote:

> The disk drive is new, the memory is new and is ECC. The backup is
> local disk to local disk (using rsync, v3.0.7 protocol version 30, to
> backup the home directory). All drives are formatted ext4.

Did you run memtest on that new memory? I find that a significant number of “new”
memory sticks won’t survive a complete test; infant mortality is a real
factor.


Will Honea

On 2011-02-27 07:36, jimoe666 wrote:

> The disk drive is new, the memory is new and is ECC.

All the more reason to test it.

> The backup is
> local disk to local disk (using rsync, v3.0.7 protocol version 30, to
> backup the home directory). All drives are formatted ext4.
>
> Really, the only recent change to the system is an upgrade to linux
> kernel v2.6.34.7-0.7-desktop x86_64. I’ve had other problems with the
> system becoming unstable after 2 - 5 days of uptime, all since the
> upgrade; this was the first time it totally froze, though.

Test disks and memory, please. Thoroughly.


Cheers / Saludos,

Carlos E. R.

Thanks Carlos, I appreciate the help. The extended smartctl test just finished without any errors. I am running dd now, although I don’t know what that does (I suppose it streams the entire content of my disk to /dev/null).

I’ll check the memory tomorrow, but the last time I checked with memtest (a year ago) it was OK. I’m already starting to think about other possible causes :wink:

Cheers.

Peter

OK, the dd run gives me this:

eggplant:/home/peter # dd if=/dev/sdb of=/dev/null
1953525168+0 records in
1953525168+0 records out
1000204886016 bytes (1.0 TB) copied, 13283.6 s, 75.3 MB/s

I suppose this means that the HDD is OK. Now the RAM.
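The numbers can be cross-checked. With no bs= option, dd uses 512-byte blocks, and 1953525168 × 512 = 1000204886016 bytes, which matches the byte count; the "+0" means there were no partial records, so every block was read in full. A small Python sketch (the function name is made up for illustration) that does the same check on a pasted dd summary:

```python
import re

DD_DEFAULT_BLOCK = 512  # dd's block size when no bs= option is given


def check_dd_summary(text, block_size=DD_DEFAULT_BLOCK):
    """Check the summary that dd prints when it finishes.

    Returns (complete, nbytes, mb_per_s). `complete` is True when there
    were no partial records, records in == records out, and the byte
    count equals full records * block size.
    """
    rec_in = re.search(r"(\d+)\+(\d+) records in", text)
    rec_out = re.search(r"(\d+)\+(\d+) records out", text)
    copied = re.search(r"(\d+) bytes .*copied, ([\d.]+) s", text)
    full_in, partial_in = int(rec_in.group(1)), int(rec_in.group(2))
    full_out, partial_out = int(rec_out.group(1)), int(rec_out.group(2))
    nbytes, seconds = int(copied.group(1)), float(copied.group(2))
    complete = (partial_in == 0 and partial_out == 0
                and full_in == full_out
                and full_in * block_size == nbytes)
    # dd's "MB/s" is decimal: 10**6 bytes per second.
    return complete, nbytes, nbytes / seconds / 1e6


summary = """1953525168+0 records in
1953525168+0 records out
1000204886016 bytes (1.0 TB) copied, 13283.6 s, 75.3 MB/s"""

complete, nbytes, rate = check_dd_summary(summary)
print(complete, nbytes, round(rate, 1))  # prints: True 1000204886016 75.3
```

In other words, the whole 1 TB disk was read without a short read, at about 75 MB/s.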

OK, I also checked the memory. It passed without any errors…
So now I’m open to non-hardware suggestions. Any ideas? Why does it happen when I sync to the NAS and not when I do it locally?

Thanks in advance.

Peter

On 2011-03-01 12:36, pschmidtke wrote:
>
> Ok, I also checked the memory. Passed without any errors…
> now I’m open to non hardware error suggestions. Any ideas? Why does it
> happen when I do it towards a NAS and not when I do it locally?

I don’t know…

Do you have another computer? Then you can leave an ssh session open from it
and see whether that session locks up or not. Ping, etc.


Cheers / Saludos,

Carlos E. R.

I can access the faulty computer via ssh without any problem. The freezing only happens with rsync, and once when I was manually copying a huge directory via scp.

On 2011-03-01 15:06, pschmidtke wrote:
>
> I can access to the faulty computer via ssh without any problem. The
> freezing just happens with rsync and also once when I was copying (via
> scp) manually a huge directory.

I meant: check whether you can still ssh in while (or after) the lock-up occurs.

Another thing would be to run the rsync session in verbose mode and see
which file it freezes on, and whether it is the same file each time.
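If rsync is run with --log-file=FILE, comparing the last logged file across several frozen runs shows whether it is always the same one. A minimal Python sketch, assuming rsync's default log-file format ("%i %n%L", written after a date, time and [pid] prefix) with the 11-character itemize strings of rsync 3.0; the function names are hypothetical. Keep in mind that after a hard freeze the last few log lines may never have reached the disk, so the result is approximate.

```python
import re
from collections import Counter

# One itemized line of an rsync --log-file log, assuming the default
# log-file format "%i %n%L": date, time, [pid], 11-char itemize string, name.
ITEM_LINE = re.compile(
    r"^\d{4}/\d\d/\d\d \d\d:\d\d:\d\d \[\d+\] ([<>ch.][fdLDS]\S{9}) (.+)$",
    re.MULTILINE,
)


def last_item(log_text):
    """Return the last file name rsync logged, or None if there is none."""
    last = None
    for match in ITEM_LINE.finditer(log_text):
        last = match.group(2)
    return last


def last_items(log_texts):
    """Count how often each file was the last one logged, across several runs."""
    return Counter(name for name in map(last_item, log_texts) if name is not None)
```

Feeding it the log of each run that ended in a freeze gives a count per file; one name dominating would point at a specific file, while a spread-out count (as reported later in this thread) points away from the data itself.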


Cheers / Saludos,

Carlos E. R.

Ah OK, I didn’t get that. When the computer freezes, everything freezes; no SSH login is possible, unfortunately. I am also logging rsync, and it often hangs on the same files, but not always. Sometimes it’s even on a completely different partition.

On 2011-03-01 19:06, pschmidtke wrote:
>
> Ah ok, I didn’t get that one. When the computer freezes, everything
> freezes. No SSH login possible, unfortunately.

I meant: ssh in before it crashes and leave the session open. When rsync
crashes, see if you can still type in the ssh session. Sometimes that works.
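From a second machine, a small watcher can record exactly when the frozen box stops answering, which can then be compared with the cron schedule and the rsync log. A minimal Python sketch; the host, port 22 and the 5-second interval are assumptions. It checks for the SSH greeting rather than only a TCP connection, because the kernel can complete a TCP handshake (or answer a ping) even when user space is no longer running.

```python
import socket
import time


def is_alive(host, port=22, timeout=3.0, banner=b"SSH-"):
    """True if host:port accepts a TCP connection and, when `banner` is set,
    the server sends a greeting starting with it (sshd sends "SSH-...")."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            return banner is None or conn.recv(256).startswith(banner)
    except OSError:  # refused, unreachable, or timed out
        return False


def watch(host, port=22, interval=5.0, max_checks=None, banner=b"SSH-", log=print):
    """Poll the host and log a timestamped line whenever its state changes.
    Returns the time of the last successful check (None if there was none)."""
    last_ok, was_up, checks = None, None, 0
    while max_checks is None or checks < max_checks:
        up = is_alive(host, port, banner=banner)
        if up:
            last_ok = time.time()
        if up != was_up:
            log(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {host}:{port} {'UP' if up else 'DOWN'}")
            was_up = up
        checks += 1
        if max_checks is None or checks < max_checks:
            time.sleep(interval)
    return last_ok
```

For example, watch("192.168.1.20") (placeholder address) left running in a terminal on the other machine prints an UP line at start and a DOWN line at the moment the freeze happens.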

> I am also logging rsync
> and often it hangs on the same files but not always. Sometimes it’s even
> on a completely different partition.

I can’t guess what is happening.

You say that a local rsync session with the same set of files works fine.
Could the network card be the problem?

Leave the computer showing VT10, the console where error messages are printed.
Perhaps something will appear there when it crashes, and you can read it.


Cheers / Saludos,

Carlos E. R.

I ran all of the hardware tests as well. All 3 drives tested OK; no errors were recorded, either then or in the past. The same goes for the memory: it went through 3 passes without error, with ECC turned off.

The system froze again two days ago, with the same timing: about 20 seconds after rsync started its operation. It is a complete freeze; the machine was nothing more than a space heater.

jimoe666 wrote:
> I ran through all of the hardware tests as well. All 3 drives tested
> okay; there were no errors recorded either then or in the past. The same
> for memory; it went through 3 passes without error, and with ECC turned
> off.
>
> The system froze again two days ago. Same time, shortly (20 seconds)
> after rsync started its operation. It is a complete freeze; it was
> nothing more than a space heater.

Since the problem occurred with scp as well as rsync, it seems that
the problem is not with rsync itself. It could be a problem with ssh, or with
something else in the underlying system.

Have you checked the versions of ssh and kernel on both your machine and
NAS? It may be worth trying more recent versions if they are older.
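The verbose ssh log in the first post already records both SSH versions: the client identifies itself as OpenSSH_5.4 and the NAS as OpenSSH_4.2. A tiny Python sketch (hypothetical helper names) that pulls them out of any ssh -v output so they can be compared; the kernel release on each machine is what uname -r prints.

```python
import re


def ssh_versions(debug_log):
    """Return (local, remote) SSH software versions found in `ssh -v` output."""
    local = re.search(r"Local version string SSH-[\d.]+-(\S+)", debug_log)
    remote = re.search(r"remote software version (\S+)", debug_log)
    return (local.group(1) if local else None,
            remote.group(1) if remote else None)


def openssh_release(version):
    """'OpenSSH_5.4p1' -> (5, 4); None if it is not an OpenSSH version string."""
    m = re.match(r"OpenSSH_(\d+)\.(\d+)", version or "")
    return (int(m.group(1)), int(m.group(2))) if m else None
```

Comparing openssh_release() of the two values makes it obvious which side is older.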

It might be a strange hardware problem (network, disks, mobo). Have you
googled for any possibilities?

You may get more information about the freeze if you export the kernel
log. It sometimes produces information that doesn’t make it onto the
screen or into the log file but there are ways to access such messages.
The traditional, and most reliable, way is to direct the kernel log
additionally to a serial port, and then connect another screen or system
to that port. It’s also possible to direct the kernel to send its log
entries over the net to another machine, where they can be logged to a
file. Sometimes these messages also get through.
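As a sketch of the network option: the kernel's netconsole module sends kernel messages as UDP packets (target port 6666 by default). On the machine that freezes it is loaded with something like modprobe netconsole netconsole=@/eth0,6666@192.168.1.10/ (the interface and receiver address are placeholders; check the netconsole documentation for your kernel). On the receiving machine, a minimal Python listener (hypothetical function names) that appends every packet to a file and flushes after each one:

```python
import socket
import time


def open_listener(port=6666, addr="0.0.0.0"):
    """Bind a UDP socket on the port netconsole sends to (6666 by default)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((addr, port))
    return sock


def receive(sock, log_path, max_messages=None):
    """Append every datagram to log_path with a timestamp and sender address,
    flushing after each one so nothing already received is lost."""
    count = 0
    with open(log_path, "a", encoding="utf-8") as log:
        while max_messages is None or count < max_messages:
            data, (sender, _) = sock.recvfrom(4096)
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            for line in data.decode("utf-8", "replace").splitlines():
                log.write(f"{stamp} {sender} {line}\n")
            log.flush()
            count += 1
    return count
```

For example, receive(open_listener(), "netconsole.log") runs until interrupted. Port 6666 is above 1024, so the listener does not need root.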

Cheers, Dave

On 2011-03-02 13:26, Dave Howorth wrote:

> You may get more information about the freeze if you export the kernel
> log. It sometimes produces information that doesn’t make it onto the
> screen or into the log file but there are ways to access such messages.
> The traditional, and most reliable, way is to direct the kernel log
> additionally to a serial port, and then connect another screen or system
> to that port.

It is the only reliable method, and it must be a real serial port on the
sending side, not a USB-to-serial converter. The serial console code in the
kernel is very basic and keeps working while the rest is almost crashed,
precisely because it is meant for this kind of debugging.

> It’s also possible to direct the kernel to send its log
> entries over the net to another machine, where they can be logged to a
> file. Sometimes these messages also get through.

Sometimes. If the network card crashes, you get nothing.


Cheers / Saludos,

Carlos E. R.

My situation is different from pschmidtke’s: I am copying from local disk to local disk, and the freeze does not happen every time. Maybe 1 out of 10 times?

OK, after all the hardware checks I re-enabled my daily backups (local and remote), and noticed that cron was running them at exactly the same time. I have now scheduled them at very different times, and so far I haven’t seen any more crashes (already 2 successful syncs without a crash, whereas before it crashed 100% of the time). Here are some details on what I was syncing.

The local sync covered the system partition (parts of /, such as /etc and /usr) and went to another local hard drive (/backup).
The other sync, which happened to run at the same time, copies various partitions (but not / and not /backup) to a remote server. Can these interfere with each other somehow?! It seems weird to me.
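Staggering the cron times avoids the overlap only as long as each job finishes before the next one starts. One way to guarantee it, sketched here in Python, is to have every backup job take the same exclusive lock before running rsync; the lock path and function name are assumptions. This only keeps the jobs from overlapping; it does not explain why the overlap froze the machine. The util-linux flock command does the same directly from a crontab line (flock /var/lock/backup.lock rsync ...).

```python
import fcntl
import subprocess

# Hypothetical lock file; every backup job must use the same path.
LOCK_PATH = "/var/lock/backup.lock"


def run_exclusive(cmd, lock_path=LOCK_PATH, wait=True):
    """Run `cmd` (a list of arguments) while holding an exclusive flock.

    With wait=True a second job blocks until the first one finishes;
    with wait=False it gives up immediately and returns None.
    Otherwise returns the command's exit code. The lock is released
    automatically when the lock file is closed.
    """
    with open(lock_path, "w") as lock:
        flags = fcntl.LOCK_EX if wait else fcntl.LOCK_EX | fcntl.LOCK_NB
        try:
            fcntl.flock(lock, flags)
        except BlockingIOError:
            return None  # another backup is still running
        return subprocess.run(cmd).returncode
```

Each backup script would then call run_exclusive(["rsync", "-av", ...]) instead of invoking rsync directly, so a late-running job simply delays the next one.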

How can I activate kernel logging?

On 2011-03-03 11:36, pschmidtke wrote:
>
> ok, after all hardware checks I activated my daily backups again (local
> and remote), but I saw that they happened to be done at exactly the same
> time in cron. Now I put them at very different times and for now I don’t
> see any crashes anymore (already 2 successful syncs without crash,
> before it crashed 100% of the time). So just some details on what I was
> syncing.

Curious…

> The local sync was for the system partition (so some of the / like etc,
> usr etc…) that went to another local hard drive (/backup).
> Then the other sync that happened to be simultaneously is done from
> various partitions (but not / and not /backup) to a remote server. Can
> this interfere somehow?! Weird in my opinion.

Not that I know, but… :-?

> How can I activate kernel logging?

It is active by default, but you can increase its verbosity. In
/etc/sysconfig/syslog, set:

KERNEL_LOGLEVEL=7
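For reference, the kernel's own console threshold is visible in /proc/sys/kernel/printk, which holds four numbers: the current console log level, the default level for messages that carry none, the minimum allowed console level, and the default console level. A message is printed to the console only if it is more severe, i.e. numerically lower, than the console level. A small Python sketch of that rule (helper names are made up; it does not check how openSUSE's scripts apply KERNEL_LOGLEVEL):

```python
# Kernel message levels, from most to least severe (0..7).
LEVELS = ["KERN_EMERG", "KERN_ALERT", "KERN_CRIT", "KERN_ERR",
          "KERN_WARNING", "KERN_NOTICE", "KERN_INFO", "KERN_DEBUG"]


def parse_printk(text):
    """Parse the four numbers in /proc/sys/kernel/printk."""
    current, default_message, minimum, default_console = map(int, text.split())
    return {
        "console_loglevel": current,
        "default_message_loglevel": default_message,
        "minimum_console_loglevel": minimum,
        "default_console_loglevel": default_console,
    }


def reaches_console(level, console_loglevel):
    """Only messages more severe (numerically lower) than the console level are printed."""
    return level < console_loglevel
```

So with a console level of 7, everything up to KERN_INFO gets through, but KERN_DEBUG does not. The live values can be read with parse_printk(open("/proc/sys/kernel/printk").read()).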


Cheers / Saludos,

Carlos E. R.

jimoe666 wrote:
> My situation is different from pschmidtke’s.

Sorry to everybody for my noise. I was skimming the thread quickly and
didn’t realize that jimoe666 was trying to hijack somebody else’s thread.

jimoe666, please start your own thread.