Keeping sshfs alive when the system goes to sleep

fperal · December 9, 2021, 8:49pm

I have an sshfs mount in fstab, using an RSA key:

fernando@alphatauri:/home/fernando/.thunderbird  /home/fernando/.thunderbird fuse.sshfs _netdev,user,identityfile=/home/fernando/.ssh/id_rsa,allow_other,uid=fernando,gid=users  0 0
fernando@andromeda:~>

I have just configured the client to go to sleep (suspend to RAM) after 10 minutes of inactivity. Sometimes the sshfs mount still works after resuming, other times it doesn’t. I think it is related to the time it takes to resume from sleep, but I can’t find where this timeout is set.

On the server the sshd_config is:

#       $OpenBSD: sshd_config,v 1.103 2018/04/09 20:41:22 tj Exp $ 

# This is the sshd server system-wide configuration file.  See 
# sshd_config(5) for more information. 

# This sshd was compiled with PATH=/usr/bin:/bin:/usr/sbin:/sbin 

# The strategy used for options in the default sshd_config shipped with 
# OpenSSH is to specify options with their default value where 
# possible, but leave them commented.  Uncommented options override the 
# default value. 

#Port 22 
#AddressFamily any 
#ListenAddress 0.0.0.0 
#ListenAddress :: 

#HostKey /etc/ssh/ssh_host_rsa_key 
#HostKey /etc/ssh/ssh_host_ecdsa_key 
#HostKey /etc/ssh/ssh_host_ed25519_key 

# Ciphers and keying 
#RekeyLimit default none 

# Logging 
#SyslogFacility AUTH 
#LogLevel INFO 

# Authentication: 

#LoginGraceTime 2m 
#PermitRootLogin no
#StrictModes yes 
#MaxAuthTries 6 
#MaxSessions 10 

#PubkeyAuthentication yes 

# The default is to check both .ssh/authorized_keys and .ssh/authorized_keys2 
# but this is overridden so installations will only check .ssh/authorized_keys 
AuthorizedKeysFile      .ssh/authorized_keys 

#AuthorizedPrincipalsFile none 

#AuthorizedKeysCommand none 
#AuthorizedKeysCommandUser nobody 

# For this to work you will also need host keys in /etc/ssh/ssh_known_hosts 
#HostbasedAuthentication no 
# Change to yes if you don't trust ~/.ssh/known_hosts for 
# HostbasedAuthentication 
#IgnoreUserKnownHosts no 
# Don't read the user's ~/.rhosts and ~/.shosts files 
#IgnoreRhosts yes 

# To disable tunneled clear text passwords, change to no here! 
#PasswordAuthentication yes 
#PermitEmptyPasswords no 

# Change to no to disable s/key passwords 
#ChallengeResponseAuthentication yes 

# Kerberos options 
#KerberosAuthentication no 
#KerberosOrLocalPasswd yes 
#KerberosTicketCleanup yes 
#KerberosGetAFSToken no 

# GSSAPI options 
#GSSAPIAuthentication no 
#GSSAPICleanupCredentials yes 
#GSSAPIStrictAcceptorCheck yes 
#GSSAPIKeyExchange no 

# Set this to 'yes' to enable PAM authentication, account processing, 
# and session processing. If this is enabled, PAM authentication will 
# be allowed through the ChallengeResponseAuthentication and 
# PasswordAuthentication.  Depending on your PAM configuration, 
# PAM authentication via ChallengeResponseAuthentication may bypass 
# the setting of "PermitRootLogin without-password". 
# If you just want the PAM account and session checks to run without 
# PAM authentication, then enable this but set PasswordAuthentication 
# and ChallengeResponseAuthentication to 'no'. 
UsePAM yes 

#AllowAgentForwarding yes 
#AllowTcpForwarding yes 
#GatewayPorts no 
X11Forwarding yes 
#X11DisplayOffset 10 
#X11UseLocalhost yes 
#PermitTTY yes 
#PrintMotd yes 
#PrintLastLog yes 
#TCPKeepAlive yes 
#PermitUserEnvironment no 
#Compression delayed 
#ClientAliveInterval 0 
#ClientAliveCountMax 3 
#UseDNS no 
#PidFile /run/sshd.pid 
#MaxStartups 10:30:100 
#PermitTunnel no 
#ChrootDirectory none 
#VersionAddendum none 

# no default banner path 
#Banner none 

# override default of no subsystems 
Subsystem       sftp    /usr/lib/ssh/sftp-server 

# This enables accepting locale enviroment variables LC_* LANG, see sshd_config(5). 
AcceptEnv LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES 
AcceptEnv LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT 
AcceptEnv LC_IDENTIFICATION LC_ALL 

# Example of overriding settings on a per-user basis 
#Match User anoncvs 
#       X11Forwarding no 
#       AllowTcpForwarding no 
#       PermitTTY no 
#       ForceCommand cvs server 
AlphaTauri:~ #

I have read here:

ServerAliveInterval: number of seconds that the client will wait before sending a null packet to the server (to keep the connection alive).
ClientAliveInterval: number of seconds that the server will wait before sending a null packet to the client (to keep the connection alive).
Setting a value of 0 (the default) will disable these features so your connection could drop if it is idle for too long.

… and it seems that is what is happening, but what is “too long”? I can’t find it.

cryptearth · December 11, 2021, 12:31pm

The issue is suspending the machine in the first place. TCP isn’t designed for that, and hence many devices in between that follow the standards just break the connection. That’s what the timeout is for: to detect a “broken” connection and free up system resources. The moment you suspend the client it’s no longer able to respond to the keepalive requests from the server. So the server thinks the client became unresponsive (which is true, as the client is suspended/hibernated and therefore really is unresponsive) and hence resets the broken connection. When the client is woken up again it restores its state as if it had never been suspended/hibernated and tries to send data over the connection it supposes is still open. As the server has already closed it, the client gets a RST (reset) and either “fails” or tries to re-establish the connection.

I’m not sure if this can be done or how, but a “proper” way would be to disconnect before going to sleep and reconnect after wake-up. There are several analogies, a simple phone call for example: no one would keep the call open while sleeping, but would rather hang up before going to sleep and redial after waking up. Otherwise the other side will most likely hang up in between, and that’s exactly what happens here.

TCP isn’t a magic black box that somehow only has to get connected once and can then be used until an active disconnect. The connection has to be checked regularly to see whether it’s still working. If not, it’s assumed broken and reset.

If you want your connection to stay alive - just set your power settings to not let the system go down into suspend or hibernate. Otherwise try to implement a proper disconnect before suspension and reconnect after waking up.
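
One way to sketch the “disconnect before suspension, reconnect after waking up” idea is a systemd sleep hook. This is only a minimal sketch under assumptions: the hook file name (for example /usr/lib/systemd/system-sleep/sshfs-remount) is hypothetical, the mount point is taken from the fstab line earlier in the thread, and the network may not be back yet when the “post” step runs, so the remount can fail and need a retry.

```shell
#!/bin/sh
# Hypothetical sleep hook; the file name and location are assumptions.
# systemd runs executables in /usr/lib/systemd/system-sleep/ with
# $1 = "pre" or "post" and $2 = the sleep action (suspend, hibernate, ...).

# Mount point from the fstab entry in this thread; override via the environment.
MOUNTPOINT="${MOUNTPOINT:-/home/fernando/.thunderbird}"

handle_sleep_event() {
    case "$1" in
        pre)
            # Lazy unmount so a stalled connection cannot block the suspend.
            if mountpoint -q "$MOUNTPOINT"; then
                umount -l "$MOUNTPOINT"
            fi
            ;;
        post)
            # Remount from fstab; tolerate failure if the network is not up yet.
            mount "$MOUNTPOINT" || true
            ;;
    esac
}

handle_sleep_event "${1:-}"
```

The hook has to be executable and runs as root, so the fstab entry is used as-is. Any other argument is ignored.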

fperal · December 12, 2021, 12:30am

I understand what you say, but I was just trying it out. I expected the connection to be broken after sleep/resume, but that’s not the case: I sleep/resume and the connection is still alive and works fine. The connection dies only if the client sleeps for a long time; I’m not sure how long, but I think at least 2 hours. So the question is: where can I configure the timeout length? Setting it to 24h, for instance, would work for me.

deano_ferrari · December 12, 2021, 12:34am

man sshd_config

In particular…

ClientAliveCountMax
         Sets the number of client alive messages which may be sent without sshd(8) receiving any messages back from
         the client. If this threshold is reached while client alive messages are being sent, sshd will disconnect
         the client, terminating the session. It is important to note that the use of client alive messages is very
         different from TCPKeepAlive. The client alive messages are sent through the encrypted channel and therefore
         will not be spoofable. The TCP keepalive option enabled by TCPKeepAlive is spoofable. The client alive
         mechanism is valuable when the client or server depend on knowing when a connection has become unresponsive.

         The default value is 3.  If ClientAliveInterval is set to 15, and ClientAliveCountMax is left at the
         default, unresponsive SSH clients will be disconnected after approximately 45 seconds.  Setting a zero
         ClientAliveCountMax disables connection termination.

ClientAliveInterval
         Sets a timeout interval in seconds after which if no data has been received from the client, sshd(8) will
         send a message through the encrypted channel to request a response from the client.  The default is 0,
         indicating that these messages will not be sent to the client.
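
The arithmetic behind the man page's “approximately 45 seconds” can be written out as a small helper. This is only an illustrative sketch; the function name is made up, and the result is approximate because sshd only starts counting once the connection has gone quiet.

```shell
# Approximate seconds before sshd disconnects an unresponsive client:
# ClientAliveInterval multiplied by ClientAliveCountMax.
client_alive_timeout() {
    interval=$1   # ClientAliveInterval, in seconds
    count=$2      # ClientAliveCountMax
    echo $(( interval * count ))
}

client_alive_timeout 15 3   # the man page example: prints 45
```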

arvidjaar · December 12, 2021, 7:58am

It is rather more complicated. To name a few:

If the client was suspended while the server was transmitting data, the server will close the connection after the TCP retransmission attempts run out (15 by default on Linux).
There is a TCP-level timeout that is enabled by default (TCPKeepAlive). This is independent of application-level keepalives. It means slightly more than 2 hours with default Linux settings.
If the client is behind NAT, the connection tracking timeout is likely much shorter than anything else and the connection will be dropped.
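
The “slightly more than 2 hours” follows from the three kernel keepalive settings: the idle time before the first probe, plus the probe interval multiplied by the number of unanswered probes. A small sketch of that arithmetic, using the Linux defaults (7200, 75 and 9); the function name is made up for illustration:

```shell
# Seconds an idle TCP connection survives before the kernel declares a
# silent peer dead: tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes.
tcp_keepalive_deadline() {
    idle=$1     # net.ipv4.tcp_keepalive_time
    intvl=$2    # net.ipv4.tcp_keepalive_intvl
    probes=$3   # net.ipv4.tcp_keepalive_probes
    echo $(( idle + intvl * probes ))
}

tcp_keepalive_deadline 7200 75 9   # Linux defaults: prints 7875
```

7875 seconds is 2 hours, 11 minutes and 15 seconds.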

This really must be supported in the application itself, i.e. sshfs must be able to transparently establish a new connection and continue.
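
For what it's worth, sshfs does have a reconnect mount option, and it passes ssh options such as ServerAliveInterval through to the underlying ssh process. A hedged sketch of the fstab entry from the first post with those options added; the values 15 and 3 are assumptions chosen for illustration, and whether this survives a long suspend in this particular setup is untested:

```
fernando@alphatauri:/home/fernando/.thunderbird  /home/fernando/.thunderbird fuse.sshfs _netdev,user,identityfile=/home/fernando/.ssh/id_rsa,allow_other,uid=fernando,gid=users,reconnect,ServerAliveInterval=15,ServerAliveCountMax=3  0 0
```

The idea is that the client notices a dead connection after roughly 45 seconds instead of hanging, and sshfs then opens a new one. Files that were open across the reconnect may still return errors.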

arvidjaar · December 12, 2021, 8:31am

Or if the systems have a stateful firewall, which is much more likely.

deano_ferrari · December 12, 2021, 9:30am

Yes, more moving parts…I forgot about TCP parameters (not something I’ve needed to frequent much)…
More info
https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
https://pracucci.com/linux-tcp-rto-min-max-and-tcp-retries2.html

fperal · December 12, 2021, 12:40pm

That seems to be the timeout I’m experiencing.

So setting TCPKeepAlive to 0 in sshd_config will not work?

fperal · December 12, 2021, 5:46pm

Well, I confirm it didn’t, so I guess there are two ways to do it:

Change the behavior of the system TCPKeepAlive
Use some scripts before sleep and after resume to unmount/mount the filesystem

I’m going to do some research and I will post it here.

fperal · December 12, 2021, 5:54pm

A third option would be to use NFS. I have checked that NFS mounts keep working after resume without problems.

deano_ferrari · December 12, 2021, 6:47pm

No, it’s a kernel parameter.

To view the current settings use…

sudo sysctl -a | grep tcp

While they can be set on the fly with the sysctl command (man sysctl for more info), for permanent configuration add the required parameter(s) to /etc/sysctl.conf, or create something like /etc/sysctl.d/95-custom.conf and add them there. (Take care to understand what they do first, as there may be unintended impacts on the system.)
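
As an illustration only, a drop-in file for the keepalive idle time might look like the fragment below. The file name is the one suggested above, and 86400 (24 hours) is just the value the original poster is asking about, not a recommendation. The setting is system-wide and affects every TCP socket that enables keepalive.

```
# /etc/sysctl.d/95-custom.conf
# Idle seconds before the kernel sends the first TCP keepalive probe.
net.ipv4.tcp_keepalive_time = 86400
```

Running sysctl --system as root should reload the drop-in files without a reboot.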

https://www.cyberciti.biz/faq/howto-set-sysctl-variables/

hcvv · December 12, 2021, 7:18pm

When NFS is a possibility I do not understand why you considered something else.

And if you are going to use NFS, I would recommend using the systemd automount feature in the fstab entries on your client. This mounts the file system when it is needed and unmounts it after an idle timeout when it is not. That alone would already avoid a considerable number of broken connections.
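
A sketch of what such an fstab entry could look like, assuming the server exports the same directory over NFS (the export path and the 10 minute idle timeout are assumptions, not taken from the thread):

```
alphatauri:/home/fernando/.thunderbird  /home/fernando/.thunderbird  nfs  _netdev,noauto,x-systemd.automount,x-systemd.idle-timeout=10min  0 0
```

With x-systemd.automount the mount happens on first access, and x-systemd.idle-timeout unmounts it again once it has been unused for that long.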

fperal · December 12, 2021, 7:41pm

AlphaTauri:~ # sysctl -a | grep tcp
net.ipv4.tcp_abort_on_overflow = 0
net.ipv4.tcp_adv_win_scale = 1
net.ipv4.tcp_allowed_congestion_control = reno cubic
net.ipv4.tcp_app_win = 31
net.ipv4.tcp_autocorking = 1
net.ipv4.tcp_available_congestion_control = reno cubic
net.ipv4.tcp_available_ulp =
net.ipv4.tcp_base_mss = 1024
net.ipv4.tcp_challenge_ack_limit = 1000
net.ipv4.tcp_comp_sack_delay_ns = 1000000
net.ipv4.tcp_comp_sack_nr = 44
net.ipv4.tcp_congestion_control = cubic
net.ipv4.tcp_dsack = 1
net.ipv4.tcp_early_demux = 1
net.ipv4.tcp_early_retrans = 3
net.ipv4.tcp_ecn = 2
net.ipv4.tcp_ecn_fallback = 1
net.ipv4.tcp_fack = 0
net.ipv4.tcp_fastopen = 1
net.ipv4.tcp_fastopen_blackhole_timeout_sec = 3600
net.ipv4.tcp_fastopen_key = 00000000-00000000-00000000-00000000
net.ipv4.tcp_fin_timeout = 60
net.ipv4.tcp_frto = 2
net.ipv4.tcp_fwmark_accept = 0
net.ipv4.tcp_invalid_ratelimit = 500
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_l3mdev_accept = 0
net.ipv4.tcp_limit_output_bytes = 1048576
net.ipv4.tcp_low_latency = 0
net.ipv4.tcp_max_orphans = 16384
net.ipv4.tcp_max_reordering = 300
net.ipv4.tcp_max_syn_backlog = 256
net.ipv4.tcp_max_tw_buckets = 16384
net.ipv4.tcp_mem = 45285        60383   90570
net.ipv4.tcp_min_rtt_wlen = 300
net.ipv4.tcp_min_snd_mss = 48
net.ipv4.tcp_min_tso_segs = 2
net.ipv4.tcp_moderate_rcvbuf = 1
net.ipv4.tcp_mtu_probing = 0
net.ipv4.tcp_no_metrics_save = 0
net.ipv4.tcp_notsent_lowat = 4294967295
net.ipv4.tcp_orphan_retries = 0
net.ipv4.tcp_pacing_ca_ratio = 120
net.ipv4.tcp_pacing_ss_ratio = 200
net.ipv4.tcp_probe_interval = 600
net.ipv4.tcp_probe_threshold = 8
net.ipv4.tcp_recovery = 1
net.ipv4.tcp_reordering = 3
net.ipv4.tcp_retrans_collapse = 1
net.ipv4.tcp_retries1 = 3
net.ipv4.tcp_retries2 = 15
net.ipv4.tcp_rfc1337 = 0
net.ipv4.tcp_rmem = 4096        131072  6291456
net.ipv4.tcp_rx_skb_cache = 0
net.ipv4.tcp_sack = 1
net.ipv4.tcp_slow_start_after_idle = 1
net.ipv4.tcp_stdurg = 0
net.ipv4.tcp_syn_retries = 6
net.ipv4.tcp_synack_retries = 5
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_thin_linear_timeouts = 0
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_tso_win_divisor = 3
net.ipv4.tcp_tw_reuse = 2
net.ipv4.tcp_tx_skb_cache = 0
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_wmem = 4096        16384   4194304
net.ipv4.tcp_workaround_signed_windows = 0
net.netfilter.nf_conntrack_tcp_be_liberal = 0
net.netfilter.nf_conntrack_tcp_ignore_invalid_rst = 0
net.netfilter.nf_conntrack_tcp_loose = 1
net.netfilter.nf_conntrack_tcp_max_retrans = 3
net.netfilter.nf_conntrack_tcp_timeout_close = 10
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_established = 432000
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 30
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 60
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 120
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 300
AlphaTauri:~ #

Or, more specifically:

AlphaTauri:~ # sysctl net.ipv4.tcp_keepalive_time
net.ipv4.tcp_keepalive_time = 7200
AlphaTauri:~ #

which is exactly 2 hours

I will try 24h

AlphaTauri:~ # sysctl net.ipv4.tcp_keepalive_time=86400
net.ipv4.tcp_keepalive_time = 86400
AlphaTauri:~ #

Is there any reason why 2h is okay and 24h isn’t?

fperal · December 12, 2021, 7:44pm

Well, I configured it so that I could use it both from outside home and from inside, with the same method. But in fact, when I use it from outside I only use it for a few minutes, while at home I have it mounted all day, so I think you’re right and it will be better to use different methods.

arvidjaar · December 13, 2021, 6:36am

No, TCPKeepAlive is an ssh(d) option.

deano_ferrari · December 13, 2021, 6:47am

My bad. I thought you were referring to the kernel TCP parameters as well.

arvidjaar · December 13, 2021, 7:02am

The kernel defines when TCP keepalive probes start and how many times they are retried. Whether they are used at all is decided by the per-socket option SO_KEEPALIVE, which is set by TCPKeepAlive.

To avoid any timeout one needs, at the very least, to disable TCPKeepAlive (on both sides), ClientAliveInterval, ServerAliveInterval and the firewall on both sides. If there is any NAT in between, it will likely time out anyway.
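
A sketch of the ssh-side settings this describes, assuming a per-host block in the client's ~/.ssh/config (the Host name is taken from the fstab line in the first post; firewall and NAT timeouts are outside what these files control). A value of 0 is already the default for both interval options, so only TCPKeepAlive actually changes:

```
# Server: /etc/ssh/sshd_config
TCPKeepAlive no
ClientAliveInterval 0

# Client: ~/.ssh/config
Host alphatauri
    TCPKeepAlive no
    ServerAliveInterval 0
```

sshd has to be restarted or reloaded for the server-side change to apply, and it only affects connections made afterwards.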

deano_ferrari · December 13, 2021, 7:26am

Ok, thanks for clarifying.

fperal · December 13, 2021, 12:03pm

Well, in fact changing TCPKeepAlive in sshd_config did not work, but changing net.ipv4.tcp_keepalive_time with sysctl did.

cryptearth · December 14, 2021, 1:43am

It’s not for me to judge, but tinkering with timeouts meant to detect a fault, just because you seem to prefer suspension/hibernation over a clean shutdown, sounds quite wrong to me.