Timeout waaay too long

Timeout waaay too long

Daniel K-2
Hi,

Ever since I began sharing files between Linux and Windows systems I've
been suffering from one major problem that affects smb and nfs as well as
cifs: long lockups when the server goes down.

If the serving computer shuts down or crashes, every process hangs when
trying to access the still mounted share (just try df, mount, or ls). You
can't even umount the share in this condition! With smb this was
especially severe, as you couldn't mount a share with a "soft" option, so
the process simply hung until I pressed reset. With cifs it is much
better, but the timeout is still very long, and you still cannot umount
an offline share to prevent further lockups.

My suggestion would be to make the timeout adjustable (currently it is
something like 30 seconds on every access), and perhaps to umount shares
automatically when they become unavailable. At the very least, one should
be able to umount such a share.

Best regards,
Daniel
_______________________________________________
linux-cifs-client mailing list
[hidden email]
https://lists.samba.org/mailman/listinfo/linux-cifs-client

Re: Timeout waaay too long

studdugie
I wholeheartedly agree! CIFS's inability to gracefully handle network
failure is ridiculous! As Daniel suggests, at a minimum one should be
able to unmount a failed share. I'm forced to periodically restart an
otherwise stable Linux box just because of this issue.

I've complained about it before but I think it has fallen on deaf
ears. If I had the resources to fix it myself I would but I don't so I
can't.

Dane


Re: Timeout waaay too long

Daniel K-2
Hi,

I once tried to reduce the hard-coded timeout values with a patch:

***begin***
--- a/fs/cifs/connect.c    2006-02-15 18:31:49.000000000 +0100
+++ b/fs/cifs/connect.c    2006-02-15 18:29:09.000000000 +0100
@@ -1433,7 +1433,7 @@
         user space buffer */
      cFYI(1,("sndbuf %d rcvbuf %d rcvtimeo 0x%lx",
          (*csocket)->sk->sk_sndbuf,
          (*csocket)->sk->sk_rcvbuf, (*csocket)->sk->sk_rcvtimeo));
-    (*csocket)->sk->sk_rcvtimeo = 7 * HZ;
+    (*csocket)->sk->sk_rcvtimeo = 2 * HZ;
     /* make the bufsizes depend on wsize/rsize and max requests */
     if((*csocket)->sk->sk_sndbuf < (200 * 1024))
         (*csocket)->sk->sk_sndbuf = 200 * 1024;
@@ -1552,7 +1552,7 @@
     /* Eventually check for other socket options to change from
         the default. sock_setsockopt not used because it expects
         user space buffer */
-    (*csocket)->sk->sk_rcvtimeo = 7 * HZ;
+    (*csocket)->sk->sk_rcvtimeo = 2 * HZ;
       
     return rc;
 }
--- a/fs/cifs/transport.c    2006-02-15 18:31:49.000000000 +0100
+++ b/fs/cifs/transport.c    2006-02-15 18:29:09.000000000 +0100
@@ -426,7 +426,7 @@
     else if (long_op > 2) {
         timeout = MAX_SCHEDULE_TIMEOUT;
     } else
-        timeout = 15 * HZ;
+        timeout = 3 * HZ;
     /* wait for 15 seconds or until woken up due to response arriving or
        due to last connection to this server being unmounted */
     if (signal_pending(current)) {
@@ -437,12 +437,10 @@
 
     /* No user interrupts in wait - wreaks havoc with performance */
     if(timeout != MAX_SCHEDULE_TIMEOUT) {
-        timeout += jiffies;
-        wait_event(ses->server->response_q,
-            (!(midQ->midState & MID_REQUEST_SUBMITTED)) ||
-            time_after(jiffies, timeout) ||
+    wait_event_interruptible_timeout(ses->server->response_q,
+          (!(midQ->midState & MID_REQUEST_SUBMITTED)) ||
             ((ses->server->tcpStatus != CifsGood) &&
-             (ses->server->tcpStatus != CifsNew)));
+             (ses->server->tcpStatus != CifsNew)), timeout);
     } else {
         wait_event(ses->server->response_q,
             (!(midQ->midState & MID_REQUEST_SUBMITTED)) ||
@@ -693,7 +691,7 @@
     else if (long_op > 2) {
         timeout = MAX_SCHEDULE_TIMEOUT;
     } else
-        timeout = 15 * HZ;
+        timeout = 3 * HZ;
     /* wait for 15 seconds or until woken up due to response arriving or
        due to last connection to this server being unmounted */
     if (signal_pending(current)) {
@@ -704,12 +702,10 @@
 
     /* No user interrupts in wait - wreaks havoc with performance */
     if(timeout != MAX_SCHEDULE_TIMEOUT) {
-        timeout += jiffies;
-        wait_event(ses->server->response_q,
-            (!(midQ->midState & MID_REQUEST_SUBMITTED)) ||
-            time_after(jiffies, timeout) ||
+    wait_event_interruptible_timeout(ses->server->response_q,
+          (!(midQ->midState & MID_REQUEST_SUBMITTED)) ||
             ((ses->server->tcpStatus != CifsGood) &&
-             (ses->server->tcpStatus != CifsNew)));
+             (ses->server->tcpStatus != CifsNew)), timeout);
     } else {
         wait_event(ses->server->response_q,
             (!(midQ->midState & MID_REQUEST_SUBMITTED)) ||
--- a/fs/cifs/cifssmb.c    2006-02-15 21:44:14.000000000 +0100
+++ b/fs/cifs/cifssmb.c    2006-02-15 18:29:09.000000000 +0100
@@ -111,7 +111,7 @@
                    timeout which is 7 seconds */
             while(tcon->ses->server->tcpStatus == CifsNeedReconnect) {
                 wait_event_interruptible_timeout(tcon->ses->server->response_q,
-                    (tcon->ses->server->tcpStatus == CifsGood), 10 * HZ);
+                    (tcon->ses->server->tcpStatus == CifsGood), 3 * HZ);
                 if(tcon->ses->server->tcpStatus == CifsNeedReconnect) {
                     /* on "soft" mounts we wait once */
                     if((tcon->retry == FALSE) ||
@@ -221,7 +221,7 @@
                    timeout which is 7 seconds */
             while(tcon->ses->server->tcpStatus == CifsNeedReconnect) {
                 wait_event_interruptible_timeout(tcon->ses->server->response_q,
-                    (tcon->ses->server->tcpStatus == CifsGood), 10 * HZ);
+                    (tcon->ses->server->tcpStatus == CifsGood), 3 * HZ);
                 if(tcon->ses->server->tcpStatus ==
                         CifsNeedReconnect) {
                     /* on "soft" mounts we wait once */
***end***

But I'm not an expert either, so perhaps someone with a better
understanding of the code could post some ideas. Another approach:
perhaps it is possible to check whether the server is online by simply
pinging it prior to any access. In a LAN, servers normally respond
within just a few milliseconds, so even an extremely short timeout
would do the trick and save a lot of trouble.
I am sure that with such a feature added to an upcoming version of CIFS,
it would gain a clear advantage over NFS and SMB.
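The "ping before access" idea could even be prototyped from userspace before touching any kernel code. Below is a rough sketch (my own illustration, nothing from the cifs client; the function name is made up, and using a short TCP connect to the server's port in place of an ICMP ping is an assumption) of a reachability probe with a bounded timeout via a non-blocking connect:

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/* Return 1 if a TCP connection to host:port completes within timeout_ms,
 * 0 otherwise (host down, address invalid, port closed, or timed out).
 * Illustrative only. */
int server_reachable(const char *host, int port, int timeout_ms)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return 0;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, host, &addr.sin_addr) != 1) {
        close(fd);
        return 0;
    }

    fcntl(fd, F_SETFL, O_NONBLOCK);  /* so connect() cannot block forever */

    int ok = 0;
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
        ok = 1;                      /* connected immediately */
    } else if (errno == EINPROGRESS) {
        fd_set wfds;
        FD_ZERO(&wfds);
        FD_SET(fd, &wfds);
        struct timeval tv = { timeout_ms / 1000, (timeout_ms % 1000) * 1000 };
        /* writable within the timeout means the handshake finished */
        if (select(fd + 1, NULL, &wfds, NULL, &tv) > 0) {
            int err = 0;
            socklen_t len = sizeof(err);
            getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);
            ok = (err == 0);
        }
    }
    close(fd);
    return ok;
}
```

A probe like this returns within the given timeout even when the host is silently dropping packets, which is exactly the case where the mounted share currently hangs.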

Best regards,
Daniel

Re: Timeout waaay too long

Daniel-72
Hello!?

Is this list / CIFS development dead? That would be unfortunate...

Re: Timeout waaay too long

Steven French

No - I have been at Connectathon, testing the code against a wide variety of the major server vendors. I found and fixed one important client bug (plus half a dozen smaller/minor fixes and workarounds) and found various problems in most of the servers.


Steve French
Senior Software Engineer
Linux Technology Center - IBM Austin
phone: 512-838-2294
email: sfrench at-sign us dot ibm dot com

Re: Timeout waaay too long

Daniel K-2
Hi,

I see. Sorry for my impatience.

Re: Timeout waaay too long

Steve French (smfltc)
> been suffering from one major problem that affects smb, nfs as well as
>cifs: Long lock ups on server down times.
>
>If the serving computer shuts down / crashes, every process hangs when
>trying to access the still mounted share. Just try df, mount or ls. You
>can't even umount the share in this condition! When using smb this was
>very severe as you couldn't mount a share with a "soft" option. So the
>process in fact just hung until I pressed reset. Now, with cifs it is
>much better. But still the timeout  is very long. And you still cannot
>umount an offline share to prevent further lock ups.
To bring other users up to speed, I should recap the current implementation/
behavior of the cifs code here (if this does not match users' experiences
with current code, let me know).

SEND: We will attempt to send CIFS requests (on stuck or full tcp sockets)
for approximately seven seconds in smb_send, or 15 seconds in smb_send2
(since the latter handles large writes, commonly 52K, which presumably
could take longer), but we do not alter the sk_sndtimeo from its default
(infinite). The Linux NFS server (but interestingly not the NFS client)
alters the socket's sk_sndtimeo, but it is not clear whether sk_sndtimeo
needs to be set to a lower value (perhaps 10-30 seconds), since the
kernel socket API seems to report back EAGAIN and ENOSPC as needed
already.
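For readers unfamiliar with the knob being discussed: sk_sndtimeo/sk_rcvtimeo are the in-kernel counterparts of the SO_SNDTIMEO/SO_RCVTIMEO socket options. A minimal userspace sketch of setting them (illustrative only; the cifs client sets sk->sk_rcvtimeo on its kernel socket directly rather than going through setsockopt):

```c
#include <sys/socket.h>
#include <sys/time.h>

/* Set both the send and receive timeouts of a socket to `seconds`.
 * Returns 0 on success, -1 on error. Userspace illustration of the
 * tunable; not how the kernel cifs client does it. */
int set_io_timeouts(int fd, int seconds)
{
    struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };

    if (setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) != 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) != 0)
        return -1;
    return 0;
}
```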

RECEIVE: If the request was put on the wire successfully, we wait
different amounts of time depending on the type of request.
1) blocking requests (blocking byte range lock requests and ChangeNotify
(dnotify) requests) wait forever (unless we umount or kill the requesting
thread).

2) "long" requests, such as single page or smaller writes past end
of file, block 180 seconds. This probably should be changed to
happen only when the write is far past end of file, not when
the file length is being incremented by only a page or so.
Some servers (such as Win9x, IIRC) which don't make files
sparse can take a long time when a write is made far beyond
end of file. We should also be setting this longer timeout
on "offline" files, but currently I don't have an easy way
to test this, and the cifs client ignores the DOS
offline attribute flag. Suggestions on how to get Windows
to set/return the offline flag would be appreciated.

3) "medium" requests block 45 seconds (to allow oplock breaks,
which the server may have to send to hung clients, to time out;
this can take from 20-40 seconds depending on the server). This
includes NTCreateX (and legacy OpenX), but perhaps should be
expanded to include other path based calls (SetPathInfo and Delete).
It also includes calls to cifs_writepages using iovecs and
the new SMBWrite2/smb_send2 interface, as well as
writes which are not past the end of file.

4) "normal" requests time out in 15 seconds.

5) nonblocking requests (client responses to oplock break requests,
RFC 1001 session inits) do not time out; we return immediately.
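The five classes above can be condensed into a small lookup. The sketch below is purely illustrative (the enum and function names are mine; the kernel encodes the class in the long_op parameter and works in jiffies, multiplying by HZ, rather than in seconds):

```c
/* Request timeout classes as described in the recap above. */
enum req_class {
    REQ_BLOCKING,    /* byte range locks, ChangeNotify */
    REQ_LONG,        /* small writes past end of file  */
    REQ_MEDIUM,      /* NTCreateX/OpenX, writepages    */
    REQ_NORMAL,      /* everything else                */
    REQ_NONBLOCKING  /* oplock break responses, etc.   */
};

/* Seconds to wait for a response; -1 means wait forever
 * (only umount or killing the thread gets you out). */
long timeout_secs(enum req_class c)
{
    switch (c) {
    case REQ_BLOCKING:    return -1;
    case REQ_LONG:        return 180;
    case REQ_MEDIUM:      return 45;
    case REQ_NORMAL:      return 15;
    case REQ_NONBLOCKING: return 0;
    }
    return 15;
}
```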


TIMEOUTS: Stuck requests are noticed on certain errors coming
back from the socket and also in the cifs_dnotify_thread which
wakes up the request and response queues every 15 seconds (to
allow them to check their timeout flags).   Note that this change
to cifs_dnotify_thread was added not that long ago. If any SMBs
time out we kill the socket (or if the socket goes dead for other
reasons) and we try to reconnect and in some cases retry the request.
If the "hard" mount option is set ("soft" is the default) then cifs
will try to reconnect until umount (see smb_init in fs/cifs/cifssmb.c).
On path based calls (setattr, readdir, open, delete, mkdir, rmdir,
etc.), we attempt to retry once even if the "soft" mount option is
specified, but we only wait 10 seconds in smb_init (or small_smb_init)
for cifsd to reconnect the dead socket before giving up and returning to
the caller. On handle based calls (e.g. read and write) we cannot
retry within cifssmb.c, since the handles on a dead session are
no longer valid; retries do occur in the calling function, but
with the similar restriction that we only wait 10 seconds before
giving up. It is somewhat dangerous to time out reads/writes
back to the user, since that causes a page fault (unlike path
based calls such as open, which are expected to be able to fail,
and thus most applications handle those errors more gracefully).

UMOUNT: The cifs umount code was fixed in the last six months to
handle stuck requests and responses better. Basically, umount
tries to mark the mount as closing (see cifs_umount_begin
in fs/cifs/cifsfs.c), wakes up all requests that are stuck waiting
on tcp sends, wakes up all requests that are stuck waiting
for SMB responses (from presumably hung servers), then retries
waking up stuck requests (to catch requests blocked on the
max request count of 50 active on the wire per session, which could
have snuck in when the max request count went under 50 and thus blocked
again). I did various tests last week, both killing servers
("killall smbd") and simulating different types of hangs by pulling
the network cable and by going into smbd in gdb to simulate
server hangs, and umount always worked (even without requiring
the force flag, i.e. "umount /cifsmount --force"). The only "slow
umount" (which takes about 15 seconds) is the case of NetApp servers,
where some of their servers can return a malformed ulogoffX SMB
response and thus cause umount to block waiting for a good response
before giving up on the server and killing the tcp session explicitly.
So I would like to know if you have an umount scenario, on reasonably
current code, in which cifs won't umount (within 15 seconds) and should.
If such a scenario exists we will need to look at the block ids
("echo t > /proc/sysrq-trigger" and then dmesg) to find out whether
cifs could reasonably wake up requests queued on that particular
block id (presumably blocked outside of cifs), but I am aware
of no such problem at this time.

>Perhaps it is possible to check whether the server is online by simply
> pinging it prior to any access. I mean, in a LAN servers normally
> respond within just a few milliseconds, so even an extremely short time
> out would do the trick and save a lot of trouble.
SMB echo could be used for this purpose (and can also vary payload
sizes to help estimate delay). Anyone care to suggest a design?
A possible approach is to launch a new long-running cifs kernel thread,
or to use the existing cifs_dnotify_thread (which wakes up every 15
seconds to check for stuck requests).
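The bookkeeping such a polling thread would need is simple. Here is a hedged userspace sketch (the struct and function names are mine; a real implementation would send an SMB echo from cifs_dnotify_thread every 15 seconds and kill the tcp session once this returns 1):

```c
/* Per-server watchdog state: count consecutive failed echo probes. */
struct server_watch {
    int misses;      /* consecutive failed probes so far        */
    int max_misses;  /* declare the server dead at this count   */
};

/* Record one probe result. Returns 1 when the server should be
 * declared dead (max_misses consecutive failures), 0 otherwise.
 * A single successful echo resets the count. */
int record_probe(struct server_watch *w, int echo_ok)
{
    if (echo_ok) {
        w->misses = 0;
        return 0;
    }
    return ++w->misses >= w->max_misses;
}
```

With, say, max_misses = 2 and a 15 second polling interval, a dead server would be detected within about 30 seconds without any in-flight request having to wait out its own full timeout.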

> I am sure, with the feature added to an upcoming version of CIFS, it
> would get a far superior advantage over NFS and SMB.
There are a few cases where cifs is faster Linux to Linux/Samba
than nfs would be (certain write cases, and also cases in which the
oplock caching advantages of cifs outweigh nfs's advantages in
dispatching more read requests at one time and responding to
reads faster), and a few cases where cifs has functional
advantages (although nfs has a big advantage in one key functional
area, i.e. handling advisory locking well, which we are close to
implementing due to the recent work on the server side from jra
of the samba team). When mounting to Windows and similar server systems
cifs has more substantial advantages, but for the Linux to Linux/Samba
case (vs. NFS) it is hard to generalize, since both implementations
are moving targets and the Linux clients for each are among the fastest
moving (i.e. most updated) components of the Linux kernel, at least
when measured by number of changesets per month. In addition,
NFS version 4 adds some "cifs-like" features and offers a third
interesting alternative.

I don't mind pursuing three types of changes here:
1) "Poll" the server via periodic SMB echo and reduce
the timeouts sharply (or even kill the tcp session to the
server) if the server stops responding to SMB echo. This
will be an even more powerful approach in conjunction with
failover when DFS replicas are available for that share
(which the cifs code can currently recognize but not connect to).

2) Fix the "hard" vs. "soft" mount option for cifs to be recognized
in more places in the code.

3) Allow the request timeouts to be configurable (via a new mount
option "timeo"), as we see with nfs version 4. See below:

       hard   The program accessing a file on a NFS mounted file system will hang when the server
              crashes. The process cannot be interrupted or killed unless you also specify intr. When
              the NFS server is back online the program will continue undisturbed from where it was.
              This is probably what you want.

       soft   This option allows the kernel to time out if the NFS server is not responding for some
              time. The time can be specified with timeo=time. This timeout value is expressed in
              tenths of a second. The soft option might be useful if your NFS server sometimes doesn't
              respond or will be rebooted while some process tries to get a file from the server.
              Avoid using this option with proto=udp or with a short timeout.
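If cifs adopted an nfs-style "timeo" option, the parsing side would be straightforward. A userspace sketch (hypothetical: the real cifs option parsing lives in cifs_parse_mount_options in fs/cifs/connect.c, which this does not mirror, and cifs does not currently accept "timeo" at all):

```c
#include <stdlib.h>
#include <string.h>

/* Extract a "timeo=<tenths-of-a-second>" token from a mount option
 * string such as "soft,timeo=30,retrans=2". Returns the value in
 * tenths of a second, or `def` if the option is absent. */
long parse_timeo(const char *opts, long def)
{
    const char *p = strstr(opts, "timeo=");
    if (p == NULL)
        return def;
    /* require a whole token: start of string or preceded by a comma,
     * so e.g. "xtimeo=9" does not match */
    if (p != opts && p[-1] != ',')
        return def;
    return strtol(p + 6, NULL, 10);
}
```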

I would like to hear experimental feedback and suggestions from users on this topic, as it is
hard to predict the types of failure scenarios that today's complex networks can produce (with
routers that lose packets, firewalls that silently "swallow" connection requests on certain ports,
and servers/OSes with various bugs that can cause different types of requests to hang).


Re: Re: Timeout waaay too long

Daniel K-2
Hi,
> So I would like to know if you have an umount scenario on reasonably
> current code in which cifs won't umount (within 15 seconds) and should.
I just did some testing. I have a system running Kanotix with kernel
2.6.14. It obviously is too old and shows even more problems. The other
system, on which I first came across these problems, is a Linux test
system which works as a digital video recorder. With recordings
distributed across the network, it can get very annoying if, e.g., the
on-screen display freezes because another system is down.
Anyway, this system is running 2.6.16-rc5 and was using an older version
of busybox. That might be the reason umount didn't work. I compiled
your umount.cifs.c and it works just fine! Then I tried a newer busybox
version, and it too seemed to work.
So, don't worry: there are no umount related problems. Only the long
timeouts are a bit annoying.

Bye and thanks for looking into this subject,
Daniel