jCIFS deadlock issue - when can we expect a fix?

jCIFS deadlock issue - when can we expect a fix?

Ronny Schuetz
Hi Michael,

we're currently constantly hitting the jCIFS deadlock between
SmbSession.sessionSetup() and Transport.disconnect(). I already tried to
fix it but obviously I don't have the complete overview of the
internal locking strategy. Any chance to get this fixed soon?

Thanks & Best regards,
Ronny


Re: jCIFS deadlock issue - when can we expect a fix?

Michael B Allen-4
On Mon, 20 Feb 2006 10:55:26 +0100
Ronny Schuetz <[hidden email]> wrote:

> Hi Michael,
>
> we're currently constantly hitting the jCIFS deadlock between
> SmbSession.sessionSetup() and Transport.disconnect(). I already tried to
> fix it but obviously I don't have the complete overview of the
> internal locking strategy. Any chance to get this fixed soon?

I don't have the time to make the change and then go through a
stabilization period. You're on your own. Sorry.

Look at the archives about this. There were two solutions discussed. One
moved a statement out of a synchronized() block. It wasn't correct
synchronization but I think the author of the patch claimed that it
worked. I believe I speculated at the time that the correct solution was
to add some kind of refcounting to certain objects. I don't remember the
details. If you locate the thread in the archives send me a pointer and
I'll see if I can narrow it down further.

Mike

Re: jCIFS deadlock issue - when can we expect a fix?

Ronny Schuetz
Hi Mike,

> I don't have the time to make the change and then go through a
> stabilization period. You're on your own. Sorry.

Ok, no problem. It already helps a lot to have a responsive and
responsible developer available.

> Look at the archives about this. There were two solutions discussed. One
> moved a statement out of a synchronized() block. It wasn't correct
> synchronization but I think the author of the patch claimed that it
> worked. I believe I speculated at the time that the correct solution was
> to add some kind of refcounting to certain objects. I don't remember the
> details. If you locate the thread in the archives send me a pointer and
> I'll see if I can narrow it down further.

Ok, thank you very much for the pointer. I'll see what I can do and let
you know in case of questions. I don't want to change too much, so that
the solution stays easily portable to future versions.

The current solution is basically an additional synchronization in
SmbSession#send() and Transport#disconnect() on an additional object per
Transport instance (_oSetupDisconnectMutex):
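
For reference, the mutex itself is just a plain object. A minimal sketch,
assuming it is declared as an additional field directly on the Transport
class (only the name comes from the patch; its exact placement and
initialization are assumptions):

/* additional per-transport lock object, taken by SmbSession#send() and
   Transport#disconnect() before their existing locks */
final Object _oSetupDisconnectMutex = new Object();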

SmbSession#send():

void send( ServerMessageBlock request,
           ServerMessageBlock response ) throws SmbException {
    if( response != null ) {
        response.received = false;
    }
    /* serialize session setup against Transport#disconnect() */
    synchronized( transport._oSetupDisconnectMutex ) {
        expiration = System.currentTimeMillis() + SmbTransport.SO_TIMEOUT;
        sessionSetup( request, response );
        if( response != null && response.received ) {
            return;
        }
        request.uid = uid;
        request.auth = auth;
        transport.send( request, response );
    }
}

Transport#disconnect():
public void disconnect( boolean hard ) throws IOException {
    /* take the setup/disconnect mutex before the transport lock */
    synchronized( _oSetupDisconnectMutex ) {
        synchronized( this ) {
            switch (state) {
                case 0: /* not connected - just return */
                    return;
                case 3: /* connected - go ahead and disconnect */
                    if (response_map.size() != 0 && !hard) {
                        break; /* outstanding requests */
                    }
                    doDisconnect( hard );
                    /* fall through */
                case 4: /* in error - reset the transport */
                    thread = null;
                    state = 0;
                    break;
                default:
                    throw new TransportException( "Invalid state: " + state );
            }
        }
    }
}

First tests look good, but the tests also looked good for another
attempt last week (basically the same approach, but touching
SmbSession#sessionSetup() instead of SmbSession#send()), and that one
failed over the weekend. I'll let you know if it finally works.

Another point that I just noticed today: during the tests I tried to
avoid the deadlock by setting ssnLimit to 1, as this should create a
single Transport per Session. The test runs 150 concurrent threads (to
reproduce the issue quickly) accessing a single Windows share; each
thread accesses a single file and executes a loop that constantly calls
SmbFile#getFreeSpace(), deletes the file if it is present, recreates it
(32k), checks that it exists, reads it again, and deletes it again. The
operations were chosen based on the operations that were executing when
it deadlocked before. The reason I mention this is that I saw the
library open a lot of sockets without ever reusing them - it looked like
they were all just timing out. It ended up with more than 2500 open
sockets. Could it be that the library does not reuse connections in this
special case, while it does when ssnLimit is, for example, left unset?
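
For illustration, a test loop of that shape could look roughly like the
following (a minimal sketch only: the share URL and credentials are
placeholders, the free-space call is assumed to be
SmbFile.getDiskFreeSpace() from the jCIFS 1.x API, and this is not the
exact code used in the test):

import jcifs.smb.NtlmPasswordAuthentication;
import jcifs.smb.SmbFile;
import jcifs.smb.SmbFileInputStream;
import jcifs.smb.SmbFileOutputStream;

public class SmbStressTest implements Runnable {

    /* placeholder share and credentials */
    private static final String BASE_URL = "smb://server/share/stress-";
    private static final NtlmPasswordAuthentication AUTH =
            new NtlmPasswordAuthentication( "DOMAIN", "user", "secret" );

    private final int id;

    SmbStressTest( int id ) { this.id = id; }

    public void run() {
        byte[] payload = new byte[32 * 1024];           /* 32k test file  */
        byte[] buf = new byte[8192];
        try {
            SmbFile file = new SmbFile( BASE_URL + id + ".dat", AUTH );
            while( true ) {                             /* runs until killed */
                file.getDiskFreeSpace();                /* free space       */
                if( file.exists() ) {                   /* delete if there  */
                    file.delete();
                }
                SmbFileOutputStream out = new SmbFileOutputStream( file );
                out.write( payload );                   /* recreate (32k)   */
                out.close();
                file.exists();                          /* check existence  */
                SmbFileInputStream in = new SmbFileInputStream( file );
                while( in.read( buf ) != -1 );          /* read it back     */
                in.close();
                file.delete();                          /* delete again     */
            }
        } catch( Exception e ) {
            e.printStackTrace();
        }
    }

    public static void main( String[] args ) {
        for( int i = 0; i < 150; i++ ) {                /* 150 threads */
            new Thread( new SmbStressTest( i ) ).start();
        }
    }
}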

Best regards,
Ronny


Re: Re: jCIFS deadlock issue - when can we expect a fix?

Michael B Allen-4
On Mon, 20 Feb 2006 18:56:47 +0100
Ronny Schuetz <[hidden email]> wrote:

> The current solution is basically an additional synchronization in
> SmbSession#send() and Transport#disconnect() on an additional object per
> Transport instance (_oSetupDisconnectMutex):

That might work. If I remember correctly, the problem is that the
transport thread grabs the transport lock and then tries to get another
lock (I think it's SmbTransport.response_map). But before it can do so,
the calling thread gets the response_map lock and then tries to get the
transport lock. So now you have:

Thread-T has Lock-T trying to get Lock-M
Thread-C has Lock-M trying to get Lock-T

So what you're doing is introducing a third lock, Lock-X. Thread-T gets
Lock-X and then Lock-T. If a context switch occurs there, Thread-C will
try and fail to get Lock-X, allowing Thread-T to get Lock-M and finish.
Then Thread-C can get Lock-X, Lock-M, and Lock-T, do its thing, and
complete.

It could kill concurrency, and it's a little impure because you're
basically using a lock to protect locks, but I think it will work. If it
doesn't, post a thread dump.
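
For illustration, here is a minimal, standalone sketch of that ordering
fix (the lock names mirror the Lock-T/Lock-M/Lock-X placeholders above;
this is not actual jCIFS code):

public class OuterLockSketch {

    private final Object lockT = new Object(); /* e.g. the transport    */
    private final Object lockM = new Object(); /* e.g. the response_map */
    private final Object lockX = new Object(); /* the new outer lock    */

    /* Without lockX the two methods below take lockT and lockM in
       opposite order and can deadlock. Because both take lockX first,
       the two critical sections can never overlap, so the circular
       wait can no longer form - at the cost of serializing them. */

    void transportThreadWork() {        /* "Thread-T" */
        synchronized( lockX ) {
            synchronized( lockT ) {
                synchronized( lockM ) {
                    /* ... */
                }
            }
        }
    }

    void callingThreadWork() {          /* "Thread-C" */
        synchronized( lockX ) {
            synchronized( lockM ) {
                synchronized( lockT ) {
                    /* ... */
                }
            }
        }
    }
}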

> Another point that I just noticed today: During the tests I tried to
> avoid the deadlock by just setting the ssnLimit to 1, as this should
> create a single Transport per Session. The test is running 150
> concurrent threads (to quickly reproduce the issue) accessing a single
> Windows share; each thread accesses a single file and executes a loop
> that constantly calls SmbFile#getFreeSpace(), deletes the file if it is
> available, recreates it (32k), checks if it exists, reads it again, and
> deletes it again. The operations are chosen on the operations executed
> when it deadlocked before. However, the reason why I write that is that
> I saw that the library opened a lot of sockets and did not even reuse
> them - looked like they were all timing out. It ended up in more than
> 2500 open sockets. Could it be that the library does not reuse the
> connections in this special case while it does when ssnLimit is for
> example unset?

This is a combination of two things. One, JCIFS doesn't explicitly close
sockets. They are closed after the soTimeout has expired. So if you set
ssnLimit to 1 you will create a socket for each session and you will
rapidly build up sockets. Generally you really want to avoid setting
ssnLimit to 1 as it really destroys scalability. Two, the last TCP socket
state is TIME_WAIT. You can see socket states using netstat -ta. If you
see TIME_WAIT that means the sockets were closed but the kernel keeps the
data structure around for a while for reasons I don't fully understand
(supposedly it's to wait for the final ACK, but I don't understand why
the kernel would care or why it would dispose of the socket after it got
it). What you want to do is put a long sleep at the end of the program
and run netstat -ta repeatedly to see if the sockets finally go away. If
they're still there after 30 minutes, then there might be a problem.
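
For reference, a minimal sketch of how the two settings mentioned above
are typically configured in jCIFS 1.x (the property names are the
standard jCIFS client properties; the values below are examples only,
and they have to be set before the rest of the library is first used):

import jcifs.Config;

public class JcifsSetup {
    public static void main( String[] args ) throws Exception {
        /* max sessions multiplexed over one transport/socket */
        Config.setProperty( "jcifs.smb.client.ssnLimit", "250" );
        /* idle milliseconds before jCIFS lets a socket time out */
        Config.setProperty( "jcifs.smb.client.soTimeout", "15000" );
        /* ... application code using jcifs.smb.SmbFile follows ... */
    }
}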

Mike

Re: jCIFS deadlock issue - when can we expect a fix?

Ronny Schuetz
Michael B Allen wrote:

Hi Mike,

Thanks for the reply.

> That might work. If I remember correctly, the problem is that the
> transport thread grabs the transport lock and then tries to get another
> lock (I think it's SmbTransport.response_map). But before it can do so,
> the calling thread gets the response_map lock and then tries to get the
> transport lock. So now you have:

Right.

> Thread-T has Lock-T trying to get Lock-M
> Thread-C has Lock-M trying to get Lock-T
>
> So what you're doing is introducing a third lock, Lock-X. Thread-T gets
> Lock-X and then Lock-T. If a context switch occurs there, Thread-C will
> try and fail to get Lock-X, allowing Thread-T to get Lock-M and finish.
> Then Thread-C can get Lock-X, Lock-M, and Lock-T, do its thing, and
> complete.
>
> It could kill concurrency, and it's a little impure because you're
> basically using a lock to protect locks, but I think it will work. If it
> doesn't, post a thread dump.

That was the idea. Thanks for the elaboration. It's better for me to
have a system that is a bit slower than one that deadlocks ;)
We'll see what happens tonight during the stability test.

> This is a combination of two things. One, JCIFS doesn't explicitly close
> sockets. They are closed after the soTimeout has expired. So if you set
> ssnLimit to 1 you will create a socket for each session and you will
> rapidly build up sockets. Generally you really want to avoid setting
> ssnLimit to 1 as it really destroys scalability. Two, the last TCP socket
> state is TIME_WAIT. You can see socket states using netstat -ta. If you
> see TIME_WAIT that means the sockets were closed but the kernel keeps the
> data structure around for a while for reasons I don't fully understand
> (supposedly it's to wait for the final ACK, but I don't understand why
> the kernel would care or why it would dispose of the socket after it got
> it). What you want to do is put a long sleep at the end of the program
> and run netstat -ta repeatedly to see if the sockets finally go away. If
> they're still there after 30 minutes, then there might be a problem.

Yes, they're going away after a few seconds; they only build up that
quickly due to the nature of the test. I was just wondering why they
are not reused, but I think I just did not completely consider how the
session concept works in jCIFS. So just for my clarification: if I set,
let's say, ssnLimit to 100 but ran 75 threads connecting to the same
Windows share, performing the operations listed above without any
pause, could it still happen that multiple sockets are opened?

Thanks & Best regards,
Ronny



Re: Re: jCIFS deadlock issue - when can we expect a fix?

Michael B Allen-4
On Mon, 20 Feb 2006 20:44:01 +0100
Ronny Schuetz <[hidden email]> wrote:

> session concept works in jCIFS. So just for my clarification: if I set,
> let's say, ssnLimit to 100 but ran 75 threads connecting to the same
> Windows share, performing the operations listed above without any
> pause, could it still happen that multiple sockets are opened?

No. I don't think so. All 75 threads should use the same socket to
multiplex requests (unless the server periodically closes sockets). If
you set ssnLimit to 25 then you should see ~3 sockets.

Mike

Re: jCIFS deadlock issue - when can we expect a fix?

Ronny Schuetz
Hi Mike,

> No. I don't think so. All 75 threads should use the same socket to
> multiplex requests (unless the server periodically closes sockets). If
> you set ssnLimit to 25 then you should see ~3 sockets.

Thanks for the explanation. That's helpful to know.

After some additional stability tests with 200 threads accessing a
single Windows share and another stability test of our application, it
looks like the deadlock issue is fixed now. In addition to the places
already communicated, I had to synchronize SmbTree#treeConnect() in the
same way:

void treeConnect(ServerMessageBlock andx,
                 ServerMessageBlock andxResponse ) throws SmbException
{
 synchronized(session.transport()._oSetupDisconnectMutex)
 {
  // ... add original message body here ...
 }
}

Maybe you can add this change (or a better one) to the next jCIFS
version. The performance does not seem to be impacted heavily.

Best regards,
Ronny


Re: Re: jCIFS deadlock issue - when can we expect a fix?

Michael B Allen-4
Yup. I'll add it to The List.

Thanks,
Mike

On Mon, 27 Feb 2006 12:14:25 +0100
Ronny Schuetz <[hidden email]> wrote:

> Hi Mike,
>
> > No. I don't think so. All 75 threads should use the same socket to
> > multiplex requests (unless the server periodically closes sockets). If
> > you set ssnLimit to 25 then you should see ~3 sockets.
>
> Thanks for the explanation. That's helpful to know.
>
> After some additional stability tests with 200 threads accessing a
> single Windows share and another stability test of our application, it
> looks like the deadlock issue is fixed now. In addition to the places
> already communicated, I had to synchronize SmbTree#treeConnect() in the
> same way:
>
> void treeConnect(ServerMessageBlock andx,
>                  ServerMessageBlock andxResponse ) throws SmbException
> {
>  synchronized(session.transport()._oSetupDisconnectMutex)
>  {
>   // ... add original message body here ...
>  }
> }
>
> Maybe you can add this change (or a better one) to the next jCIFS
> version. The performance does not seem to be impacted heavily.
>
> Best regards,
> Ronny
>

Re: Re: jCIFS deadlock issue - when can we expect a fix?

Michael B Allen-4
How is this working for you, Ronny?

On Mon, 27 Feb 2006 11:15:35 -0500
Michael B Allen <[hidden email]> wrote:

> Yup. I'll add it to The List.
>
> Thanks,
> Mike
>
> On Mon, 27 Feb 2006 12:14:25 +0100
> Ronny Schuetz <[hidden email]> wrote:
>
> > Hi Mike,
> >
> > > No. I don't think so. All 75 threads should use the same socket to
> > > multiplex requests (unless the server periodically closes sockets). If
> > > you set ssnLimit to 25 then you should see ~3 sockets.
> >
> > Thanks for the explanation. That's helpful to know.
> >
> > After some additional stability tests with 200 threads accessing a
> > single Windows share and another stability test of our application, it
> > looks like the deadlock issue is fixed now. In addition to the places
> > already communicated, I had to synchronize SmbTree#treeConnect() in the
> > same way:
> >
> > void treeConnect(ServerMessageBlock andx,
> >                  ServerMessageBlock andxResponse ) throws SmbException
> > {
> >  synchronized(session.transport()._oSetupDisconnectMutex)
> >  {
> >   // ... add original message body here ...
> >  }
> > }
> >
> > Maybe you can add this change (or a better one) to the next jCIFS
> > version. The performance does not seem to be impacted heavily.
> >
> > Best regards,
> > Ronny
> >
>

Re: jCIFS deadlock issue - when can we expect a fix?

Ronny Schuetz-2
Hi Michael,

> How is this working for you Ronny?

It is in production now, but not yet used, as the system is
project-based and no project is currently implementing CIFS connectivity. I
don't know when the first project is going to use it.

But as the load tests looked good, I don't really expect major issues.

Best regards,
Ronny


Re: Re: jCIFS deadlock issue - when can we expect a fix?

Michael B Allen-4
In reply to this post by Michael B Allen-4
Applied in 1.2.8 to be released RSN.

On Mon, 27 Feb 2006 11:15:35 -0500
Michael B Allen <[hidden email]> wrote:

> Yup. I'll add it to The List.
>
> Thanks,
> Mike
>
> On Mon, 27 Feb 2006 12:14:25 +0100
> Ronny Schuetz <[hidden email]> wrote:
>
> > Hi Mike,
> >
> > > No. I don't think so. All 75 threads should use the same socket to
> > > multiplex requests (unless the server periodically closes sockets). If
> > > you set ssnLimit to 25 then you should see ~3 sockets.
> >
> > Thanks for the explanation. That's helpful to know.
> >
> > After some additional stability tests with 200 threads accessing a
> > single Windows share and another stability test of our application, it
> > looks like the deadlock issue is fixed now. In addition to the places
> > already communicated, I had to synchronize SmbTree#treeConnect() in the
> > same way:
> >
> > void treeConnect(ServerMessageBlock andx,
> >                  ServerMessageBlock andxResponse ) throws SmbException
> > {
> >  synchronized(session.transport()._oSetupDisconnectMutex)
> >  {
> >   // ... add original message body here ...
> >  }
> > }
> >
> > Maybe you can add this change (or a better one) to the next jCIFS
> > version. The performance does not seem to be impacted heavily.
> >
> > Best regards,
> > Ronny
> >
>

Re: jCIFS deadlock issue - when can we expect a fix?

Ronny Schuetz-2
Hi,

> Applied in 1.2.8 to be released RSN.

just saw it. Thx.

Ronny