Winbind using 100% CPU

classic Classic list List threaded Threaded
16 messages Options
Reply | Threaded
Open this post in threaded view
|

Winbind using 100% CPU

Dylan Klomparens
(Re-posting on this email list per Jeremy Allison's request.)

I am trying to figure out why winbind is using 100% CPU on my file server.
I am using Samba version 4.0.4. Everything is fine for a few minutes when I
start winbind, however after a while it begins using 100% CPU. I haven't
been able to narrow down what triggers this CPU usage spike, but I did
attach the GNU debugger to find out what's going on in the process. The
backtrace revealed this information:

#0  0x000000000041cf30 in _talloc_free@plt ()
#1  0x0000000000452320 in winbindd_reinit_after_fork ()
#2  0x00000000004524e6 in fork_domain_child ()
#3  0x0000000000453585 in wb_child_request_trigger ()
#4  0x000000381d2048e2 in tevent_common_loop_immediate () from
/lib64/libtevent.so.0
#5  0x00007fbed6b98e17 in run_events_poll () from /lib64/libsmbconf.so.0
#6  0x00007fbed6b9922e in s3_event_loop_once () from /lib64/libsmbconf.so.0
#7  0x000000381d204060 in _tevent_loop_once () from /lib64/libtevent.so.0
#8  0x000000000042049a in main ()

Apparently it's stuck in the winbindd_reinit_after_fork (and more
specifically the _talloc_free function). This code resides in
$SOURCE_HOME\source3\winbindd\winbindd_dual.c.

Perhaps I have configured Samba incorrectly? Here are the parameters I am
using that have to do with winbind:
idmap config * : backend = nss
idmap config * : range = 1000 - 300000

What are some reasons that winbind is using 100% CPU and how can I resolve
this?
Reply | Threaded
Open this post in threaded view
|

Re: Winbind using 100% CPU

Andreas Schneider-15
On Thursday 11 April 2013 09:58:33 Dylan Klomparens wrote:

> (Re-posting on this email list per Jeremy Allison's request.)
>
> I am trying to figure out why winbind is using 100% CPU on my file server.
> I am using Samba version 4.0.4. Everything is fine for a few minutes when I
> start winbind, however after a while it begins using 100% CPU. I haven't
> been able to narrow down what triggers this CPU usage spike, but I did
> attach the GNU debugger to find out what's going on in the process. The
> backtrace revealed this information:
>
> #0  0x000000000041cf30 in _talloc_free@plt ()
> #1  0x0000000000452320 in winbindd_reinit_after_fork ()
> #2  0x00000000004524e6 in fork_domain_child ()
> #3  0x0000000000453585 in wb_child_request_trigger ()
> #4  0x000000381d2048e2 in tevent_common_loop_immediate () from
> /lib64/libtevent.so.0
> #5  0x00007fbed6b98e17 in run_events_poll () from /lib64/libsmbconf.so.0
> #6  0x00007fbed6b9922e in s3_event_loop_once () from /lib64/libsmbconf.so.0
> #7  0x000000381d204060 in _tevent_loop_once () from /lib64/libtevent.so.0
> #8  0x000000000042049a in main ()

If you get to this backtrace again, please call

bt full

and also do:

p talloc_report_full(NULL, fopen("/tmp/talloc_report.log","w"))

attach the log.


Jim, does this look familiar? :)


        -- andreas


--
Andreas Schneider                   GPG-ID: F33E3FC6
Samba Team                             [hidden email]
www.samba.org

Reply | Threaded
Open this post in threaded view
|

Re: Winbind using 100% CPU

Andreas Schneider-15
On Thursday 11 April 2013 16:17:51 Andreas Schneider wrote:

> On Thursday 11 April 2013 09:58:33 Dylan Klomparens wrote:
> > (Re-posting on this email list per Jeremy Allison's request.)
> >
> > I am trying to figure out why winbind is using 100% CPU on my file server.
> > I am using Samba version 4.0.4. Everything is fine for a few minutes when
> > I
> > start winbind, however after a while it begins using 100% CPU. I haven't
> > been able to narrow down what triggers this CPU usage spike, but I did
> > attach the GNU debugger to find out what's going on in the process. The
> > backtrace revealed this information:
> >
> > #0  0x000000000041cf30 in _talloc_free@plt ()
> > #1  0x0000000000452320 in winbindd_reinit_after_fork ()
> > #2  0x00000000004524e6 in fork_domain_child ()
> > #3  0x0000000000453585 in wb_child_request_trigger ()
> > #4  0x000000381d2048e2 in tevent_common_loop_immediate () from
> > /lib64/libtevent.so.0
> > #5  0x00007fbed6b98e17 in run_events_poll () from /lib64/libsmbconf.so.0
> > #6  0x00007fbed6b9922e in s3_event_loop_once () from
> > /lib64/libsmbconf.so.0
> > #7  0x000000381d204060 in _tevent_loop_once () from /lib64/libtevent.so.0
> > #8  0x000000000042049a in main ()
>
> If you get to this backtrace again, please call
>
> bt full
>
> and also do:
>
> p talloc_report_full(NULL, fopen("/tmp/talloc_report.log","w"))
>
> attach the log.

Oh, are you able to run it with valgrind?

valgrind --tool=memcheck -v --num-callers=20 --trace-children=yes
/path/to/winbindd


--
Andreas Schneider                   GPG-ID: F33E3FC6
Samba Team                             [hidden email]
www.samba.org

Reply | Threaded
Open this post in threaded view
|

Re: Winbind using 100% CPU

Dylan Klomparens
Here are two valgrind reports, one with leak-check=summary (the default),
and the other with leak-check=full (very verbose). The exact command line
is at the top of each file. The Samba executable I am running is from a
Fedora package so debug symbols are not included in the executable. Thus, I
couldn't run the desired gdb commands. Is this the first time this bug has
been encountered? I can try and download/compile the source and attempt to
re-capture the problem with more information if that is necessary.

-- Dylan


On Thu, Apr 11, 2013 at 10:31 AM, Andreas Schneider <[hidden email]> wrote:

> On Thursday 11 April 2013 16:17:51 Andreas Schneider wrote:
> > On Thursday 11 April 2013 09:58:33 Dylan Klomparens wrote:
> > > (Re-posting on this email list per Jeremy Allison's request.)
> > >
> > > I am trying to figure out why winbind is using 100% CPU on my file
> server.
> > > I am using Samba version 4.0.4. Everything is fine for a few minutes
> when
> > > I
> > > start winbind, however after a while it begins using 100% CPU. I
> haven't
> > > been able to narrow down what triggers this CPU usage spike, but I did
> > > attach the GNU debugger to find out what's going on in the process. The
> > > backtrace revealed this information:
> > >
> > > #0  0x000000000041cf30 in _talloc_free@plt ()
> > > #1  0x0000000000452320 in winbindd_reinit_after_fork ()
> > > #2  0x00000000004524e6 in fork_domain_child ()
> > > #3  0x0000000000453585 in wb_child_request_trigger ()
> > > #4  0x000000381d2048e2 in tevent_common_loop_immediate () from
> > > /lib64/libtevent.so.0
> > > #5  0x00007fbed6b98e17 in run_events_poll () from
> /lib64/libsmbconf.so.0
> > > #6  0x00007fbed6b9922e in s3_event_loop_once () from
> > > /lib64/libsmbconf.so.0
> > > #7  0x000000381d204060 in _tevent_loop_once () from
> /lib64/libtevent.so.0
> > > #8  0x000000000042049a in main ()
> >
> > If you get to this backtrace again, please call
> >
> > bt full
> >
> > and also do:
> >
> > p talloc_report_full(NULL, fopen("/tmp/talloc_report.log","w"))
> >
> > attach the log.
>
> Oh, are you able to run it with valgrind?
>
> valgrind --tool=memcheck -v --num-callers=20 --trace-children=yes
> /path/to/winbindd
>
>
> --
> Andreas Schneider                   GPG-ID: F33E3FC6
> Samba Team                             [hidden email]
> www.samba.org
>
>

Valgrind.zip (519K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Winbind using 100% CPU

Jeremy Allison
In reply to this post by Dylan Klomparens
On Thu, Apr 11, 2013 at 09:58:33AM -0400, Dylan Klomparens wrote:

> (Re-posting on this email list per Jeremy Allison's request.)
>
> I am trying to figure out why winbind is using 100% CPU on my file server.
> I am using Samba version 4.0.4. Everything is fine for a few minutes when I
> start winbind, however after a while it begins using 100% CPU. I haven't
> been able to narrow down what triggers this CPU usage spike, but I did
> attach the GNU debugger to find out what's going on in the process. The
> backtrace revealed this information:
>
> #0  0x000000000041cf30 in _talloc_free@plt ()
> #1  0x0000000000452320 in winbindd_reinit_after_fork ()
> #2  0x00000000004524e6 in fork_domain_child ()
> #3  0x0000000000453585 in wb_child_request_trigger ()
> #4  0x000000381d2048e2 in tevent_common_loop_immediate () from
> /lib64/libtevent.so.0
> #5  0x00007fbed6b98e17 in run_events_poll () from /lib64/libsmbconf.so.0
> #6  0x00007fbed6b9922e in s3_event_loop_once () from /lib64/libsmbconf.so.0
> #7  0x000000381d204060 in _tevent_loop_once () from /lib64/libtevent.so.0
> #8  0x000000000042049a in main ()
>
> Apparently it's stuck in the winbindd_reinit_after_fork (and more
> specifically the _talloc_free function). This code resides in
> $SOURCE_HOME\source3\winbindd\winbindd_dual.c.

That looks like corrupted memory - probably a loop
in the talloc tree.

Can you run under valgrind and try and reproduce ?

Jeremy.
Reply | Threaded
Open this post in threaded view
|

Re: Winbind using 100% CPU

Andreas Schneider-15
In reply to this post by Dylan Klomparens
On Thursday 11 April 2013 15:06:45 Dylan Klomparens wrote:
> Here are two valgrind reports, one with leak-check=summary (the default),
> and the other with leak-check=full (very verbose).

Hi Dylan,

Thanks for the logs, I don't see anything really problematic just

 ==1646== Syscall param writev(vector[...]) points to uninitialised byte(s)

which is probably from dcerpc_add_auth_footer() but this is not the codepath
from the backtrace. Did you run into the 100% problem with running under
valgrind?

> The exact command line
> is at the top of each file. The Samba executable I am running is from a
> Fedora package so debug symbols are not included in the executable. Thus, I
> couldn't run the desired gdb commands.

Please install the debuginfo package so we get the symbols. You can do this
with:

debuginfo-install samba

> Is this the first time this bug has been encountered?

More or less. Did winbind hit the 100% problem while running with valgrind?

> I can try and download/compile the source and attempt to
> re-capture the problem with more information if that is necessary.

debuginfo-install samba

Then please get us a full backtrace and the talloc report. After that please
run again with valgrind.


Thanks,


        -- andreas



> -- Dylan
>
> On Thu, Apr 11, 2013 at 10:31 AM, Andreas Schneider <[hidden email]> wrote:
> > On Thursday 11 April 2013 16:17:51 Andreas Schneider wrote:
> > > On Thursday 11 April 2013 09:58:33 Dylan Klomparens wrote:
> > > > (Re-posting on this email list per Jeremy Allison's request.)
> > > >
> > > > I am trying to figure out why winbind is using 100% CPU on my file
> >
> > server.
> >
> > > > I am using Samba version 4.0.4. Everything is fine for a few minutes
> >
> > when
> >
> > > > I
> > > > start winbind, however after a while it begins using 100% CPU. I
> >
> > haven't
> >
> > > > been able to narrow down what triggers this CPU usage spike, but I did
> > > > attach the GNU debugger to find out what's going on in the process.
> > > > The
> > > > backtrace revealed this information:
> > > >
> > > > #0  0x000000000041cf30 in _talloc_free@plt ()
> > > > #1  0x0000000000452320 in winbindd_reinit_after_fork ()
> > > > #2  0x00000000004524e6 in fork_domain_child ()
> > > > #3  0x0000000000453585 in wb_child_request_trigger ()
> > > > #4  0x000000381d2048e2 in tevent_common_loop_immediate () from
> > > > /lib64/libtevent.so.0
> > > > #5  0x00007fbed6b98e17 in run_events_poll () from
> >
> > /lib64/libsmbconf.so.0
> >
> > > > #6  0x00007fbed6b9922e in s3_event_loop_once () from
> > > > /lib64/libsmbconf.so.0
> > > > #7  0x000000381d204060 in _tevent_loop_once () from
> >
> > /lib64/libtevent.so.0
> >
> > > > #8  0x000000000042049a in main ()
> > >
> > > If you get to this backtrace again, please call
> > >
> > > bt full
> > >
> > > and also do:
> > >
> > > p talloc_report_full(NULL, fopen("/tmp/talloc_report.log","w"))
> > >
> > > attach the log.
> >
> > Oh, are you able to run it with valgrind?
> >
> > valgrind --tool=memcheck -v --num-callers=20 --trace-children=yes
> > /path/to/winbindd
> >
> >
> > --
> > Andreas Schneider                   GPG-ID: F33E3FC6
> > Samba Team                             [hidden email]
> > www.samba.org
--
Andreas Schneider                   GPG-ID: F33E3FC6
Samba Team                             [hidden email]
www.samba.org

Reply | Threaded
Open this post in threaded view
|

Re: Winbind using 100% CPU

Dylan Klomparens
I've installing the debug symbols and attached the requested information.


On Fri, Apr 12, 2013 at 4:52 AM, Andreas Schneider <[hidden email]> wrote:

> On Thursday 11 April 2013 15:06:45 Dylan Klomparens wrote:
> > Here are two valgrind reports, one with leak-check=summary (the default),
> > and the other with leak-check=full (very verbose).
>
> Hi Dylan,
>
> Thanks for the logs, I don't see anything really problematic just
>
>  ==1646== Syscall param writev(vector[...]) points to uninitialised byte(s)
>
> which is probably from dcerpc_add_auth_footer() but this is not the
> codepath
> from the backtrace. Did you run into the 100% problem with running under
> valgrind?
>
> > The exact command line
> > is at the top of each file. The Samba executable I am running is from a
> > Fedora package so debug symbols are not included in the executable.
> Thus, I
> > couldn't run the desired gdb commands.
>
> Please install the debuginfo package so we get the symbols. You can do this
> with:
>
> debuginfo-install samba
>
> > Is this the first time this bug has been encountered?
>
> More or less. Did winbind hit the 100% problem while running with valgrind?
>
> > I can try and download/compile the source and attempt to
> > re-capture the problem with more information if that is necessary.
>
> debuginfo-install samba
>
> Then please get us a full backtrace and the talloc report. After that
> please
> run again with valgrind.
>
>
> Thanks,
>
>
>         -- andreas
>
>
>
> > -- Dylan
> >
> > On Thu, Apr 11, 2013 at 10:31 AM, Andreas Schneider <[hidden email]>
> wrote:
> > > On Thursday 11 April 2013 16:17:51 Andreas Schneider wrote:
> > > > On Thursday 11 April 2013 09:58:33 Dylan Klomparens wrote:
> > > > > (Re-posting on this email list per Jeremy Allison's request.)
> > > > >
> > > > > I am trying to figure out why winbind is using 100% CPU on my file
> > >
> > > server.
> > >
> > > > > I am using Samba version 4.0.4. Everything is fine for a few
> minutes
> > >
> > > when
> > >
> > > > > I
> > > > > start winbind, however after a while it begins using 100% CPU. I
> > >
> > > haven't
> > >
> > > > > been able to narrow down what triggers this CPU usage spike, but I
> did
> > > > > attach the GNU debugger to find out what's going on in the process.
> > > > > The
> > > > > backtrace revealed this information:
> > > > >
> > > > > #0  0x000000000041cf30 in _talloc_free@plt ()
> > > > > #1  0x0000000000452320 in winbindd_reinit_after_fork ()
> > > > > #2  0x00000000004524e6 in fork_domain_child ()
> > > > > #3  0x0000000000453585 in wb_child_request_trigger ()
> > > > > #4  0x000000381d2048e2 in tevent_common_loop_immediate () from
> > > > > /lib64/libtevent.so.0
> > > > > #5  0x00007fbed6b98e17 in run_events_poll () from
> > >
> > > /lib64/libsmbconf.so.0
> > >
> > > > > #6  0x00007fbed6b9922e in s3_event_loop_once () from
> > > > > /lib64/libsmbconf.so.0
> > > > > #7  0x000000381d204060 in _tevent_loop_once () from
> > >
> > > /lib64/libtevent.so.0
> > >
> > > > > #8  0x000000000042049a in main ()
> > > >
> > > > If you get to this backtrace again, please call
> > > >
> > > > bt full
> > > >
> > > > and also do:
> > > >
> > > > p talloc_report_full(NULL, fopen("/tmp/talloc_report.log","w"))
> > > >
> > > > attach the log.
> > >
> > > Oh, are you able to run it with valgrind?
> > >
> > > valgrind --tool=memcheck -v --num-callers=20 --trace-children=yes
> > > /path/to/winbindd
> > >
> > >
> > > --
> > > Andreas Schneider                   GPG-ID: F33E3FC6
> > > Samba Team                             [hidden email]
> > > www.samba.org
> --
> Andreas Schneider                   GPG-ID: F33E3FC6
> Samba Team                             [hidden email]
> www.samba.org
>
>

debug.zip (11K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Winbind using 100% CPU

Jim McDonough-2
In reply to this post by Jeremy Allison
On Thu, Apr 11, 2013 at 6:59 PM, Jeremy Allison <[hidden email]> wrote:

> On Thu, Apr 11, 2013 at 09:58:33AM -0400, Dylan Klomparens wrote:
>> (Re-posting on this email list per Jeremy Allison's request.)
>>
>> I am trying to figure out why winbind is using 100% CPU on my file server.
>> I am using Samba version 4.0.4. Everything is fine for a few minutes when I
>> start winbind, however after a while it begins using 100% CPU. I haven't
>> been able to narrow down what triggers this CPU usage spike, but I did
>> attach the GNU debugger to find out what's going on in the process. The
>> backtrace revealed this information:
>>
>> #0  0x000000000041cf30 in _talloc_free@plt ()
>> #1  0x0000000000452320 in winbindd_reinit_after_fork ()
>> #2  0x00000000004524e6 in fork_domain_child ()
>> #3  0x0000000000453585 in wb_child_request_trigger ()
>> #4  0x000000381d2048e2 in tevent_common_loop_immediate () from
>> /lib64/libtevent.so.0
>> #5  0x00007fbed6b98e17 in run_events_poll () from /lib64/libsmbconf.so.0
>> #6  0x00007fbed6b9922e in s3_event_loop_once () from /lib64/libsmbconf.so.0
>> #7  0x000000381d204060 in _tevent_loop_once () from /lib64/libtevent.so.0
>> #8  0x000000000042049a in main ()
>>
>> Apparently it's stuck in the winbindd_reinit_after_fork (and more
>> specifically the _talloc_free function). This code resides in
>> $SOURCE_HOME\source3\winbindd\winbindd_dual.c.
>
> That looks like corrupted memory - probably a loop
> in the talloc tree.
I've got a user who sees this and we're adding the same dlinklist
element twice, creating a loop in the winbind child list.

I've got a broken wrist so responses take a while, but that's my
current hint.  On 3.6.3 and 3.6.13.

>
> Can you run under valgrind and try and reproduce ?
>
> Jeremy.



--
Jim McDonough
Samba Team
SUSE labs
jmcd at samba dot org
jmcd at themcdonoughs dot org
Reply | Threaded
Open this post in threaded view
|

Re: Winbind using 100% CPU

Jim McDonough-2
On Tue, Apr 16, 2013 at 10:00 AM, Jim McDonough <[hidden email]> wrote:

> On Thu, Apr 11, 2013 at 6:59 PM, Jeremy Allison <[hidden email]> wrote:
>> On Thu, Apr 11, 2013 at 09:58:33AM -0400, Dylan Klomparens wrote:
>>> (Re-posting on this email list per Jeremy Allison's request.)
>>>
>>> I am trying to figure out why winbind is using 100% CPU on my file server.
>>> I am using Samba version 4.0.4. Everything is fine for a few minutes when I
>>> start winbind, however after a while it begins using 100% CPU. I haven't
>>> been able to narrow down what triggers this CPU usage spike, but I did
>>> attach the GNU debugger to find out what's going on in the process. The
>>> backtrace revealed this information:
>>>
>>> #0  0x000000000041cf30 in _talloc_free@plt ()
>>> #1  0x0000000000452320 in winbindd_reinit_after_fork ()
>>> #2  0x00000000004524e6 in fork_domain_child ()
>>> #3  0x0000000000453585 in wb_child_request_trigger ()
>>> #4  0x000000381d2048e2 in tevent_common_loop_immediate () from
>>> /lib64/libtevent.so.0
>>> #5  0x00007fbed6b98e17 in run_events_poll () from /lib64/libsmbconf.so.0
>>> #6  0x00007fbed6b9922e in s3_event_loop_once () from /lib64/libsmbconf.so.0
>>> #7  0x000000381d204060 in _tevent_loop_once () from /lib64/libtevent.so.0
>>> #8  0x000000000042049a in main ()
>>>
>>> Apparently it's stuck in the winbindd_reinit_after_fork (and more
>>> specifically the _talloc_free function). This code resides in
>>> $SOURCE_HOME\source3\winbindd\winbindd_dual.c.
>>
>> That looks like corrupted memory - probably a loop
>> in the talloc tree.
> I've got a user who sees this and we're adding the same dlinklist
> element twice, creating a loop in the winbind child list.
>
> I've got a broken wrist so responses take a while, but that's my
> current hint.  On 3.6.3 and 3.6.13.
>

I see this in the parent winbind log.  The last 3 entries are changes
I made to the dlist macros within winbind only.  you can see a 2
second delay and then a second add of the same item to the child
winbind list.  No entries in between (and a production system so there
is reluctance to increase the debug level).

[2013/03/14 09:08:57.795585,  3]
winbindd/winbindd_getpwnam.c:56(winbindd_getpwnam_send)
  getpwnam xx+xxxxxxxxx-112$
[2013/03/14 09:08:57.796185,  3]
winbindd/winbindd_getpwnam.c:56(winbindd_getpwnam_send)
  getpwnam xx+yyyyyyy
[2013/03/14 09:09:00.077660,  0]
winbindd/winbindd_dual.c:1399(fork_domain_child)
  adding 0x7ff91105a000 to list at 0x7ff911053510
[2013/03/14 09:09:02.435367,  0]
winbindd/winbindd_dual.c:1399(fork_domain_child)
  adding 0x7ff91105a000 to list at 0x7ff91105a000
[2013/03/14 09:09:02.435510,  0] lib/util.c:1117(smb_panic)
  PANIC (pid 28628): duplicate
--
Jim McDonough
Samba Team
SUSE labs
jmcd at samba dot org
jmcd at themcdonoughs dot org
Reply | Threaded
Open this post in threaded view
|

Re: Winbind using 100% CPU

Jeremy Allison
On Tue, Apr 16, 2013 at 01:46:14PM -0400, Jim McDonough wrote:

> On Tue, Apr 16, 2013 at 10:00 AM, Jim McDonough <[hidden email]> wrote:
> > On Thu, Apr 11, 2013 at 6:59 PM, Jeremy Allison <[hidden email]> wrote:
> >> On Thu, Apr 11, 2013 at 09:58:33AM -0400, Dylan Klomparens wrote:
> >>> (Re-posting on this email list per Jeremy Allison's request.)
> >>>
> >>> I am trying to figure out why winbind is using 100% CPU on my file server.
> >>> I am using Samba version 4.0.4. Everything is fine for a few minutes when I
> >>> start winbind, however after a while it begins using 100% CPU. I haven't
> >>> been able to narrow down what triggers this CPU usage spike, but I did
> >>> attach the GNU debugger to find out what's going on in the process. The
> >>> backtrace revealed this information:
> >>>
> >>> #0  0x000000000041cf30 in _talloc_free@plt ()
> >>> #1  0x0000000000452320 in winbindd_reinit_after_fork ()
> >>> #2  0x00000000004524e6 in fork_domain_child ()
> >>> #3  0x0000000000453585 in wb_child_request_trigger ()
> >>> #4  0x000000381d2048e2 in tevent_common_loop_immediate () from
> >>> /lib64/libtevent.so.0
> >>> #5  0x00007fbed6b98e17 in run_events_poll () from /lib64/libsmbconf.so.0
> >>> #6  0x00007fbed6b9922e in s3_event_loop_once () from /lib64/libsmbconf.so.0
> >>> #7  0x000000381d204060 in _tevent_loop_once () from /lib64/libtevent.so.0
> >>> #8  0x000000000042049a in main ()
> >>>
> >>> Apparently it's stuck in the winbindd_reinit_after_fork (and more
> >>> specifically the _talloc_free function). This code resides in
> >>> $SOURCE_HOME\source3\winbindd\winbindd_dual.c.
> >>
> >> That looks like corrupted memory - probably a loop
> >> in the talloc tree.
> > I've got a user who sees this and we're adding the same dlinklist
> > element twice, creating a loop in the winbind child list.
> >
> > I've got a broken wrist so responses take a while, but that's my
> > current hint.  On 3.6.3 and 3.6.13.
> >
>
> I see this in the parent winbind log.  The last 3 entries are changes
> I made to the dlist macros within winbind only.  you can see a 2
> second delay and then a second add of the same item to the child
> winbind list.  No entries in between (and a production system so there
> is reluctance to increase the debug level).
>
> [2013/03/14 09:08:57.795585,  3]
> winbindd/winbindd_getpwnam.c:56(winbindd_getpwnam_send)
>   getpwnam xx+xxxxxxxxx-112$
> [2013/03/14 09:08:57.796185,  3]
> winbindd/winbindd_getpwnam.c:56(winbindd_getpwnam_send)
>   getpwnam xx+yyyyyyy
> [2013/03/14 09:09:00.077660,  0]
> winbindd/winbindd_dual.c:1399(fork_domain_child)
>   adding 0x7ff91105a000 to list at 0x7ff911053510
> [2013/03/14 09:09:02.435367,  0]
> winbindd/winbindd_dual.c:1399(fork_domain_child)
>   adding 0x7ff91105a000 to list at 0x7ff91105a000
> [2013/03/14 09:09:02.435510,  0] lib/util.c:1117(smb_panic)
>   PANIC (pid 28628): duplicate

Jim, do your changes lead to a smb_panic on an
DLIST_ADD of a duplicate ? If so, can you post
a complete backtrace ?

Jeremy.
Reply | Threaded
Open this post in threaded view
|

Re: Winbind using 100% CPU

Jim McDonough-2
On Tue, Apr 16, 2013 at 2:45 PM, Jeremy Allison <[hidden email]> wrote:

> On Tue, Apr 16, 2013 at 01:46:14PM -0400, Jim McDonough wrote:
>> On Tue, Apr 16, 2013 at 10:00 AM, Jim McDonough <[hidden email]> wrote:
>> > On Thu, Apr 11, 2013 at 6:59 PM, Jeremy Allison <[hidden email]> wrote:
>> >> On Thu, Apr 11, 2013 at 09:58:33AM -0400, Dylan Klomparens wrote:
>> >>> (Re-posting on this email list per Jeremy Allison's request.)
>> >>>
>> >>> I am trying to figure out why winbind is using 100% CPU on my file server.
>> >>> I am using Samba version 4.0.4. Everything is fine for a few minutes when I
>> >>> start winbind, however after a while it begins using 100% CPU. I haven't
>> >>> been able to narrow down what triggers this CPU usage spike, but I did
>> >>> attach the GNU debugger to find out what's going on in the process. The
>> >>> backtrace revealed this information:
>> >>>
>> >>> #0  0x000000000041cf30 in _talloc_free@plt ()
>> >>> #1  0x0000000000452320 in winbindd_reinit_after_fork ()
>> >>> #2  0x00000000004524e6 in fork_domain_child ()
>> >>> #3  0x0000000000453585 in wb_child_request_trigger ()
>> >>> #4  0x000000381d2048e2 in tevent_common_loop_immediate () from
>> >>> /lib64/libtevent.so.0
>> >>> #5  0x00007fbed6b98e17 in run_events_poll () from /lib64/libsmbconf.so.0
>> >>> #6  0x00007fbed6b9922e in s3_event_loop_once () from /lib64/libsmbconf.so.0
>> >>> #7  0x000000381d204060 in _tevent_loop_once () from /lib64/libtevent.so.0
>> >>> #8  0x000000000042049a in main ()
>> >>>
>> >>> Apparently it's stuck in the winbindd_reinit_after_fork (and more
>> >>> specifically the _talloc_free function). This code resides in
>> >>> $SOURCE_HOME\source3\winbindd\winbindd_dual.c.
>> >>
>> >> That looks like corrupted memory - probably a loop
>> >> in the talloc tree.
>> > I've got a user who sees this and we're adding the same dlinklist
>> > element twice, creating a loop in the winbind child list.
>> >
>> > I've got a broken wrist so responses take a while, but that's my
>> > current hint.  On 3.6.3 and 3.6.13.
>> >
>>
>> I see this in the parent winbind log.  The last 3 entries are changes
>> I made to the dlist macros within winbind only.  you can see a 2
>> second delay and then a second add of the same item to the child
>> winbind list.  No entries in between (and a production system so there
>> is reluctance to increase the debug level).
>>
>> [2013/03/14 09:08:57.795585,  3]
>> winbindd/winbindd_getpwnam.c:56(winbindd_getpwnam_send)
>>   getpwnam xx+xxxxxxxxx-112$
>> [2013/03/14 09:08:57.796185,  3]
>> winbindd/winbindd_getpwnam.c:56(winbindd_getpwnam_send)
>>   getpwnam xx+yyyyyyy
>> [2013/03/14 09:09:00.077660,  0]
>> winbindd/winbindd_dual.c:1399(fork_domain_child)
>>   adding 0x7ff91105a000 to list at 0x7ff911053510
>> [2013/03/14 09:09:02.435367,  0]
>> winbindd/winbindd_dual.c:1399(fork_domain_child)
>>   adding 0x7ff91105a000 to list at 0x7ff91105a000
>> [2013/03/14 09:09:02.435510,  0] lib/util.c:1117(smb_panic)
>>   PANIC (pid 28628): duplicate
>
> Jim, do your changes lead to a smb_panic on an
> DLIST_ADD of a duplicate ? If so, can you post
> a complete backtrace ?
>
> Jeremy.

 [2013/03/14 09:09:02.437394,  0] lib/util.c:1221(log_stack_trace)
   BACKTRACE: 11 stack frames:
    #0 /usr/sbin/winbindd(log_stack_trace+0x1a) [0x7f02379883ba]
    #1 /usr/sbin/winbindd(smb_panic+0x2b) [0x7f023798848b]
    #2 /usr/sbin/winbindd(+0x10806e) [0x7f02378d206e]
    #3 /usr/sbin/winbindd(+0x1084b5) [0x7f02378d24b5]
    #4 /usr/sbin/winbindd(tevent_common_loop_immediate+0xea) [0x7f023799917a]
    #5 /usr/sbin/winbindd(run_events_poll+0x3b) [0x7f02379973db]
    #6 /usr/sbin/winbindd(+0x1cdb12) [0x7f0237997b12]
    #7 /usr/sbin/winbindd(_tevent_loop_once+0x90) [0x7f0237997f10]
    #8 /usr/sbin/winbindd(main+0x7db) [0x7f02378aa13b]
    #9 /lib64/libc.so.6(__libc_start_main+0xe6) [0x7f0234a2ec36]
    #10 /usr/sbin/winbindd(+0xdd839) [0x7f02378a7839]


Unfortunately their core file doesn't seem quite right...this was out
of their logs and I'm having trouble with the core, but this fscking
splint keeps me moving at a snail's pace


--
Jim McDonough
Samba Team
SUSE labs
jmcd at samba dot org
jmcd at themcdonoughs dot org
Reply | Threaded
Open this post in threaded view
|

Re: Winbind using 100% CPU

Andreas Schneider-15
On Tuesday 16 April 2013 15:53:35 Jim McDonough wrote:

> On Tue, Apr 16, 2013 at 2:45 PM, Jeremy Allison <[hidden email]> wrote:
> > On Tue, Apr 16, 2013 at 01:46:14PM -0400, Jim McDonough wrote:
> >> On Tue, Apr 16, 2013 at 10:00 AM, Jim McDonough <[hidden email]> wrote:
> >> > On Thu, Apr 11, 2013 at 6:59 PM, Jeremy Allison <[hidden email]> wrote:
> >> >> On Thu, Apr 11, 2013 at 09:58:33AM -0400, Dylan Klomparens wrote:
> >> >>> (Re-posting on this email list per Jeremy Allison's request.)
> >> >>>
> >> >>> I am trying to figure out why winbind is using 100% CPU on my file
> >> >>> server.
> >> >>> I am using Samba version 4.0.4. Everything is fine for a few minutes
> >> >>> when I
> >> >>> start winbind, however after a while it begins using 100% CPU. I
> >> >>> haven't
> >> >>> been able to narrow down what triggers this CPU usage spike, but I
> >> >>> did
> >> >>> attach the GNU debugger to find out what's going on in the process.
> >> >>> The
> >> >>> backtrace revealed this information:
> >> >>>
> >> >>> #0  0x000000000041cf30 in _talloc_free@plt ()
> >> >>> #1  0x0000000000452320 in winbindd_reinit_after_fork ()
> >> >>> #2  0x00000000004524e6 in fork_domain_child ()
> >> >>> #3  0x0000000000453585 in wb_child_request_trigger ()
> >> >>> #4  0x000000381d2048e2 in tevent_common_loop_immediate () from
> >> >>> /lib64/libtevent.so.0
> >> >>> #5  0x00007fbed6b98e17 in run_events_poll () from
> >> >>> /lib64/libsmbconf.so.0
> >> >>> #6  0x00007fbed6b9922e in s3_event_loop_once () from
> >> >>> /lib64/libsmbconf.so.0
> >> >>> #7  0x000000381d204060 in _tevent_loop_once () from
> >> >>> /lib64/libtevent.so.0
> >> >>> #8  0x000000000042049a in main ()
> >> >>>
> >> >>> Apparently it's stuck in the winbindd_reinit_after_fork (and more
> >> >>> specifically the _talloc_free function). This code resides in
> >> >>> $SOURCE_HOME\source3\winbindd\winbindd_dual.c.
> >> >>
> >> >> That looks like corrupted memory - probably a loop
> >> >> in the talloc tree.
> >> >
> >> > I've got a user who sees this and we're adding the same dlinklist
> >> > element twice, creating a loop in the winbind child list.
> >> >
> >> > I've got a broken wrist so responses take a while, but that's my
> >> > current hint.  On 3.6.3 and 3.6.13.
> >>
> >> I see this in the parent winbind log.  The last 3 entries are changes
> >> I made to the dlist macros within winbind only.  you can see a 2
> >> second delay and then a second add of the same item to the child
> >> winbind list.  No entries in between (and a production system so there
> >> is reluctance to increase the debug level).
> >>
> >> [2013/03/14 09:08:57.795585,  3]
> >> winbindd/winbindd_getpwnam.c:56(winbindd_getpwnam_send)
> >>
> >>   getpwnam xx+xxxxxxxxx-112$
> >>
> >> [2013/03/14 09:08:57.796185,  3]
> >> winbindd/winbindd_getpwnam.c:56(winbindd_getpwnam_send)
> >>
> >>   getpwnam xx+yyyyyyy
> >>
> >> [2013/03/14 09:09:00.077660,  0]
> >> winbindd/winbindd_dual.c:1399(fork_domain_child)
> >>
> >>   adding 0x7ff91105a000 to list at 0x7ff911053510
> >>
> >> [2013/03/14 09:09:02.435367,  0]
> >> winbindd/winbindd_dual.c:1399(fork_domain_child)
> >>
> >>   adding 0x7ff91105a000 to list at 0x7ff91105a000
> >>
> >> [2013/03/14 09:09:02.435510,  0] lib/util.c:1117(smb_panic)
> >>
> >>   PANIC (pid 28628): duplicate
> >
> > Jim, do your changes lead to a smb_panic on an
> > DLIST_ADD of a duplicate ? If so, can you post
> > a complete backtrace ?
> >
> > Jeremy.
>
>  [2013/03/14 09:09:02.437394,  0] lib/util.c:1221(log_stack_trace)
>    BACKTRACE: 11 stack frames:
>     #0 /usr/sbin/winbindd(log_stack_trace+0x1a) [0x7f02379883ba]
>     #1 /usr/sbin/winbindd(smb_panic+0x2b) [0x7f023798848b]
>     #2 /usr/sbin/winbindd(+0x10806e) [0x7f02378d206e]
>     #3 /usr/sbin/winbindd(+0x1084b5) [0x7f02378d24b5]
>     #4 /usr/sbin/winbindd(tevent_common_loop_immediate+0xea)
> [0x7f023799917a] #5 /usr/sbin/winbindd(run_events_poll+0x3b)
> [0x7f02379973db]
>     #6 /usr/sbin/winbindd(+0x1cdb12) [0x7f0237997b12]
>     #7 /usr/sbin/winbindd(_tevent_loop_once+0x90) [0x7f0237997f10]
>     #8 /usr/sbin/winbindd(main+0x7db) [0x7f02378aa13b]
>     #9 /lib64/libc.so.6(__libc_start_main+0xe6) [0x7f0234a2ec36]
>     #10 /usr/sbin/winbindd(+0xdd839) [0x7f02378a7839]
>
>
> Unfortunately their core file doesn't seem quite right...this was out
> of their logs and I'm having trouble with the core, but this fscking
> splint keeps me moving at a snail's pace

Jim, can you port this patch to master and ask Dylan to run with the patch so
we get a full backtrace?


        -- andreas

--
Andreas Schneider                   GPG-ID: F33E3FC6
Samba Team                             [hidden email]
www.samba.org

Reply | Threaded
Open this post in threaded view
|

Re: Winbind using 100% CPU

Jim McDonough-2
On Wed, Apr 17, 2013 at 2:02 AM, Andreas Schneider <[hidden email]> wrote:

> On Tuesday 16 April 2013 15:53:35 Jim McDonough wrote:
>> On Tue, Apr 16, 2013 at 2:45 PM, Jeremy Allison <[hidden email]> wrote:
>> > On Tue, Apr 16, 2013 at 01:46:14PM -0400, Jim McDonough wrote:
>> >> On Tue, Apr 16, 2013 at 10:00 AM, Jim McDonough <[hidden email]> wrote:
>> >> > On Thu, Apr 11, 2013 at 6:59 PM, Jeremy Allison <[hidden email]> wrote:
>> >> >> On Thu, Apr 11, 2013 at 09:58:33AM -0400, Dylan Klomparens wrote:
>> >> >>> (Re-posting on this email list per Jeremy Allison's request.)
>> >> >>>
>> >> >>> I am trying to figure out why winbind is using 100% CPU on my file
>> >> >>> server.
>> >> >>> I am using Samba version 4.0.4. Everything is fine for a few minutes
>> >> >>> when I
>> >> >>> start winbind, however after a while it begins using 100% CPU. I
>> >> >>> haven't
>> >> >>> been able to narrow down what triggers this CPU usage spike, but I
>> >> >>> did
>> >> >>> attach the GNU debugger to find out what's going on in the process.
>> >> >>> The
>> >> >>> backtrace revealed this information:
>> >> >>>
>> >> >>> #0  0x000000000041cf30 in _talloc_free@plt ()
>> >> >>> #1  0x0000000000452320 in winbindd_reinit_after_fork ()
>> >> >>> #2  0x00000000004524e6 in fork_domain_child ()
>> >> >>> #3  0x0000000000453585 in wb_child_request_trigger ()
>> >> >>> #4  0x000000381d2048e2 in tevent_common_loop_immediate () from
>> >> >>> /lib64/libtevent.so.0
>> >> >>> #5  0x00007fbed6b98e17 in run_events_poll () from
>> >> >>> /lib64/libsmbconf.so.0
>> >> >>> #6  0x00007fbed6b9922e in s3_event_loop_once () from
>> >> >>> /lib64/libsmbconf.so.0
>> >> >>> #7  0x000000381d204060 in _tevent_loop_once () from
>> >> >>> /lib64/libtevent.so.0
>> >> >>> #8  0x000000000042049a in main ()
>> >> >>>
>> >> >>> Apparently it's stuck in the winbindd_reinit_after_fork (and more
>> >> >>> specifically the _talloc_free function). This code resides in
>> >> >>> $SOURCE_HOME\source3\winbindd\winbindd_dual.c.
>> >> >>
>> >> >> That looks like corrupted memory - probably a loop
>> >> >> in the talloc tree.
>> >> >
>> >> > I've got a user who sees this and we're adding the same dlinklist
>> >> > element twice, creating a loop in the winbind child list.
>> >> >
>> >> > I've got a broken wrist so responses take a while, but that's my
>> >> > current hint.  On 3.6.3 and 3.6.13.
>> >>
>> >> I see this in the parent winbind log.  The last 3 entries are changes
>> >> I made to the dlist macros within winbind only.  you can see a 2
>> >> second delay and then a second add of the same item to the child
>> >> winbind list.  No entries in between (and a production system so there
>> >> is reluctance to increase the debug level).
>> >>
>> >> [2013/03/14 09:08:57.795585,  3]
>> >> winbindd/winbindd_getpwnam.c:56(winbindd_getpwnam_send)
>> >>
>> >>   getpwnam xx+xxxxxxxxx-112$
>> >>
>> >> [2013/03/14 09:08:57.796185,  3]
>> >> winbindd/winbindd_getpwnam.c:56(winbindd_getpwnam_send)
>> >>
>> >>   getpwnam xx+yyyyyyy
>> >>
>> >> [2013/03/14 09:09:00.077660,  0]
>> >> winbindd/winbindd_dual.c:1399(fork_domain_child)
>> >>
>> >>   adding 0x7ff91105a000 to list at 0x7ff911053510
>> >>
>> >> [2013/03/14 09:09:02.435367,  0]
>> >> winbindd/winbindd_dual.c:1399(fork_domain_child)
>> >>
>> >>   adding 0x7ff91105a000 to list at 0x7ff91105a000
>> >>
>> >> [2013/03/14 09:09:02.435510,  0] lib/util.c:1117(smb_panic)
>> >>
>> >>   PANIC (pid 28628): duplicate
>> >
>> > Jim, do your changes lead to a smb_panic on an
>> > DLIST_ADD of a duplicate ? If so, can you post
>> > a complete backtrace ?
>> >
>> > Jeremy.
>>
>>  [2013/03/14 09:09:02.437394,  0] lib/util.c:1221(log_stack_trace)
>>    BACKTRACE: 11 stack frames:
>>     #0 /usr/sbin/winbindd(log_stack_trace+0x1a) [0x7f02379883ba]
>>     #1 /usr/sbin/winbindd(smb_panic+0x2b) [0x7f023798848b]
>>     #2 /usr/sbin/winbindd(+0x10806e) [0x7f02378d206e]
>>     #3 /usr/sbin/winbindd(+0x1084b5) [0x7f02378d24b5]
>>     #4 /usr/sbin/winbindd(tevent_common_loop_immediate+0xea)
>> [0x7f023799917a] #5 /usr/sbin/winbindd(run_events_poll+0x3b)
>> [0x7f02379973db]
>>     #6 /usr/sbin/winbindd(+0x1cdb12) [0x7f0237997b12]
>>     #7 /usr/sbin/winbindd(_tevent_loop_once+0x90) [0x7f0237997f10]
>>     #8 /usr/sbin/winbindd(main+0x7db) [0x7f02378aa13b]
>>     #9 /lib64/libc.so.6(__libc_start_main+0xe6) [0x7f0234a2ec36]
>>     #10 /usr/sbin/winbindd(+0xdd839) [0x7f02378a7839]
>>
>>
>> Unfortunately their core file doesn't seem quite right...this was out
>> of their logs and I'm having trouble with the core, but this fscking
>> splint keeps me moving at a snail's pace
>
> Jim, can you port this patch to master and ask Dylan to run with the patch so
> we get a full backtrace?
Here it is on master...




--
Jim McDonough
Samba Team
SUSE labs
jmcd at samba dot org
jmcd at themcdonoughs dot org

0001-Panic-on-adding-a-duplicate-dlist-item-in-winbind.patch (3K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Winbind using 100% CPU

Dylan Klomparens
The patch has exposed the problem. I've attached the output from winbind.
Please let me know if you need additional information!

-- Dylan


On Wed, Apr 17, 2013 at 8:26 AM, Jim McDonough <[hidden email]> wrote:

> On Wed, Apr 17, 2013 at 2:02 AM, Andreas Schneider <[hidden email]> wrote:
> > On Tuesday 16 April 2013 15:53:35 Jim McDonough wrote:
> >> On Tue, Apr 16, 2013 at 2:45 PM, Jeremy Allison <[hidden email]> wrote:
> >> > On Tue, Apr 16, 2013 at 01:46:14PM -0400, Jim McDonough wrote:
> >> >> On Tue, Apr 16, 2013 at 10:00 AM, Jim McDonough <[hidden email]>
> wrote:
> >> >> > On Thu, Apr 11, 2013 at 6:59 PM, Jeremy Allison <[hidden email]>
> wrote:
> >> >> >> On Thu, Apr 11, 2013 at 09:58:33AM -0400, Dylan Klomparens wrote:
> >> >> >>> (Re-posting on this email list per Jeremy Allison's request.)
> >> >> >>>
> >> >> >>> I am trying to figure out why winbind is using 100% CPU on my
> file
> >> >> >>> server.
> >> >> >>> I am using Samba version 4.0.4. Everything is fine for a few
> minutes
> >> >> >>> when I
> >> >> >>> start winbind, however after a while it begins using 100% CPU. I
> >> >> >>> haven't
> >> >> >>> been able to narrow down what triggers this CPU usage spike, but
> I
> >> >> >>> did
> >> >> >>> attach the GNU debugger to find out what's going on in the
> process.
> >> >> >>> The
> >> >> >>> backtrace revealed this information:
> >> >> >>>
> >> >> >>> #0  0x000000000041cf30 in _talloc_free@plt ()
> >> >> >>> #1  0x0000000000452320 in winbindd_reinit_after_fork ()
> >> >> >>> #2  0x00000000004524e6 in fork_domain_child ()
> >> >> >>> #3  0x0000000000453585 in wb_child_request_trigger ()
> >> >> >>> #4  0x000000381d2048e2 in tevent_common_loop_immediate () from
> >> >> >>> /lib64/libtevent.so.0
> >> >> >>> #5  0x00007fbed6b98e17 in run_events_poll () from
> >> >> >>> /lib64/libsmbconf.so.0
> >> >> >>> #6  0x00007fbed6b9922e in s3_event_loop_once () from
> >> >> >>> /lib64/libsmbconf.so.0
> >> >> >>> #7  0x000000381d204060 in _tevent_loop_once () from
> >> >> >>> /lib64/libtevent.so.0
> >> >> >>> #8  0x000000000042049a in main ()
> >> >> >>>
> >> >> >>> Apparently it's stuck in the winbindd_reinit_after_fork (and more
> >> >> >>> specifically the _talloc_free function). This code resides in
> >> >> >>> $SOURCE_HOME\source3\winbindd\winbindd_dual.c.
> >> >> >>
> >> >> >> That looks like corrupted memory - probably a loop
> >> >> >> in the talloc tree.
> >> >> >
> >> >> > I've got a user who sees this and we're adding the same dlinklist
> >> >> > element twice, creating a loop in the winbind child list.
> >> >> >
> >> >> > I've got a broken wrist so responses take a while, but that's my
> >> >> > current hint.  On 3.6.3 and 3.6.13.
> >> >>
> >> >> I see this in the parent winbind log.  The last 3 entries are changes
> >> >> I made to the dlist macros within winbind only.  you can see a 2
> >> >> second delay and then a second add of the same item to the child
> >> >> winbind list.  No entries in between (and a production system so
> there
> >> >> is reluctance to increase the debug level).
> >> >>
> >> >> [2013/03/14 09:08:57.795585,  3]
> >> >> winbindd/winbindd_getpwnam.c:56(winbindd_getpwnam_send)
> >> >>
> >> >>   getpwnam xx+xxxxxxxxx-112$
> >> >>
> >> >> [2013/03/14 09:08:57.796185,  3]
> >> >> winbindd/winbindd_getpwnam.c:56(winbindd_getpwnam_send)
> >> >>
> >> >>   getpwnam xx+yyyyyyy
> >> >>
> >> >> [2013/03/14 09:09:00.077660,  0]
> >> >> winbindd/winbindd_dual.c:1399(fork_domain_child)
> >> >>
> >> >>   adding 0x7ff91105a000 to list at 0x7ff911053510
> >> >>
> >> >> [2013/03/14 09:09:02.435367,  0]
> >> >> winbindd/winbindd_dual.c:1399(fork_domain_child)
> >> >>
> >> >>   adding 0x7ff91105a000 to list at 0x7ff91105a000
> >> >>
> >> >> [2013/03/14 09:09:02.435510,  0] lib/util.c:1117(smb_panic)
> >> >>
> >> >>   PANIC (pid 28628): duplicate
> >> >
> >> > Jim, do your changes lead to a smb_panic on an
> >> > DLIST_ADD of a duplicate ? If so, can you post
> >> > a complete backtrace ?
> >> >
> >> > Jeremy.
> >>
> >>  [2013/03/14 09:09:02.437394,  0] lib/util.c:1221(log_stack_trace)
> >>    BACKTRACE: 11 stack frames:
> >>     #0 /usr/sbin/winbindd(log_stack_trace+0x1a) [0x7f02379883ba]
> >>     #1 /usr/sbin/winbindd(smb_panic+0x2b) [0x7f023798848b]
> >>     #2 /usr/sbin/winbindd(+0x10806e) [0x7f02378d206e]
> >>     #3 /usr/sbin/winbindd(+0x1084b5) [0x7f02378d24b5]
> >>     #4 /usr/sbin/winbindd(tevent_common_loop_immediate+0xea)
> >> [0x7f023799917a] #5 /usr/sbin/winbindd(run_events_poll+0x3b)
> >> [0x7f02379973db]
> >>     #6 /usr/sbin/winbindd(+0x1cdb12) [0x7f0237997b12]
> >>     #7 /usr/sbin/winbindd(_tevent_loop_once+0x90) [0x7f0237997f10]
> >>     #8 /usr/sbin/winbindd(main+0x7db) [0x7f02378aa13b]
> >>     #9 /lib64/libc.so.6(__libc_start_main+0xe6) [0x7f0234a2ec36]
> >>     #10 /usr/sbin/winbindd(+0xdd839) [0x7f02378a7839]
> >>
> >>
> >> Unfortunately their core file doesn't seem quite right...this was out
> >> of their logs and I'm having trouble with the core, but this fscking
> >> splint keeps me moving at a snail's pace
> >
> > Jim, can you port this patch to master and ask Dylan to run with the
> patch so
> > we get a full backtrace?
> Here it is on master...
>
>
>
>
> --
> Jim McDonough
> Samba Team
> SUSE labs
> jmcd at samba dot org
> jmcd at themcdonoughs dot org
>

duplicate.txt (1K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Winbind using 100% CPU

Jim McDonough-2
On Thu, Apr 18, 2013 at 12:16 PM, Dylan Klomparens
<[hidden email]> wrote:
> The patch has exposed the problem. I've attached the output from winbind.
> Please let me know if you need additional information!
Wow, that's quick.  Any other clues, like what was going on on the
system then?  what was causing winbind to do work?  How many trusted
domains?

>
> -- Dylan
>
>
> On Wed, Apr 17, 2013 at 8:26 AM, Jim McDonough <[hidden email]> wrote:
>>
>> On Wed, Apr 17, 2013 at 2:02 AM, Andreas Schneider <[hidden email]> wrote:
>> > On Tuesday 16 April 2013 15:53:35 Jim McDonough wrote:
>> >> On Tue, Apr 16, 2013 at 2:45 PM, Jeremy Allison <[hidden email]> wrote:
>> >> > On Tue, Apr 16, 2013 at 01:46:14PM -0400, Jim McDonough wrote:
>> >> >> On Tue, Apr 16, 2013 at 10:00 AM, Jim McDonough <[hidden email]>
>> >> >> wrote:
>> >> >> > On Thu, Apr 11, 2013 at 6:59 PM, Jeremy Allison <[hidden email]>
>> >> >> > wrote:
>> >> >> >> On Thu, Apr 11, 2013 at 09:58:33AM -0400, Dylan Klomparens wrote:
>> >> >> >>> (Re-posting on this email list per Jeremy Allison's request.)
>> >> >> >>>
>> >> >> >>> I am trying to figure out why winbind is using 100% CPU on my
>> >> >> >>> file
>> >> >> >>> server.
>> >> >> >>> I am using Samba version 4.0.4. Everything is fine for a few
>> >> >> >>> minutes
>> >> >> >>> when I
>> >> >> >>> start winbind, however after a while it begins using 100% CPU. I
>> >> >> >>> haven't
>> >> >> >>> been able to narrow down what triggers this CPU usage spike, but
>> >> >> >>> I
>> >> >> >>> did
>> >> >> >>> attach the GNU debugger to find out what's going on in the
>> >> >> >>> process.
>> >> >> >>> The
>> >> >> >>> backtrace revealed this information:
>> >> >> >>>
>> >> >> >>> #0  0x000000000041cf30 in _talloc_free@plt ()
>> >> >> >>> #1  0x0000000000452320 in winbindd_reinit_after_fork ()
>> >> >> >>> #2  0x00000000004524e6 in fork_domain_child ()
>> >> >> >>> #3  0x0000000000453585 in wb_child_request_trigger ()
>> >> >> >>> #4  0x000000381d2048e2 in tevent_common_loop_immediate () from
>> >> >> >>> /lib64/libtevent.so.0
>> >> >> >>> #5  0x00007fbed6b98e17 in run_events_poll () from
>> >> >> >>> /lib64/libsmbconf.so.0
>> >> >> >>> #6  0x00007fbed6b9922e in s3_event_loop_once () from
>> >> >> >>> /lib64/libsmbconf.so.0
>> >> >> >>> #7  0x000000381d204060 in _tevent_loop_once () from
>> >> >> >>> /lib64/libtevent.so.0
>> >> >> >>> #8  0x000000000042049a in main ()
>> >> >> >>>
>> >> >> >>> Apparently it's stuck in the winbindd_reinit_after_fork (and
>> >> >> >>> more
>> >> >> >>> specifically the _talloc_free function). This code resides in
>> >> >> >>> $SOURCE_HOME\source3\winbindd\winbindd_dual.c.
>> >> >> >>
>> >> >> >> That looks like corrupted memory - probably a loop
>> >> >> >> in the talloc tree.
>> >> >> >
>> >> >> > I've got a user who sees this and we're adding the same dlinklist
>> >> >> > element twice, creating a loop in the winbind child list.
>> >> >> >
>> >> >> > I've got a broken wrist so responses take a while, but that's my
>> >> >> > current hint.  On 3.6.3 and 3.6.13.
>> >> >>
>> >> >> I see this in the parent winbind log.  The last 3 entries are
>> >> >> changes
>> >> >> I made to the dlist macros within winbind only.  you can see a 2
>> >> >> second delay and then a second add of the same item to the child
>> >> >> winbind list.  No entries in between (and a production system so
>> >> >> there
>> >> >> is reluctance to increase the debug level).
>> >> >>
>> >> >> [2013/03/14 09:08:57.795585,  3]
>> >> >> winbindd/winbindd_getpwnam.c:56(winbindd_getpwnam_send)
>> >> >>
>> >> >>   getpwnam xx+xxxxxxxxx-112$
>> >> >>
>> >> >> [2013/03/14 09:08:57.796185,  3]
>> >> >> winbindd/winbindd_getpwnam.c:56(winbindd_getpwnam_send)
>> >> >>
>> >> >>   getpwnam xx+yyyyyyy
>> >> >>
>> >> >> [2013/03/14 09:09:00.077660,  0]
>> >> >> winbindd/winbindd_dual.c:1399(fork_domain_child)
>> >> >>
>> >> >>   adding 0x7ff91105a000 to list at 0x7ff911053510
>> >> >>
>> >> >> [2013/03/14 09:09:02.435367,  0]
>> >> >> winbindd/winbindd_dual.c:1399(fork_domain_child)
>> >> >>
>> >> >>   adding 0x7ff91105a000 to list at 0x7ff91105a000
>> >> >>
>> >> >> [2013/03/14 09:09:02.435510,  0] lib/util.c:1117(smb_panic)
>> >> >>
>> >> >>   PANIC (pid 28628): duplicate
>> >> >
>> >> > Jim, do your changes lead to a smb_panic on an
>> >> > DLIST_ADD of a duplicate ? If so, can you post
>> >> > a complete backtrace ?
>> >> >
>> >> > Jeremy.
>> >>
>> >>  [2013/03/14 09:09:02.437394,  0] lib/util.c:1221(log_stack_trace)
>> >>    BACKTRACE: 11 stack frames:
>> >>     #0 /usr/sbin/winbindd(log_stack_trace+0x1a) [0x7f02379883ba]
>> >>     #1 /usr/sbin/winbindd(smb_panic+0x2b) [0x7f023798848b]
>> >>     #2 /usr/sbin/winbindd(+0x10806e) [0x7f02378d206e]
>> >>     #3 /usr/sbin/winbindd(+0x1084b5) [0x7f02378d24b5]
>> >>     #4 /usr/sbin/winbindd(tevent_common_loop_immediate+0xea)
>> >> [0x7f023799917a] #5 /usr/sbin/winbindd(run_events_poll+0x3b)
>> >> [0x7f02379973db]
>> >>     #6 /usr/sbin/winbindd(+0x1cdb12) [0x7f0237997b12]
>> >>     #7 /usr/sbin/winbindd(_tevent_loop_once+0x90) [0x7f0237997f10]
>> >>     #8 /usr/sbin/winbindd(main+0x7db) [0x7f02378aa13b]
>> >>     #9 /lib64/libc.so.6(__libc_start_main+0xe6) [0x7f0234a2ec36]
>> >>     #10 /usr/sbin/winbindd(+0xdd839) [0x7f02378a7839]
>> >>
>> >>
>> >> Unfortunately their core file doesn't seem quite right...this was out
>> >> of their logs and I'm having trouble with the core, but this fscking
>> >> splint keeps me moving at a snail's pace
>> >
>> > Jim, can you port this patch to master and ask Dylan to run with the
>> > patch so
>> > we get a full backtrace?
>> Here it is on master...
>>
>>
>>
>>
>> --
>> Jim McDonough
>> Samba Team
>> SUSE labs
>> jmcd at samba dot org
>> jmcd at themcdonoughs dot org
>
>



--
Jim McDonough
Samba Team
SUSE labs
jmcd at samba dot org
jmcd at themcdonoughs dot org
Reply | Threaded
Open this post in threaded view
|

Re: Winbind using 100% CPU

Nimrod Sapir
Jim McDonough <jmcd <at> samba.org> writes:

>
> On Thu, Apr 18, 2013 at 12:16 PM, Dylan Klomparens
> <dylan.klomparens <at> gmail.com> wrote:
> > The patch has exposed the problem. I've attached the output from winbind.
> > Please let me know if you need additional information!
> Wow, that's quick.  Any other clues, like what was going on on the
> system then?  what was causing winbind to do work?  How many trusted
> domains?
>
> >
> > -- Dylan
> >
> >

I seem to have the exact same problem in my environment - I have three
domains, and it seems that at some point multiple winbindd instances go into
infinite loop in the same location in winbindd_dual, because of duplication
in the winbindd_child list:

(gdb) print winbindd_children
$1 = (struct winbindd_child *) 0x74bb58
(gdb) print winbindd_children->next
$2 = (struct winbindd_child *) 0x74bb58

Is there any progress of investigating it? I will keep my logs/core dumps
and provide them if needed.

Thanks!
Nimrod