Re: smbd crash in a CTDB cluster


Re: smbd crash in a CTDB cluster

Samba - samba-technical mailing list
+samba-technical

On 2017-08-11 22:44, [hidden email] wrote:

> Hi all,
>
> In a 4-node Samba (v4.6.3) CTDB cluster (with 4 public IPs), smbd
> crashes were seen with the following backtrace:
>
> Core was generated by `/usr/sbin/smbd'.
> Program terminated with signal 6, Aborted.
> #0  0x00007f1d26d4a1f7 in raise () from /lib64/libc.so.6
> (gdb) bt
> #0  0x00007f1d26d4a1f7 in raise () from /lib64/libc.so.6
> #1  0x00007f1d26d4b8e8 in abort () from /lib64/libc.so.6
> #2  0x00007f1d286d04de in dump_core () at ../source3/lib/dumpcore.c:338
> #3  0x00007f1d286c16e7 in smb_panic_s3 (why=<optimized out>) at
> ../source3/lib/util.c:814
> #4  0x00007f1d2a79c95f in smb_panic (why=why@entry=0x7f1d2a7e482a
> "internal error") at ../lib/util/fault.c:166
> #5  0x00007f1d2a79cb76 in fault_report (sig=<optimized out>) at
> ../lib/util/fault.c:83
> #6  sig_fault (sig=<optimized out>) at ../lib/util/fault.c:94
> #7  <signal handler called>
> #8  messaging_ctdbd_reinit (msg_ctx=msg_ctx@entry=0x56508d0e3800,
> mem_ctx=mem_ctx@entry=0x56508d0e3800, backend=0x0)
>     at ../source3/lib/messages_ctdbd.c:278
> #9  0x00007f1d286ccd40 in messaging_reinit
> (msg_ctx=msg_ctx@entry=0x56508d0e3800) at
> ../source3/lib/messages.c:415
> #10 0x00007f1d286c0ec9 in reinit_after_fork (msg_ctx=0x56508d0e3800,
> ev_ctx=<optimized out>,
>     parent_longlived=parent_longlived@entry=true,
> comment=comment@entry=0x0) at ../source3/lib/util.c:475
> #11 0x00007f1d286dbafa in background_job_waited
> (subreq=0x56508d0ec8e0) at ../source3/lib/background.c:179
> #12 0x00007f1d270e1c97 in tevent_common_loop_timer_delay
> (ev=0x56508d0e2d10) at ../tevent_timed.c:369
> #13 0x00007f1d270e2f49 in epoll_event_loop (tvalp=0x7fffa1f7ca70,
> epoll_ev=0x56508d0e2f90) at ../tevent_epoll.c:659
> #14 epoll_event_loop_once (ev=<optimized out>, location=<optimized
> out>) at ../tevent_epoll.c:930
> #15 0x00007f1d270e12a7 in std_event_loop_once (ev=0x56508d0e2d10,
> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>     at ../tevent_standard.c:114
> #16 0x00007f1d270dd0cd in _tevent_loop_once
> (ev=ev@entry=0x56508d0e2d10,
>     location=location@entry=0x56508bde85d9
> "../source3/smbd/server.c:1384") at ../tevent.c:721
> #17 0x00007f1d270dd2fb in tevent_common_loop_wait (ev=0x56508d0e2d10,
> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>     at ../tevent.c:844
> #18 0x00007f1d270e1247 in std_event_loop_wait (ev=0x56508d0e2d10,
> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>     at ../tevent_standard.c:145
> #19 0x000056508bddfa95 in smbd_parent_loop (parent=<optimized out>,
> ev_ctx=0x56508d0e2d10) at ../source3/smbd/server.c:1384
> #20 main (argc=<optimized out>, argv=<optimized out>) at
> ../source3/smbd/server.c:2038
>
> (gdb) f 8
> #8  messaging_ctdbd_reinit (msg_ctx=msg_ctx@entry=0x56508d0e3800,
> mem_ctx=mem_ctx@entry=0x56508d0e3800, backend=0x0)
>     at ../source3/lib/messages_ctdbd.c:278
> 278 struct messaging_ctdbd_context *ctx = talloc_get_type_abort(
>
> (gdb) l
> 273
> 274 int messaging_ctdbd_reinit(struct messaging_context *msg_ctx,
> 275   TALLOC_CTX *mem_ctx,
> 276   struct messaging_backend *backend)
> 277 {
> 278 struct messaging_ctdbd_context *ctx = talloc_get_type_abort(
> 279 backend->private_data, struct messaging_ctdbd_context);
> 280 int ret;
> 281
> 282 ret = messaging_ctdbd_init_internal(msg_ctx, mem_ctx, ctx, true);
>
> (gdb) p backend
> $1 = (struct messaging_backend *) 0x0
>
> (gdb) p *msg_ctx
> $1 = {id = {pid = 17264, task_id = 0, vnn = 4294967295, unique_id =
> 4569628117635137227}, event_ctx = 0x56508d0e2d10,
>   callbacks = 0x56508d0fa250, new_waiters = 0x0, num_new_waiters = 0,
> waiters = 0x0, num_waiters = 0, msg_dgm_ref = 0x56508d0e6ac0,
>   remote = 0x0, names_db = 0x56508d0e3cf0}
>
> Since the core files were only noticed later, it is hard to recollect
> the exact scenario that caused smbd to panic and dump core. Please find
> the corresponding logs attached to this mail (log level is default, so
> not very helpful). Is there any chance that msg_ctx->remote can be NULL
> in this code path? The value of vnn also looks strange.
>
> Anoop C S

Attachment: log.smbd (9K)
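For reference, the immediate mechanics of the crash in frame #8 can be reproduced with a tiny stand-alone C sketch (stand-in structures, not the Samba sources): backend is NULL, so the expression backend->private_data faults before talloc_get_type_abort() ever runs, and the resulting SIGSEGV is what the fault handler turns into smb_panic("internal error"). The gdb dump above suggests messaging_reinit() hands msg_ctx->remote to messaging_ctdbd_reinit() as the backend argument, since both show up as 0x0.

#include <stdio.h>

/* Stand-in structure, not the real struct messaging_backend. */
struct backend_sketch { void *private_data; };

static int ctdbd_reinit_sketch(struct backend_sketch *backend)
{
        /* Mirrors messages_ctdbd.c:278: backend->private_data is read
         * before any talloc type check, so backend == NULL means a
         * SIGSEGV rather than a clean error return. */
        void *priv = backend->private_data;
        printf("private_data = %p\n", priv);
        return 0;
}

int main(void)
{
        return ctdbd_reinit_sketch(NULL);  /* backend == 0x0: dies with SIGSEGV */
}

As a side note, vnn = 4294967295 is just (uint32_t)-1, which presumably means "no vnn assigned yet" and would be consistent with the ctdbd connection never having been (re)established; that reading is an assumption, not verified against the 4.6 sources.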

Re: smbd crash in a CTDB cluster

Samba - samba-technical mailing list
On Fri, Aug 11, 2017 at 10:40 AM, Anoop C S via samba-technical
<[hidden email]> wrote:

> +samba-technical
>
> On 2017-08-11 22:44, [hidden email] wrote:
>>
>> Hi all,
>>
>> In a 4 node Samba(v4.6.3) CTDB cluster(with 4 public IPs), smbd
>> crashes were seen with the following back trace:
>>
>> Core was generated by `/usr/sbin/smbd'.
>> Program terminated with signal 6, Aborted.
>> #0  0x00007f1d26d4a1f7 in raise () from /lib64/libc.so.6
>> (gdb) bt
>> #0  0x00007f1d26d4a1f7 in raise () from /lib64/libc.so.6
>> #1  0x00007f1d26d4b8e8 in abort () from /lib64/libc.so.6
>> #2  0x00007f1d286d04de in dump_core () at ../source3/lib/dumpcore.c:338
>> #3  0x00007f1d286c16e7 in smb_panic_s3 (why=<optimized out>) at
>> ../source3/lib/util.c:814
>> #4  0x00007f1d2a79c95f in smb_panic (why=why@entry=0x7f1d2a7e482a
>> "internal error") at ../lib/util/fault.c:166
>> #5  0x00007f1d2a79cb76 in fault_report (sig=<optimized out>) at
>> ../lib/util/fault.c:83
>> #6  sig_fault (sig=<optimized out>) at ../lib/util/fault.c:94
>> #7  <signal handler called>
>> #8  messaging_ctdbd_reinit (msg_ctx=msg_ctx@entry=0x56508d0e3800,
>> mem_ctx=mem_ctx@entry=0x56508d0e3800, backend=0x0)
>>     at ../source3/lib/messages_ctdbd.c:278
>> #9  0x00007f1d286ccd40 in messaging_reinit
>> (msg_ctx=msg_ctx@entry=0x56508d0e3800) at
>> ../source3/lib/messages.c:415
>> #10 0x00007f1d286c0ec9 in reinit_after_fork (msg_ctx=0x56508d0e3800,
>> ev_ctx=<optimized out>,
>>     parent_longlived=parent_longlived@entry=true,
>> comment=comment@entry=0x0) at ../source3/lib/util.c:475
>> #11 0x00007f1d286dbafa in background_job_waited
>> (subreq=0x56508d0ec8e0) at ../source3/lib/background.c:179
>> #12 0x00007f1d270e1c97 in tevent_common_loop_timer_delay
>> (ev=0x56508d0e2d10) at ../tevent_timed.c:369
>> #13 0x00007f1d270e2f49 in epoll_event_loop (tvalp=0x7fffa1f7ca70,
>> epoll_ev=0x56508d0e2f90) at ../tevent_epoll.c:659
>> #14 epoll_event_loop_once (ev=<optimized out>, location=<optimized
>> out>) at ../tevent_epoll.c:930
>> #15 0x00007f1d270e12a7 in std_event_loop_once (ev=0x56508d0e2d10,
>> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>>     at ../tevent_standard.c:114
>> #16 0x00007f1d270dd0cd in _tevent_loop_once (ev=ev@entry=0x56508d0e2d10,
>>     location=location@entry=0x56508bde85d9
>> "../source3/smbd/server.c:1384") at ../tevent.c:721
>> #17 0x00007f1d270dd2fb in tevent_common_loop_wait (ev=0x56508d0e2d10,
>> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>>     at ../tevent.c:844
>> #18 0x00007f1d270e1247 in std_event_loop_wait (ev=0x56508d0e2d10,
>> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>>     at ../tevent_standard.c:145
>> #19 0x000056508bddfa95 in smbd_parent_loop (parent=<optimized out>,
>> ev_ctx=0x56508d0e2d10) at ../source3/smbd/server.c:1384
>> #20 main (argc=<optimized out>, argv=<optimized out>) at
>> ../source3/smbd/server.c:2038

This is quite normal if the node was banned when the smbd was forked.
What does the ctdb log show? What was happening at that time?

>> (gdb) f 8
>> #8  messaging_ctdbd_reinit (msg_ctx=msg_ctx@entry=0x56508d0e3800,
>> mem_ctx=mem_ctx@entry=0x56508d0e3800, backend=0x0)
>>     at ../source3/lib/messages_ctdbd.c:278
>> 278             struct messaging_ctdbd_context *ctx =
>> talloc_get_type_abort(
>>
>> (gdb) l
>> 273
>> 274     int messaging_ctdbd_reinit(struct messaging_context *msg_ctx,
>> 275                                TALLOC_CTX *mem_ctx,
>> 276                                struct messaging_backend *backend)
>> 277     {
>> 278             struct messaging_ctdbd_context *ctx =
>> talloc_get_type_abort(
>> 279                     backend->private_data, struct
>> messaging_ctdbd_context);
>> 280             int ret;
>> 281
>> 282             ret = messaging_ctdbd_init_internal(msg_ctx, mem_ctx, ctx,
>> true);
>>
>> (gdb) p backend
>> $1 = (struct messaging_backend *) 0x0
>>
>> (gdb) p *msg_ctx
>> $1 = {id = {pid = 17264, task_id = 0, vnn = 4294967295, unique_id =
>> 4569628117635137227}, event_ctx = 0x56508d0e2d10,
>>   callbacks = 0x56508d0fa250, new_waiters = 0x0, num_new_waiters = 0,
>> waiters = 0x0, num_waiters = 0, msg_dgm_ref = 0x56508d0e6ac0,
>>   remote = 0x0, names_db = 0x56508d0e3cf0}
>>
>> Since core files were observed later it is hard to recollect the
>> scenario which could have caused smbd to panic and dump the core.
>> Please find corresponding logs attached to this mail(log level is
>> default. not very helpful). Is there any chance by which
>> msg_ctx->remote can be NULL in this code path? Also the value for vnn
>> also looks strange..
>>
>> Anoop C S



--
Regards,
Richard Sharpe
(何以解憂?唯有杜康。--曹操)


Re: smbd crash in a CTDB cluster

Samba - samba-technical mailing list
On 2017-08-11 22:47, Richard Sharpe wrote:

> On Fri, Aug 11, 2017 at 10:40 AM, Anoop C S via samba-technical
> <[hidden email]> wrote:
>> +samba-technical
>>
>> On 2017-08-11 22:44, [hidden email] wrote:
>>>
>>> Hi all,
>>>
>>> In a 4 node Samba(v4.6.3) CTDB cluster(with 4 public IPs), smbd
>>> crashes were seen with the following back trace:
>>>
>>> Core was generated by `/usr/sbin/smbd'.
>>> Program terminated with signal 6, Aborted.
>>> #0  0x00007f1d26d4a1f7 in raise () from /lib64/libc.so.6
>>> (gdb) bt
>>> #0  0x00007f1d26d4a1f7 in raise () from /lib64/libc.so.6
>>> #1  0x00007f1d26d4b8e8 in abort () from /lib64/libc.so.6
>>> #2  0x00007f1d286d04de in dump_core () at
>>> ../source3/lib/dumpcore.c:338
>>> #3  0x00007f1d286c16e7 in smb_panic_s3 (why=<optimized out>) at
>>> ../source3/lib/util.c:814
>>> #4  0x00007f1d2a79c95f in smb_panic (why=why@entry=0x7f1d2a7e482a
>>> "internal error") at ../lib/util/fault.c:166
>>> #5  0x00007f1d2a79cb76 in fault_report (sig=<optimized out>) at
>>> ../lib/util/fault.c:83
>>> #6  sig_fault (sig=<optimized out>) at ../lib/util/fault.c:94
>>> #7  <signal handler called>
>>> #8  messaging_ctdbd_reinit (msg_ctx=msg_ctx@entry=0x56508d0e3800,
>>> mem_ctx=mem_ctx@entry=0x56508d0e3800, backend=0x0)
>>>     at ../source3/lib/messages_ctdbd.c:278
>>> #9  0x00007f1d286ccd40 in messaging_reinit
>>> (msg_ctx=msg_ctx@entry=0x56508d0e3800) at
>>> ../source3/lib/messages.c:415
>>> #10 0x00007f1d286c0ec9 in reinit_after_fork (msg_ctx=0x56508d0e3800,
>>> ev_ctx=<optimized out>,
>>>     parent_longlived=parent_longlived@entry=true,
>>> comment=comment@entry=0x0) at ../source3/lib/util.c:475
>>> #11 0x00007f1d286dbafa in background_job_waited
>>> (subreq=0x56508d0ec8e0) at ../source3/lib/background.c:179
>>> #12 0x00007f1d270e1c97 in tevent_common_loop_timer_delay
>>> (ev=0x56508d0e2d10) at ../tevent_timed.c:369
>>> #13 0x00007f1d270e2f49 in epoll_event_loop (tvalp=0x7fffa1f7ca70,
>>> epoll_ev=0x56508d0e2f90) at ../tevent_epoll.c:659
>>> #14 epoll_event_loop_once (ev=<optimized out>, location=<optimized
>>> out>) at ../tevent_epoll.c:930
>>> #15 0x00007f1d270e12a7 in std_event_loop_once (ev=0x56508d0e2d10,
>>> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>>>     at ../tevent_standard.c:114
>>> #16 0x00007f1d270dd0cd in _tevent_loop_once
>>> (ev=ev@entry=0x56508d0e2d10,
>>>     location=location@entry=0x56508bde85d9
>>> "../source3/smbd/server.c:1384") at ../tevent.c:721
>>> #17 0x00007f1d270dd2fb in tevent_common_loop_wait (ev=0x56508d0e2d10,
>>> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>>>     at ../tevent.c:844
>>> #18 0x00007f1d270e1247 in std_event_loop_wait (ev=0x56508d0e2d10,
>>> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>>>     at ../tevent_standard.c:145
>>> #19 0x000056508bddfa95 in smbd_parent_loop (parent=<optimized out>,
>>> ev_ctx=0x56508d0e2d10) at ../source3/smbd/server.c:1384
>>> #20 main (argc=<optimized out>, argv=<optimized out>) at
>>> ../source3/smbd/server.c:2038
>
> This is quite normal if the node was banned when the smbd was forked.
> What does the ctdb log show? What was happening at that time?

I think the logs got rotated and were cleaned up over time. I vaguely
remember that the cluster was not in a healthy state initially due to
some network issue, but I am not sure whether any nodes were in the
BANNED state. I will try to dig that up and confirm your analysis.

Does that mean it is a deliberate panic from smbd? I am asking because
of the refactoring done in this area, which introduced
talloc_get_type_abort() from 4.5 onwards.
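For what it's worth, frames #4-#7 ("<signal handler called>" followed by
smb_panic("internal error")) look like the fault handler reacting to a
SIGSEGV rather than a deliberate abort from talloc_get_type_abort():
backend itself is NULL, so the process faults while evaluating
backend->private_data before any talloc type check can happen. For
illustration only (stand-in names, not the upstream code or a proposed
patch), a deliberate check would look more like this:

#include <stdio.h>
#include <stdlib.h>

/* Stand-in structure, not the real struct messaging_backend. */
struct backend_sketch { void *private_data; };

static void *get_ctx_or_panic(struct backend_sketch *backend)
{
        if (backend == NULL) {
                /* An explicit check turns the generic "internal error"
                 * SIGSEGV panic into a clearly labelled, deliberate one. */
                fprintf(stderr, "messaging backend is NULL\n");
                abort();        /* stands in for smb_panic() */
        }
        return backend->private_data;
}

int main(void)
{
        (void)get_ctx_or_panic(NULL);   /* aborts with a descriptive message */
        return 0;
}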

>
>>> (gdb) f 8
>>> #8  messaging_ctdbd_reinit (msg_ctx=msg_ctx@entry=0x56508d0e3800,
>>> mem_ctx=mem_ctx@entry=0x56508d0e3800, backend=0x0)
>>>     at ../source3/lib/messages_ctdbd.c:278
>>> 278             struct messaging_ctdbd_context *ctx =
>>> talloc_get_type_abort(
>>>
>>> (gdb) l
>>> 273
>>> 274     int messaging_ctdbd_reinit(struct messaging_context *msg_ctx,
>>> 275                                TALLOC_CTX *mem_ctx,
>>> 276                                struct messaging_backend *backend)
>>> 277     {
>>> 278             struct messaging_ctdbd_context *ctx =
>>> talloc_get_type_abort(
>>> 279                     backend->private_data, struct
>>> messaging_ctdbd_context);
>>> 280             int ret;
>>> 281
>>> 282             ret = messaging_ctdbd_init_internal(msg_ctx, mem_ctx,
>>> ctx,
>>> true);
>>>
>>> (gdb) p backend
>>> $1 = (struct messaging_backend *) 0x0
>>>
>>> (gdb) p *msg_ctx
>>> $1 = {id = {pid = 17264, task_id = 0, vnn = 4294967295, unique_id =
>>> 4569628117635137227}, event_ctx = 0x56508d0e2d10,
>>>   callbacks = 0x56508d0fa250, new_waiters = 0x0, num_new_waiters = 0,
>>> waiters = 0x0, num_waiters = 0, msg_dgm_ref = 0x56508d0e6ac0,
>>>   remote = 0x0, names_db = 0x56508d0e3cf0}
>>>
>>> Since core files were observed later it is hard to recollect the
>>> scenario which could have caused smbd to panic and dump the core.
>>> Please find corresponding logs attached to this mail(log level is
>>> default. not very helpful). Is there any chance by which
>>> msg_ctx->remote can be NULL in this code path? Also the value for vnn
>>> also looks strange..
>>>
>>> Anoop C S


Re: smbd crash in a CTDB cluster

Samba - samba-technical mailing list
On Fri, Aug 11, 2017 at 7:28 PM,  <[hidden email]> wrote:

> On 2017-08-11 22:47, Richard Sharpe wrote:
>>
>> On Fri, Aug 11, 2017 at 10:40 AM, Anoop C S via samba-technical
>> <[hidden email]> wrote:
>>>
>>> +samba-technical
>>>
>>> On 2017-08-11 22:44, [hidden email] wrote:
>>>>
>>>>
>>>> Hi all,
>>>>
>>>> In a 4 node Samba(v4.6.3) CTDB cluster(with 4 public IPs), smbd
>>>> crashes were seen with the following back trace:
>>>>
>>>> Core was generated by `/usr/sbin/smbd'.
>>>> Program terminated with signal 6, Aborted.
>>>> #0  0x00007f1d26d4a1f7 in raise () from /lib64/libc.so.6
>>>> (gdb) bt
>>>> #0  0x00007f1d26d4a1f7 in raise () from /lib64/libc.so.6
>>>> #1  0x00007f1d26d4b8e8 in abort () from /lib64/libc.so.6
>>>> #2  0x00007f1d286d04de in dump_core () at ../source3/lib/dumpcore.c:338
>>>> #3  0x00007f1d286c16e7 in smb_panic_s3 (why=<optimized out>) at
>>>> ../source3/lib/util.c:814
>>>> #4  0x00007f1d2a79c95f in smb_panic (why=why@entry=0x7f1d2a7e482a
>>>> "internal error") at ../lib/util/fault.c:166
>>>> #5  0x00007f1d2a79cb76 in fault_report (sig=<optimized out>) at
>>>> ../lib/util/fault.c:83
>>>> #6  sig_fault (sig=<optimized out>) at ../lib/util/fault.c:94
>>>> #7  <signal handler called>
>>>> #8  messaging_ctdbd_reinit (msg_ctx=msg_ctx@entry=0x56508d0e3800,
>>>> mem_ctx=mem_ctx@entry=0x56508d0e3800, backend=0x0)
>>>>     at ../source3/lib/messages_ctdbd.c:278
>>>> #9  0x00007f1d286ccd40 in messaging_reinit
>>>> (msg_ctx=msg_ctx@entry=0x56508d0e3800) at
>>>> ../source3/lib/messages.c:415
>>>> #10 0x00007f1d286c0ec9 in reinit_after_fork (msg_ctx=0x56508d0e3800,
>>>> ev_ctx=<optimized out>,
>>>>     parent_longlived=parent_longlived@entry=true,
>>>> comment=comment@entry=0x0) at ../source3/lib/util.c:475
>>>> #11 0x00007f1d286dbafa in background_job_waited
>>>> (subreq=0x56508d0ec8e0) at ../source3/lib/background.c:179
>>>> #12 0x00007f1d270e1c97 in tevent_common_loop_timer_delay
>>>> (ev=0x56508d0e2d10) at ../tevent_timed.c:369
>>>> #13 0x00007f1d270e2f49 in epoll_event_loop (tvalp=0x7fffa1f7ca70,
>>>> epoll_ev=0x56508d0e2f90) at ../tevent_epoll.c:659
>>>> #14 epoll_event_loop_once (ev=<optimized out>, location=<optimized
>>>> out>) at ../tevent_epoll.c:930
>>>> #15 0x00007f1d270e12a7 in std_event_loop_once (ev=0x56508d0e2d10,
>>>> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>>>>     at ../tevent_standard.c:114
>>>> #16 0x00007f1d270dd0cd in _tevent_loop_once (ev=ev@entry=0x56508d0e2d10,
>>>>     location=location@entry=0x56508bde85d9
>>>> "../source3/smbd/server.c:1384") at ../tevent.c:721
>>>> #17 0x00007f1d270dd2fb in tevent_common_loop_wait (ev=0x56508d0e2d10,
>>>> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>>>>     at ../tevent.c:844
>>>> #18 0x00007f1d270e1247 in std_event_loop_wait (ev=0x56508d0e2d10,
>>>> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>>>>     at ../tevent_standard.c:145
>>>> #19 0x000056508bddfa95 in smbd_parent_loop (parent=<optimized out>,
>>>> ev_ctx=0x56508d0e2d10) at ../source3/smbd/server.c:1384
>>>> #20 main (argc=<optimized out>, argv=<optimized out>) at
>>>> ../source3/smbd/server.c:2038
>>
>>
>> This is quite normal if the node was banned when the smbd was forked.
>> What does the ctdb log show? What was happening at that time?
>
>
> I think the logs got rotated and got cleaned up subsequently over time. I
> could barely remember that the cluster was not in healthy state at initial
> stage due to some network issue. In fact I am not sure whether node/nodes
> were in BANNED state or not. I will try to dig that up and confirm your
> analysis.
>
> Does that mean it is a deliberate panic from smbd? I am asking this because
> of the code re-factoring done in this area which introduces
> talloc_get_type_abort() from 4.5 onwards.

Samba will panic in reinit_after_fork if anything fails. There are
very important things going on there, including connecting to ctdb.
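The pattern described here -- a failure anywhere in reinit_after_fork()
being treated as fatal by its callers -- might look roughly like the
following sketch (simplified stand-in names and types, not the Samba
sources):

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for reinit_after_fork(): in the real code this re-establishes
 * messaging in the forked child, including the connection to ctdbd, and
 * reports failure if any of those steps fails. */
static bool reinit_after_fork_sketch(bool ctdbd_reachable)
{
        return ctdbd_reachable;
}

static void accept_connection_sketch(bool ctdbd_reachable)
{
        if (!reinit_after_fork_sketch(ctdbd_reachable)) {
                /* Stands in for smb_panic("reinit_after_fork() failed"). */
                fprintf(stderr, "reinit_after_fork() failed\n");
                abort();
        }
        printf("forked child continues serving the client\n");
}

int main(void)
{
        accept_connection_sketch(false);   /* e.g. node banned -> panic */
        return 0;
}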

--
Regards,
Richard Sharpe
(何以解憂?唯有杜康。--曹操)


Re: smbd crash in a CTDB cluster

Samba - samba-technical mailing list
On 2017-08-12 03:35, Richard Sharpe wrote:

> On Fri, Aug 11, 2017 at 7:28 PM,  <[hidden email]> wrote:
>> On 2017-08-11 22:47, Richard Sharpe wrote:
>>>
>>> On Fri, Aug 11, 2017 at 10:40 AM, Anoop C S via samba-technical
>>> <[hidden email]> wrote:
>>>>
>>>> +samba-technical
>>>>
>>>> On 2017-08-11 22:44, [hidden email] wrote:
>>>>>
>>>>>
>>>>> Hi all,
>>>>>
>>>>> In a 4 node Samba(v4.6.3) CTDB cluster(with 4 public IPs), smbd
>>>>> crashes were seen with the following back trace:
>>>>>
>>>>> Core was generated by `/usr/sbin/smbd'.
>>>>> Program terminated with signal 6, Aborted.
>>>>> #0  0x00007f1d26d4a1f7 in raise () from /lib64/libc.so.6
>>>>> (gdb) bt
>>>>> #0  0x00007f1d26d4a1f7 in raise () from /lib64/libc.so.6
>>>>> #1  0x00007f1d26d4b8e8 in abort () from /lib64/libc.so.6
>>>>> #2  0x00007f1d286d04de in dump_core () at
>>>>> ../source3/lib/dumpcore.c:338
>>>>> #3  0x00007f1d286c16e7 in smb_panic_s3 (why=<optimized out>) at
>>>>> ../source3/lib/util.c:814
>>>>> #4  0x00007f1d2a79c95f in smb_panic (why=why@entry=0x7f1d2a7e482a
>>>>> "internal error") at ../lib/util/fault.c:166
>>>>> #5  0x00007f1d2a79cb76 in fault_report (sig=<optimized out>) at
>>>>> ../lib/util/fault.c:83
>>>>> #6  sig_fault (sig=<optimized out>) at ../lib/util/fault.c:94
>>>>> #7  <signal handler called>
>>>>> #8  messaging_ctdbd_reinit (msg_ctx=msg_ctx@entry=0x56508d0e3800,
>>>>> mem_ctx=mem_ctx@entry=0x56508d0e3800, backend=0x0)
>>>>>     at ../source3/lib/messages_ctdbd.c:278
>>>>> #9  0x00007f1d286ccd40 in messaging_reinit
>>>>> (msg_ctx=msg_ctx@entry=0x56508d0e3800) at
>>>>> ../source3/lib/messages.c:415
>>>>> #10 0x00007f1d286c0ec9 in reinit_after_fork
>>>>> (msg_ctx=0x56508d0e3800,
>>>>> ev_ctx=<optimized out>,
>>>>>     parent_longlived=parent_longlived@entry=true,
>>>>> comment=comment@entry=0x0) at ../source3/lib/util.c:475
>>>>> #11 0x00007f1d286dbafa in background_job_waited
>>>>> (subreq=0x56508d0ec8e0) at ../source3/lib/background.c:179
>>>>> #12 0x00007f1d270e1c97 in tevent_common_loop_timer_delay
>>>>> (ev=0x56508d0e2d10) at ../tevent_timed.c:369
>>>>> #13 0x00007f1d270e2f49 in epoll_event_loop (tvalp=0x7fffa1f7ca70,
>>>>> epoll_ev=0x56508d0e2f90) at ../tevent_epoll.c:659
>>>>> #14 epoll_event_loop_once (ev=<optimized out>, location=<optimized
>>>>> out>) at ../tevent_epoll.c:930
>>>>> #15 0x00007f1d270e12a7 in std_event_loop_once (ev=0x56508d0e2d10,
>>>>> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>>>>>     at ../tevent_standard.c:114
>>>>> #16 0x00007f1d270dd0cd in _tevent_loop_once
>>>>> (ev=ev@entry=0x56508d0e2d10,
>>>>>     location=location@entry=0x56508bde85d9
>>>>> "../source3/smbd/server.c:1384") at ../tevent.c:721
>>>>> #17 0x00007f1d270dd2fb in tevent_common_loop_wait
>>>>> (ev=0x56508d0e2d10,
>>>>> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>>>>>     at ../tevent.c:844
>>>>> #18 0x00007f1d270e1247 in std_event_loop_wait (ev=0x56508d0e2d10,
>>>>> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>>>>>     at ../tevent_standard.c:145
>>>>> #19 0x000056508bddfa95 in smbd_parent_loop (parent=<optimized out>,
>>>>> ev_ctx=0x56508d0e2d10) at ../source3/smbd/server.c:1384
>>>>> #20 main (argc=<optimized out>, argv=<optimized out>) at
>>>>> ../source3/smbd/server.c:2038
>>>
>>>
>>> This is quite normal if the node was banned when the smbd was forked.
>>> What does the ctdb log show? What was happening at that time?
>>
>>
>> I think the logs got rotated and got cleaned up subsequently over
>> time. I
>> could barely remember that the cluster was not in healthy state at
>> initial
>> stage due to some network issue. In fact I am not sure whether
>> node/nodes
>> were in BANNED state or not. I will try to dig that up and confirm
>> your
>> analysis.
>>
>> Does that mean it is a deliberate panic from smbd? I am asking this
>> because
>> of the code re-factoring done in this area which introduces
>> talloc_get_type_abort() from 4.5 onwards.
>
> Samba will panic in reinit_after_fork if anything fails. There are
> very important things going on there, including connecting to ctdb.

I have seen such panics in situations where smbd is unable to connect to
ctdb, e.g. when non-public IPs are used for client connections and the
corresponding node is in DISCONNECTED/BANNED/STOPPED state. The
following is one of that kind:

(gdb) bt
#0  0x00007fdbcf38d1f7 in raise () from /lib64/libc.so.6
#1  0x00007fdbcf38e8e8 in abort () from /lib64/libc.so.6
#2  0x00007fdbd0d134de in dump_core () at ../source3/lib/dumpcore.c:338
#3  0x00007fdbd0d046e7 in smb_panic_s3 (why=<optimized out>) at
../source3/lib/util.c:814
#4  0x00007fdbd2ddf95f in smb_panic (why=why@entry=0x55f401ef8a50
"reinit_after_fork() failed") at ../lib/util/fault.c:166
#5  0x000055f401ef7af8 in smbd_accept_connection (ev=0x55f4023836c0,
fde=<optimized out>, flags=<optimized out>,
     private_data=<optimized out>) at ../source3/smbd/server.c:1014
#6  0x00007fdbcf725edb in epoll_event_loop (tvalp=0x7fff5b14c720,
epoll_ev=0x55f402383940) at ../tevent_epoll.c:728
#7  epoll_event_loop_once (ev=<optimized out>, location=<optimized out>)
at ../tevent_epoll.c:930
#8  0x00007fdbcf7242a7 in std_event_loop_once (ev=0x55f4023836c0,
location=0x55f401efb5d9 "../source3/smbd/server.c:1384")
     at ../tevent_standard.c:114
#9  0x00007fdbcf7200cd in _tevent_loop_once (ev=ev@entry=0x55f4023836c0,
     location=location@entry=0x55f401efb5d9
"../source3/smbd/server.c:1384") at ../tevent.c:721
#10 0x00007fdbcf7202fb in tevent_common_loop_wait (ev=0x55f4023836c0,
location=0x55f401efb5d9 "../source3/smbd/server.c:1384")
     at ../tevent.c:844
#11 0x00007fdbcf724247 in std_event_loop_wait (ev=0x55f4023836c0,
location=0x55f401efb5d9 "../source3/smbd/server.c:1384")
     at ../tevent_standard.c:145
#12 0x000055f401ef2a95 in smbd_parent_loop (parent=<optimized out>,
ev_ctx=0x55f4023836c0) at ../source3/smbd/server.c:1384
#13 main (argc=<optimized out>, argv=<optimized out>) at
../source3/smbd/server.c:2038

The above backtrace is different from what I am observing now: the
smb_panic message says "internal error" rather than "reinit_after_fork()
failed". Moreover, the presence of tevent_common_loop_timer_delay() in
the original backtrace, which is invoked when epoll_wait() times out
inside epoll_event_loop(), makes this situation look fishy to me. Adding
some details from the original backtrace:

(gdb) f 11
#11 0x00007f1d286dbafa in background_job_waited (subreq=0x56508d0ec8e0)
at ../source3/lib/background.c:179

(gdb) p *state
$1 = {ev = 0x56508d0e2d10, msg = 0x56508d0e3800, trigger_msgs = 0x0,
num_trigger_msgs = 0, parent_longlived = false,
   fn = 0x7f1d286cc420 <mess_parent_dgm_cleanup>, private_data =
0x56508d0e3800, wakeup_req = 0x0, pipe_fd = -1, pipe_req = 0x0}

(gdb) p *req
$2 = {async = {fn = 0x7f1d286cc2d0 <mess_parent_dgm_cleanup_done>,
private_data = 0x56508d0e3800}, data = 0x56508d0ec710,
   private_print = 0x0, private_cancel = 0x0, private_cleanup = {fn =
0x0, state = TEVENT_REQ_INIT}, internal = {
     private_type = 0x7f1d286f6457 "struct background_job_state",
create_location = 0x7f1d286f63b8 "../source3/lib/background.c:61",
     finish_location = 0x0, cancel_location = 0x0, state =
TEVENT_REQ_IN_PROGRESS, error = 0, trigger = 0x56508d0ec660,
     defer_callback_ev = 0x0, timer = 0x0}}

This made me suspect a possible race condition, but without more context
around epoll and messaging initialization I cannot prove it. Of course I
am not an expert here... just sharing my findings. I don't know whether
the issue is worth a second look. Please correct me if I have gone off
track in my debugging.
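Roughly, the chain I am suspecting looks like this (a minimal sketch
with made-up names, not the actual background.c/messages.c code): a
periodic tevent timer fires, the background-job handler forks, and the
forked child re-initializes messaging; if msg_ctx->remote is still NULL
at that moment, the ctdbd re-init dereferences a NULL backend and
faults, matching frames #8-#12.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-ins, not the real Samba structures. */
struct backend_sketch { void *private_data; };
struct msg_sketch { struct backend_sketch *remote; };

static void messaging_reinit_sketch(struct msg_sketch *msg)
{
        /* Mirrors frames #9/#8: the ctdbd backend is re-initialised with
         * no NULL check on msg->remote (SIGSEGV if remote == NULL). */
        printf("private_data = %p\n", msg->remote->private_data);
}

static void background_job_fired(struct msg_sketch *msg)
{
        /* Mirrors frame #11: the timer-driven background job forks and
         * the child re-initialises messaging after fork (frame #10). */
        pid_t pid = fork();
        if (pid == 0) {
                messaging_reinit_sketch(msg);
                _exit(0);
        }
        waitpid(pid, NULL, 0);
}

int main(void)
{
        struct msg_sketch msg = { .remote = NULL };   /* as seen in the core */
        background_job_fired(&msg);                   /* the forked child crashes */
        return 0;
}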

Anoop C S


Re: smbd crash in a CTDB cluster

Samba - samba-technical mailing list
On Sat, Aug 12, 2017 at 10:24 AM,  <[hidden email]> wrote:

> On 2017-08-12 03:35, Richard Sharpe wrote:
>>
>> On Fri, Aug 11, 2017 at 7:28 PM,  <[hidden email]> wrote:
>>>
>>> On 2017-08-11 22:47, Richard Sharpe wrote:
>>>>
>>>>
>>>> On Fri, Aug 11, 2017 at 10:40 AM, Anoop C S via samba-technical
>>>> <[hidden email]> wrote:
>>>>>
>>>>>
>>>>> +samba-technical
>>>>>
>>>>> On 2017-08-11 22:44, [hidden email] wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> In a 4 node Samba(v4.6.3) CTDB cluster(with 4 public IPs), smbd
>>>>>> crashes were seen with the following back trace:
>>>>>>
>>>>>> Core was generated by `/usr/sbin/smbd'.
>>>>>> Program terminated with signal 6, Aborted.
>>>>>> #0  0x00007f1d26d4a1f7 in raise () from /lib64/libc.so.6
>>>>>> (gdb) bt
>>>>>> #0  0x00007f1d26d4a1f7 in raise () from /lib64/libc.so.6
>>>>>> #1  0x00007f1d26d4b8e8 in abort () from /lib64/libc.so.6
>>>>>> #2  0x00007f1d286d04de in dump_core () at
>>>>>> ../source3/lib/dumpcore.c:338
>>>>>> #3  0x00007f1d286c16e7 in smb_panic_s3 (why=<optimized out>) at
>>>>>> ../source3/lib/util.c:814
>>>>>> #4  0x00007f1d2a79c95f in smb_panic (why=why@entry=0x7f1d2a7e482a
>>>>>> "internal error") at ../lib/util/fault.c:166
>>>>>> #5  0x00007f1d2a79cb76 in fault_report (sig=<optimized out>) at
>>>>>> ../lib/util/fault.c:83
>>>>>> #6  sig_fault (sig=<optimized out>) at ../lib/util/fault.c:94
>>>>>> #7  <signal handler called>
>>>>>> #8  messaging_ctdbd_reinit (msg_ctx=msg_ctx@entry=0x56508d0e3800,
>>>>>> mem_ctx=mem_ctx@entry=0x56508d0e3800, backend=0x0)
>>>>>>     at ../source3/lib/messages_ctdbd.c:278
>>>>>> #9  0x00007f1d286ccd40 in messaging_reinit
>>>>>> (msg_ctx=msg_ctx@entry=0x56508d0e3800) at
>>>>>> ../source3/lib/messages.c:415
>>>>>> #10 0x00007f1d286c0ec9 in reinit_after_fork (msg_ctx=0x56508d0e3800,
>>>>>> ev_ctx=<optimized out>,
>>>>>>     parent_longlived=parent_longlived@entry=true,
>>>>>> comment=comment@entry=0x0) at ../source3/lib/util.c:475
>>>>>> #11 0x00007f1d286dbafa in background_job_waited
>>>>>> (subreq=0x56508d0ec8e0) at ../source3/lib/background.c:179
>>>>>> #12 0x00007f1d270e1c97 in tevent_common_loop_timer_delay
>>>>>> (ev=0x56508d0e2d10) at ../tevent_timed.c:369
>>>>>> #13 0x00007f1d270e2f49 in epoll_event_loop (tvalp=0x7fffa1f7ca70,
>>>>>> epoll_ev=0x56508d0e2f90) at ../tevent_epoll.c:659
>>>>>> #14 epoll_event_loop_once (ev=<optimized out>, location=<optimized
>>>>>> out>) at ../tevent_epoll.c:930
>>>>>> #15 0x00007f1d270e12a7 in std_event_loop_once (ev=0x56508d0e2d10,
>>>>>> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>>>>>>     at ../tevent_standard.c:114
>>>>>> #16 0x00007f1d270dd0cd in _tevent_loop_once
>>>>>> (ev=ev@entry=0x56508d0e2d10,
>>>>>>     location=location@entry=0x56508bde85d9
>>>>>> "../source3/smbd/server.c:1384") at ../tevent.c:721
>>>>>> #17 0x00007f1d270dd2fb in tevent_common_loop_wait (ev=0x56508d0e2d10,
>>>>>> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>>>>>>     at ../tevent.c:844
>>>>>> #18 0x00007f1d270e1247 in std_event_loop_wait (ev=0x56508d0e2d10,
>>>>>> location=0x56508bde85d9 "../source3/smbd/server.c:1384")
>>>>>>     at ../tevent_standard.c:145
>>>>>> #19 0x000056508bddfa95 in smbd_parent_loop (parent=<optimized out>,
>>>>>> ev_ctx=0x56508d0e2d10) at ../source3/smbd/server.c:1384
>>>>>> #20 main (argc=<optimized out>, argv=<optimized out>) at
>>>>>> ../source3/smbd/server.c:2038
>>>>
>>>>
>>>>
>>>> This is quite normal if the node was banned when the smbd was forked.
>>>> What does the ctdb log show? What was happening at that time?
>>>
>>>
>>>
>>> I think the logs got rotated and got cleaned up subsequently over time. I
>>> could barely remember that the cluster was not in healthy state at
>>> initial
>>> stage due to some network issue. In fact I am not sure whether node/nodes
>>> were in BANNED state or not. I will try to dig that up and confirm your
>>> analysis.
>>>
>>> Does that mean it is a deliberate panic from smbd? I am asking this
>>> because
>>> of the code re-factoring done in this area which introduces
>>> talloc_get_type_abort() from 4.5 onwards.
>>
>>
>> Samba will panic in reinit_after_fork if anything fails. There are
>> very important things going on there, including connecting to ctdb.
>
>
> I have seen such panics in situations where smbd is unable to connect to
> ctdb when non-public IPs were used for client connection and corresponding
> node is in DISCONNECTED/BANNED/STOPPED state. Following is one of that kind:
>
> (gdb) bt
> #0  0x00007fdbcf38d1f7 in raise () from /lib64/libc.so.6
> #1  0x00007fdbcf38e8e8 in abort () from /lib64/libc.so.6
> #2  0x00007fdbd0d134de in dump_core () at ../source3/lib/dumpcore.c:338
> #3  0x00007fdbd0d046e7 in smb_panic_s3 (why=<optimized out>) at
> ../source3/lib/util.c:814
> #4  0x00007fdbd2ddf95f in smb_panic (why=why@entry=0x55f401ef8a50
> "reinit_after_fork() failed") at ../lib/util/fault.c:166
> #5  0x000055f401ef7af8 in smbd_accept_connection (ev=0x55f4023836c0,
> fde=<optimized out>, flags=<optimized out>,
>     private_data=<optimized out>) at ../source3/smbd/server.c:1014
> #6  0x00007fdbcf725edb in epoll_event_loop (tvalp=0x7fff5b14c720,
> epoll_ev=0x55f402383940) at ../tevent_epoll.c:728
> #7  epoll_event_loop_once (ev=<optimized out>, location=<optimized out>) at
> ../tevent_epoll.c:930
> #8  0x00007fdbcf7242a7 in std_event_loop_once (ev=0x55f4023836c0,
> location=0x55f401efb5d9 "../source3/smbd/server.c:1384")
>     at ../tevent_standard.c:114
> #9  0x00007fdbcf7200cd in _tevent_loop_once (ev=ev@entry=0x55f4023836c0,
>     location=location@entry=0x55f401efb5d9 "../source3/smbd/server.c:1384")
> at ../tevent.c:721
> #10 0x00007fdbcf7202fb in tevent_common_loop_wait (ev=0x55f4023836c0,
> location=0x55f401efb5d9 "../source3/smbd/server.c:1384")
>     at ../tevent.c:844
> #11 0x00007fdbcf724247 in std_event_loop_wait (ev=0x55f4023836c0,
> location=0x55f401efb5d9 "../source3/smbd/server.c:1384")
>     at ../tevent_standard.c:145
> #12 0x000055f401ef2a95 in smbd_parent_loop (parent=<optimized out>,
> ev_ctx=0x55f4023836c0) at ../source3/smbd/server.c:1384
> #13 main (argc=<optimized out>, argv=<optimized out>) at
> ../source3/smbd/server.c:2038
>
> The above backtrace is different from what I have observed right now as the
> smb_panic message says its an "internal error" rather than
> "reinit_after_fork() failed". Moreover the presence of
> tevent_common_loop_timer_delay() in the original backtrace which is invoked
> when epoll_wait() gets timed out inside epoll_event_loop() makes me think
> this situation to be fishy. Adding some details from original backtrace:
>
> (gdb) f 11
> #11 0x00007f1d286dbafa in background_job_waited (subreq=0x56508d0ec8e0) at
> ../source3/lib/background.c:179
>
> (gdb) p *state
> $1 = {ev = 0x56508d0e2d10, msg = 0x56508d0e3800, trigger_msgs = 0x0,
> num_trigger_msgs = 0, parent_longlived = false,
>   fn = 0x7f1d286cc420 <mess_parent_dgm_cleanup>, private_data =
> 0x56508d0e3800, wakeup_req = 0x0, pipe_fd = -1, pipe_req = 0x0}
>
> (gdb) p *req
> $2 = {async = {fn = 0x7f1d286cc2d0 <mess_parent_dgm_cleanup_done>,
> private_data = 0x56508d0e3800}, data = 0x56508d0ec710,
>   private_print = 0x0, private_cancel = 0x0, private_cleanup = {fn = 0x0,
> state = TEVENT_REQ_INIT}, internal = {
>     private_type = 0x7f1d286f6457 "struct background_job_state",
> create_location = 0x7f1d286f63b8 "../source3/lib/background.c:61",
>     finish_location = 0x0, cancel_location = 0x0, state =
> TEVENT_REQ_IN_PROGRESS, error = 0, trigger = 0x56508d0ec660,
>     defer_callback_ev = 0x0, timer = 0x0}}
>
> Somehow I thought of a possible race condition but without much context
> around epoll and messaging initialization I cannot prove. Of course I am not
> an expert... just sharing my findings. I don't know whether it is worth for
> a second thought on the issue. Please correct me if I completely diverted in
> my debugging process.

OK, you are correct. This is different.

--
Regards,
Richard Sharpe
(何以解憂?唯有杜康。--曹操)
