Proposed ldb 1.1.30 and tdb 1.3.13 (improve AD DC search performance)

classic Classic list List threaded Threaded
18 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Proposed ldb 1.1.30 and tdb 1.3.13 (improve AD DC search performance)

Samba - samba-technical mailing list
G'Day Metze,

Attached is a patch series to bump up the ldb and tdb version numbers
with important fixes for performance, resulting from the locking work
discussed in the other thread, and the performance improvements found
by using Samba's binary search over the list of attributes with an
index.

You will no doubt want to carefully consider the tdb patches along with
the rest of the team, and asked to be the one to push the ldb changes
so you could tag the release.

This passes a private autobuild.

Please let me know if there is anything else I can do to assist.

Thanks,

Andrew Bartlett

ldb-1.1.30-for-metze.patch (1K) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Proposed ldb 1.1.30 and tdb 1.3.13 (improve AD DC search performance)

Samba - samba-technical mailing list
Hi Andrew,

> Attached is a patch series to bump up the ldb and tdb version numbers
> with important fixes for performance, resulting from the locking work
> discussed in the other thread, and the performance improvements found
> by using Samba's binary search over the list of attributes with an
> index.
>
> You will no doubt want to carefully consider the tdb patches along with
> the rest of the team, and asked to be the one to push the ldb changes
> so you could tag the release.
>
> This passes a private autobuild.
>
> Please let me know if there is anything else I can do to assist.
Please retry again with --stdout :-)

metze


signature.asc (853 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Proposed ldb 1.1.30 and tdb 1.3.13 (improve AD DC search performance)

Samba - samba-technical mailing list
On Mon, 2017-04-03 at 13:04 +0200, Stefan Metzmacher wrote:

> Hi Andrew,
>
> > Attached is a patch series to bump up the ldb and tdb version
> > numbers
> > with important fixes for performance, resulting from the locking
> > work
> > discussed in the other thread, and the performance improvements
> > found
> > by using Samba's binary search over the list of attributes with an
> > index.
> >
> > You will no doubt want to carefully consider the tdb patches along
> > with
> > the rest of the team, and asked to be the one to push the ldb
> > changes
> > so you could tag the release.
> >
> > This passes a private autobuild.
> >
> > Please let me know if there is anything else I can do to assist.
>
> Please retry again with --stdout :-)
:-)

http://git.catalyst.net.nz/gw?p=samba.git;a=shortlog;h=refs/heads/ldb-1
.1.30-for-metze

Thanks,

Andrew Bartlett

--
Andrew Bartlett                       http://samba.org/~abartlet/
Authentication Developer, Samba Team  http://samba.org
Samba Developer, Catalyst IT          http://catalyst.net.nz/services/samba

ldb-1.1.30-for-metze.patch (81K) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[PATCH] Re: Proposed ldb 1.1.30 and tdb 1.3.13 (improve AD DC search performance, make multi-process)

Samba - samba-technical mailing list
On Mon, 2017-04-03 at 23:48 +1200, Andrew Bartlett via samba-technical
wrote:

> On Mon, 2017-04-03 at 13:04 +0200, Stefan Metzmacher wrote:
> > Hi Andrew,
> >
> > > Attached is a patch series to bump up the ldb and tdb version
> > > numbers
> > > with important fixes for performance, resulting from the locking
> > > work
> > > discussed in the other thread, and the performance improvements
> > > found
> > > by using Samba's binary search over the list of attributes with
> > > an
> > > index.
> > >
> > > You will no doubt want to carefully consider the tdb patches
> > > along
> > > with
> > > the rest of the team, and asked to be the one to push the ldb
> > > changes
> > > so you could tag the release.
> > >
> > > This passes a private autobuild.
> > >
> > > Please let me know if there is anything else I can do to assist.
> >
> > Please retry again with --stdout :-)
>
> :-)
>
> http://git.catalyst.net.nz/gw?p=samba.git;a=shortlog;h=refs/heads/ldb
> -1
> .1.30-for-metze
Attached is the whole set of patches I'm currently trying to land, as
well as just the set discussed.  I've improved the whitespace.

The end purpose I'm trying to end up with is the AD DC set to use
multiple processes for LDAP, and fixing up the locking issues, so that
is more practical.  I've written tests to show where becoming multi-
process costs, and where it helps, and I'm happy with the trade-off.  

The attached graph illustrates it well.  (Run on my workstation, so
fairly noisy.  I may be able to get better graphs later).

For the multi-process, the worst case is when one client connects,
binds, searches for one thing and drops the socket, but this is a
pretty un-realistic workload, and once we get multiple clients this
change helps.

Thanks,

Andrew Bartlett
--
Andrew Bartlett
https://samba.org/~abartlet/
Authentication Developer, Samba Team         https://samba.org
Samba Development and Support, Catalyst IT  
https://catalyst.net.nz/services/samba




ldb-1.1.30-for-metze.patch (79K) Download Attachment
multi-process-ldap-server.patch (99K) Download Attachment
multi-process-ldap-server.pdf (90K) Download Attachment
multi-process-ldap-server-abs.pdf (88K) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [PATCH] Re: Proposed ldb 1.1.30 and tdb 1.3.13 (improve AD DC search performance, make multi-process)

Samba - samba-technical mailing list
On Wed, 2017-04-05 at 12:19 +1200, Andrew Bartlett wrote:
>
> Attached is the whole set of patches I'm currently trying to land, as
> well as just the set discussed.  I've improved the whitespace.

Attached is an updated patch of the whole series, to avoid confusion.

Thanks,

Andrew Bartlett
--
Andrew Bartlett
https://samba.org/~abartlet/
Authentication Developer, Samba Team         https://samba.org
Samba Development and Support, Catalyst IT  
https://catalyst.net.nz/services/samba




ldap-multi-process.patch (107K) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [PATCH] Re: Proposed ldb 1.1.30 and tdb 1.3.13 (improve AD DC search performance, make multi-process)

Samba - samba-technical mailing list
Am 05.04.2017 um 07:06 schrieb Andrew Bartlett via samba-technical:
> On Wed, 2017-04-05 at 12:19 +1200, Andrew Bartlett wrote:
>>
>> Attached is the whole set of patches I'm currently trying to land, as
>> well as just the set discussed.  I've improved the whitespace.
>
> Attached is an updated patch of the whole series, to avoid confusion.

Here's an update on current master.

I've added the run-fcntl-deadlock as test.

I still want to run the standalone make test of tdb and ldb on solaris,
before we push this and have a closer look at the ldb changes.

metze

tmp.diff.txt (98K) Download Attachment
signature.asc (853 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [PATCH] Re: Proposed ldb 1.1.30 and tdb 1.3.13 (improve AD DC search performance, make multi-process)

Samba - samba-technical mailing list
On Tue, 2017-04-11 at 18:08 +0200, Stefan Metzmacher wrote:
>
> Here's an update on current master.
>
> I've added the run-fcntl-deadlock as test.
>
> I still want to run the standalone make test of tdb and ldb on
> solaris,
> before we push this and have a closer look at the ldb changes.

Thanks!

Finally, this is may also be a 'data corruption' issue:  

My theory is that because there is no read lock held, the index might
refer to different records compared with the data.  Sometimes that will
fall back to a traverse (we do that on errors), but sometimes that will
mean we don't return all the records, as we essentially check twice,
both in the index and in the final search filter.  

If a search took a long time, a modify could happen between those two
points, and while the modify would be atomic, the search would not be
atomic.

On the flip side, with this patch reads will block all writes for
longer.

BTW, A test for the LDB changes is in the cmocka thread.  Because LDB
is buggy it passes right now, but fails if we do have the ldb fix but
don't have the tdb fix.

Andrew Bartlett

--
Andrew Bartlett                       http://samba.org/~abartlet/
Authentication Developer, Samba Team  http://samba.org
Samba Developer, Catalyst IT          http://catalyst.net.nz/services/samba


Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [PATCH] Re: Proposed ldb 1.1.30 and tdb 1.3.13 (improve AD DC search performance, make multi-process)

Samba - samba-technical mailing list
Hi Andrew,

>> Here's an update on current master.
>>
>> I've added the run-fcntl-deadlock as test.
>>
>> I still want to run the standalone make test of tdb and ldb on
>> solaris,
>> before we push this and have a closer look at the ldb changes.
>
> Thanks!
>
> Finally, this is may also be a 'data corruption' issue:  
>
> My theory is that because there is no read lock held, the index might
> refer to different records compared with the data.  Sometimes that will
> fall back to a traverse (we do that on errors), but sometimes that will
> mean we don't return all the records, as we essentially check twice,
> both in the index and in the final search filter.  
>
> If a search took a long time, a modify could happen between those two
> points, and while the modify would be atomic, the search would not be
> atomic.
I don't understand what you're trying to say...

Do you mean we need to call ltdb_lock_read() in a wider window?
E.g. when starting the ldb_search() from the top level module stack
until end of that search? Currently only searches are only atomic
within ltdb_search(), but not ldb_search().

> On the flip side, with this patch reads will block all writes for
> longer.

And here...?

> BTW, A test for the LDB changes is in the cmocka thread.  Because LDB
> is buggy it passes right now, but fails if we do have the ldb fix but
> don't have the tdb fix.

And here...

Be explicit please and refer to specific commits and tests.

Thanks!
metze


signature.asc (853 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [PATCH] Re: Proposed ldb 1.1.30 and tdb 1.3.13 (improve AD DC search performance, make multi-process)

Samba - samba-technical mailing list
On Tue, 2017-04-11 at 21:40 +0200, Stefan Metzmacher wrote:

> Hi Andrew,
>
> > > Here's an update on current master.
> > >
> > > I've added the run-fcntl-deadlock as test.
> > >
> > > I still want to run the standalone make test of tdb and ldb on
> > > solaris,
> > > before we push this and have a closer look at the ldb changes.
> >
> > Thanks!
> >
> > Finally, this is may also be a 'data corruption' issue:  
> >
> > My theory is that because there is no read lock held, the index
> > might
> > refer to different records compared with the data.  Sometimes that
> > will
> > fall back to a traverse (we do that on errors), but sometimes that
> > will
> > mean we don't return all the records, as we essentially check
> > twice,
> > both in the index and in the final search filter.  
> >
> > If a search took a long time, a modify could happen between those
> > two
> > points, and while the modify would be atomic, the search would not
> > be
> > atomic.
>
> I don't understand what you're trying to say...
>
> Do you mean we need to call ltdb_lock_read() in a wider window?
> E.g. when starting the ldb_search() from the top level module stack
> until end of that search? Currently only searches are only atomic
> within ltdb_search(), but not ldb_search().

They are not even that atomic!  The bug that started all this was that
ltdb_search() does not actually hold a lock (after the first read).  

I've not considered the implications further up the stack at this
point...

> > On the flip side, with this patch reads will block all writes for
> > longer.
>
> And here...?

If we actually hold a lock in ltdb_search(), then we will block writes
more often.  That is, what we fix in '[PATCH 07/17] ldb_tdb: Ensure we
correctly decrement ltdb->read_lock_count'.

> > BTW, A test for the LDB changes is in the cmocka thread.  Because
> > LDB
> > is buggy it passes right now, but fails if we do have the ldb fix
> > but
> > don't have the tdb fix. 
>
> And here...
>
> Be explicit please and refer to specific commits and tests.

https://lists.samba.org/archive/samba-technical/2017-April/119857.html
has
http://lists.samba.org/pipermail/samba-technical/attachments/20170407/4
a60faea/0002-ldb-Add-test-for-transaction-deadlock-detected-when-.bin

I'll merge the other cmocka tests shortly, then hope to work with
Andreas to get this on merged.  

When we Garming tried to fix the refcounting ([PATCH 07/17] ldb_tdb:
Ensure we correctly decrement ltdb->read_lock_count) and so hold the
lock in ltdb_search(), make test failed with deadlock detected.  This
test re-creates this situation manually, for easy validation.

Thanks,

Andrew Bartlett

--
Andrew Bartlett                       http://samba.org/~abartlet/
Authentication Developer, Samba Team  http://samba.org
Samba Developer, Catalyst IT          http://catalyst.net.nz/services/samba


Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Read corruption possible during ldb_search (was: Re: [PATCH] Re: Proposed ldb 1.1.30 and tdb 1.3.13 (improve AD DC search performance, make multi-process))

Samba - samba-technical mailing list
I would like to make progress on the ldb and tdb locking issue.

I've now included a patch that demonstrates that with ldb_tdb 1.1.29 we
allow writes to interfere with indexed reads, meaning that if a write
happens during a search, the search output is not consistent.

To demonstrate it, run ./bin/ldb_tdb_mod_op_test in a lib/ldb
standalone build with this commit reverted:
"ldb_tdb: Ensure we correctly decrement ltdb->read_lock_count"

Therefore the locking changes are more than just a performance issue,
and I would request that we move forward with it, given the inability
to isolate the possible decade-old locking issue on Solaris.

In the alternate, could you please propose a route forward on this?  

Thanks!

Andrew Bartlett

--
Andrew Bartlett                       http://samba.org/~abartlet/
Authentication Developer, Samba Team  http://samba.org
Samba Developer, Catalyst IT          http://catalyst.net.nz/services/samba

multi-process-and-ldb-1.1.30.patch.txt (159K) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [PATCH] Re: Proposed ldb 1.1.30 and tdb 1.3.13 (improve AD DC search performance, make multi-process)

Samba - samba-technical mailing list
In reply to this post by Samba - samba-technical mailing list
Am 11.04.2017 um 22:23 schrieb Andrew Bartlett:

> On Tue, 2017-04-11 at 21:40 +0200, Stefan Metzmacher wrote:
>> Hi Andrew,
>>
>>>> Here's an update on current master.
>>>>
>>>> I've added the run-fcntl-deadlock as test.
>>>>
>>>> I still want to run the standalone make test of tdb and ldb on
>>>> solaris,
>>>> before we push this and have a closer look at the ldb changes.
>>>
>>> Thanks!
>>>
>>> Finally, this is may also be a 'data corruption' issue:  
>>>
>>> My theory is that because there is no read lock held, the index
>>> might
>>> refer to different records compared with the data.  Sometimes that
>>> will
>>> fall back to a traverse (we do that on errors), but sometimes that
>>> will
>>> mean we don't return all the records, as we essentially check
>>> twice,
>>> both in the index and in the final search filter.  
>>>
>>> If a search took a long time, a modify could happen between those
>>> two
>>> points, and while the modify would be atomic, the search would not
>>> be
>>> atomic.
>>
>> I don't understand what you're trying to say...
>>
>> Do you mean we need to call ltdb_lock_read() in a wider window?
>> E.g. when starting the ldb_search() from the top level module stack
>> until end of that search? Currently only searches are only atomic
>> within ltdb_search(), but not ldb_search().
>
> They are not even that atomic!  The bug that started all this was that
> ltdb_search() does not actually hold a lock (after the first read).  
>
> I've not considered the implications further up the stack at this
> point...
>
>>> On the flip side, with this patch reads will block all writes for
>>> longer.
>>
>> And here...?
>
> If we actually hold a lock in ltdb_search(), then we will block writes
> more often.  That is, what we fix in '[PATCH 07/17] ldb_tdb: Ensure we
> correctly decrement ltdb->read_lock_count'.

>>> BTW, A test for the LDB changes is in the cmocka thread.  Because
>>> LDB
>>> is buggy it passes right now, but fails if we do have the ldb fix
>>> but
>>> don't have the tdb fix.
>>
>> And here...
>>
>> Be explicit please and refer to specific commits and tests.
>
> https://lists.samba.org/archive/samba-technical/2017-April/119857.html
> has
> http://lists.samba.org/pipermail/samba-technical/attachments/20170407/4
> a60faea/0002-ldb-Add-test-for-transaction-deadlock-detected-when-.bin
>
> I'll merge the other cmocka tests shortly, then hope to work with
> Andreas to get this on merged.  
>
> When we Garming tried to fix the refcounting ([PATCH 07/17] ldb_tdb:
> Ensure we correctly decrement ltdb->read_lock_count) and so hold the
> lock in ltdb_search(), make test failed with deadlock detected.  This
> test re-creates this situation manually, for easy validation.
I'm totally lost sorry...

My understanding of your arguments in this thread is that
you try to tell me that there's still a bug even with the patches applied.

And I don't yet understand why...

metze


signature.asc (853 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Read corruption possible during ldb_search

Samba - samba-technical mailing list
In reply to this post by Samba - samba-technical mailing list
Am 25.04.2017 um 13:08 schrieb Andrew Bartlett:

> I would like to make progress on the ldb and tdb locking issue.
>
> I've now included a patch that demonstrates that with ldb_tdb 1.1.29 we
> allow writes to interfere with indexed reads, meaning that if a write
> happens during a search, the search output is not consistent.
>
> To demonstrate it, run ./bin/ldb_tdb_mod_op_test in a lib/ldb
> standalone build with this commit reverted:
> "ldb_tdb: Ensure we correctly decrement ltdb->read_lock_count"
>
> Therefore the locking changes are more than just a performance issue,
> and I would request that we move forward with it, given the inability
> to isolate the possible decade-old locking issue on Solaris.
>
> In the alternate, could you please propose a route forward on this?  
I used this branch
https://git.samba.org/?p=abartlet/samba.git/.git;a=shortlog;h=refs/heads/multi-process-ldap-server
(with this top commit)
https://git.samba.org/?p=abartlet/samba.git/.git;a=commit;h=04f8106c705ed3c7573dfd5db1e41e04a1111769
and combined that with the patches I posted here:
https://lists.samba.org/archive/samba-technical/2017-April/119979.html

The result is attached and available here:
https://git.samba.org/?p=metze/samba/wip.git;a=shortlog;h=refs/heads/master4-ldb

So my most important question:
Do you think the patches in my branch fix all of our known deadlock/
data corruption bugs?

All mails in this thread indicated to me that there're more bugs to be
fixed.
But this could also be a misunderstanding.

If there're no more bugs to be fixed and I start with a more deep
review, otherwise please base your additions on top of my branch.

Regarding the Solaris deadlock problem, I guess we can ignore it
as the test we wrote also works on Solaris 10, but I'll run the
full standalone testsuites for ldb and tdb again before it might
be pushed to master.

Thanks!
metze

tmp.diff.txt (180K) Download Attachment
signature.asc (853 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Read corruption possible during ldb_search

Samba - samba-technical mailing list
As far as I know, all the known deadlocks and data corruption should be
fixed in this patchset.

What I think Andrew was trying to say was that there was no direct test
of my patch ([PATCH 07/17] ldb_tdb: Ensure we correctly decrement
ltdb->read_lock_count) which fixes the original read consistency issue
we were attempting to resolve. Instead he wrote a cmocka test to show
that separately (prior to that we only detected it through spurious
deadlocks during make test).


Cheers,

Garming

On 26/04/17 00:20, Stefan Metzmacher via samba-technical wrote:

> Am 25.04.2017 um 13:08 schrieb Andrew Bartlett:
>> I would like to make progress on the ldb and tdb locking issue.
>>
>> I've now included a patch that demonstrates that with ldb_tdb 1.1.29 we
>> allow writes to interfere with indexed reads, meaning that if a write
>> happens during a search, the search output is not consistent.
>>
>> To demonstrate it, run ./bin/ldb_tdb_mod_op_test in a lib/ldb
>> standalone build with this commit reverted:
>> "ldb_tdb: Ensure we correctly decrement ltdb->read_lock_count"
>>
>> Therefore the locking changes are more than just a performance issue,
>> and I would request that we move forward with it, given the inability
>> to isolate the possible decade-old locking issue on Solaris.
>>
>> In the alternate, could you please propose a route forward on this?  
> I used this branch
> https://git.samba.org/?p=abartlet/samba.git/.git;a=shortlog;h=refs/heads/multi-process-ldap-server
> (with this top commit)
> https://git.samba.org/?p=abartlet/samba.git/.git;a=commit;h=04f8106c705ed3c7573dfd5db1e41e04a1111769
> and combined that with the patches I posted here:
> https://lists.samba.org/archive/samba-technical/2017-April/119979.html
>
> The result is attached and available here:
> https://git.samba.org/?p=metze/samba/wip.git;a=shortlog;h=refs/heads/master4-ldb
>
> So my most important question:
> Do you think the patches in my branch fix all of our known deadlock/
> data corruption bugs?
>
> All mails in this thread indicated to me that there're more bugs to be
> fixed.
> But this could also be a misunderstanding.
>
> If there're no more bugs to be fixed and I start with a more deep
> review, otherwise please base your additions on top of my branch.
>
> Regarding the Solaris deadlock problem, I guess we can ignore it
> as the test we wrote also works on Solaris 10, but I'll run the
> full standalone testsuites for ldb and tdb again before it might
> be pushed to master.
>
> Thanks!
> metze


signature.asc (501 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Read corruption possible during ldb_search

Samba - samba-technical mailing list
On Wed, 2017-04-26 at 10:03 +1200, Garming Sam wrote:

> As far as I know, all the known deadlocks and data corruption should
> be
> fixed in this patchset.
>
> What I think Andrew was trying to say was that there was no direct
> test
> of my patch ([PATCH 07/17] ldb_tdb: Ensure we correctly decrement
> ltdb->read_lock_count) which fixes the original read consistency
> issue
> we were attempting to resolve. Instead he wrote a cmocka test to show
> that separately (prior to that we only detected it through spurious
> deadlocks during make test).

Thanks.  That was the case last night, sadly in more autobuilds I'm
still seeing deadlock detected so I'm continuing to investigate.

ldb: ltdb: tdb(/home/ubuntu/autobuild/b27129/samba/bin/ab/fl2008r2dc/private/sam.ldb): tdb_transaction_prepare_commit: failed to upgrade hash locks: Locking error

ldb: ltdb: tdb(/home/ubuntu/autobuild/b27129/samba/bin/ab/fl2008r2dc/private/sam.ldb): tdb_transaction_cancel: no transaction

ldb: dsdb_set_schema() failed: 51:Busy: Failure during tdb_transaction_prepare_commit(): Locking error -> Busy

I think we have this pattern:

(1) start search (allrecord read lock)
(1) transaction start (transaction read lock)
(2) transaction start (transaction read lock)
(1) traverse (requesting transaction write lock, then write locks)
(2) some operation (likely attempting to write new schema index list to
@INDEXLIST)
(2) transaction prepare commit (requesting all-record lock held by (1))
-> DEADLOCK detected, because the

I'll write another cmocka test when I get a moment to prove it.  I
think the fix is to not request the transaction write lock in the
traverse.

Thanks,

Andrew Bartlett
--
Andrew Bartlett                       http://samba.org/~abartlet/
Authentication Developer, Samba Team  http://samba.org
Samba Developer, Catalyst IT          http://catalyst.net.nz/services/samba


Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Read corruption possible during ldb_search

Samba - samba-technical mailing list
Am 26.04.2017 um 05:23 schrieb Andrew Bartlett:

> On Wed, 2017-04-26 at 10:03 +1200, Garming Sam wrote:
>> As far as I know, all the known deadlocks and data corruption should
>> be
>> fixed in this patchset.
>>
>> What I think Andrew was trying to say was that there was no direct
>> test
>> of my patch ([PATCH 07/17] ldb_tdb: Ensure we correctly decrement
>> ltdb->read_lock_count) which fixes the original read consistency
>> issue
>> we were attempting to resolve. Instead he wrote a cmocka test to show
>> that separately (prior to that we only detected it through spurious
>> deadlocks during make test).
>
> Thanks.  That was the case last night, sadly in more autobuilds I'm
> still seeing deadlock detected so I'm continuing to investigate.
>
> ldb: ltdb: tdb(/home/ubuntu/autobuild/b27129/samba/bin/ab/fl2008r2dc/private/sam.ldb): tdb_transaction_prepare_commit: failed to upgrade hash locks: Locking error
>
> ldb: ltdb: tdb(/home/ubuntu/autobuild/b27129/samba/bin/ab/fl2008r2dc/private/sam.ldb): tdb_transaction_cancel: no transaction
>
> ldb: dsdb_set_schema() failed: 51:Busy: Failure during tdb_transaction_prepare_commit(): Locking error -> Busy
I also got this:

[1997(12157)/2099 at 2h5m31s]
samba4.urgent_replication.python(ad_dc_ntvfs)(ad_dc_ntvfs:local)
WARNING: The "lsa over netlogon" option is deprecated
baseDN: DC=samba,DC=example,DC=com

ltdb:
tdb(/memdisk/metze/W/b51706/samba/bin/ab/ad_dc_ntvfs/private/sam.ldb):
tdb_transaction_prepare_commit: failed to upgrade hash locks: Locking error

ltdb:
tdb(/memdisk/metze/W/b51706/samba/bin/ab/ad_dc_ntvfs/private/sam.ldb):
tdb_transaction_cancel: no transaction

dsdb_set_schema() failed: 51:Busy: Failure during
tdb_transaction_prepare_commit(): Locking error -> Busy
UNEXPECTED(error):
samba4.urgent_replication.python(ad_dc_ntvfs).__main__.UrgentReplicationTests.test_attributeSchema_object(ad_dc_ntvfs:local)
REASON: Exception: Exception: Traceback (most recent call last):
  File
"/memdisk/metze/W/b51706/samba/source4/dsdb/tests/python/urgent_replication.py",
line 176, in test_attributeSchema_object
    self.ldb.modify(m)
LdbError: (1, 'ldb_wait from
../source4/dsdb/samdb/ldb_modules/util.c:369 with LDB_WAIT_ALL:
Operations error (1)')

FAILED (0 failures, 1 errors and 0 unexpected successes in 0 testsuites)

> I think we have this pattern:
>
> (1) start search (allrecord read lock)
> (1) transaction start (transaction read lock)

This gets the transaction write lock

> (2) transaction start (transaction read lock)

And this blocks.

> (1) traverse (requesting transaction write lock, then write locks)
> (2) some operation (likely attempting to write new schema index list to
> @INDEXLIST)

Can't happen as transaction_start blocks...

> (2) transaction prepare commit (requesting all-record lock held by (1))
> -> DEADLOCK detected, because the
>
> I'll write another cmocka test when I get a moment to prove it.  I
> think the fix is to not request the transaction write lock in the
> traverse.

While debugging a gencache_stabilize() problem, I also thought about
just getting the transaction read lock during a traverse.

But first we have to understand the above problem.

metze


signature.asc (853 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Read corruption possible during ldb_search

Samba - samba-technical mailing list
On Wed, 2017-04-26 at 08:32 +0200, Stefan Metzmacher wrote:

> Am 26.04.2017 um 05:23 schrieb Andrew Bartlett:
> > On Wed, 2017-04-26 at 10:03 +1200, Garming Sam wrote:
> > > As far as I know, all the known deadlocks and data corruption
> > > should
> > > be
> > > fixed in this patchset.
> > >
> > > What I think Andrew was trying to say was that there was no
> > > direct
> > > test
> > > of my patch ([PATCH 07/17] ldb_tdb: Ensure we correctly decrement
> > > ltdb->read_lock_count) which fixes the original read consistency
> > > issue
> > > we were attempting to resolve. Instead he wrote a cmocka test to
> > > show
> > > that separately (prior to that we only detected it through
> > > spurious
> > > deadlocks during make test).
> >
> > Thanks.  That was the case last night, sadly in more autobuilds I'm
> > still seeing deadlock detected so I'm continuing to investigate. 
> >
> > ldb: ltdb:
> > tdb(/home/ubuntu/autobuild/b27129/samba/bin/ab/fl2008r2dc/private/s
> > am.ldb): tdb_transaction_prepare_commit: failed to upgrade hash
> > locks: Locking error
> >
> > ldb: ltdb:
> > tdb(/home/ubuntu/autobuild/b27129/samba/bin/ab/fl2008r2dc/private/s
> > am.ldb): tdb_transaction_cancel: no transaction
> >
> > ldb: dsdb_set_schema() failed: 51:Busy: Failure during
> > tdb_transaction_prepare_commit(): Locking error -> Busy
>
> I also got this:
>
> [1997(12157)/2099 at 2h5m31s]
> samba4.urgent_replication.python(ad_dc_ntvfs)(ad_dc_ntvfs:local)
> WARNING: The "lsa over netlogon" option is deprecated
> baseDN: DC=samba,DC=example,DC=com
>
> ltdb:
> tdb(/memdisk/metze/W/b51706/samba/bin/ab/ad_dc_ntvfs/private/sam.ldb)
> :
> tdb_transaction_prepare_commit: failed to upgrade hash locks: Locking
> error
>
> ltdb:
> tdb(/memdisk/metze/W/b51706/samba/bin/ab/ad_dc_ntvfs/private/sam.ldb)
> :
> tdb_transaction_cancel: no transaction
>
> dsdb_set_schema() failed: 51:Busy: Failure during
> tdb_transaction_prepare_commit(): Locking error -> Busy
> UNEXPECTED(error):
> samba4.urgent_replication.python(ad_dc_ntvfs).__main__.UrgentReplicat
> ionTests.test_attributeSchema_object(ad_dc_ntvfs:local)
> REASON: Exception: Exception: Traceback (most recent call last):
>   File
> "/memdisk/metze/W/b51706/samba/source4/dsdb/tests/python/urgent_repli
> cation.py",
> line 176, in test_attributeSchema_object
>     self.ldb.modify(m)
> LdbError: (1, 'ldb_wait from
> ../source4/dsdb/samdb/ldb_modules/util.c:369 with LDB_WAIT_ALL:
> Operations error (1)')
>
> FAILED (0 failures, 1 errors and 0 unexpected successes in 0
> testsuites)
>
> > I think we have this pattern:
> >
> > (1) start search (allrecord read lock)
> > (1) transaction start (transaction read lock)
>
> This gets the transaction write lock

OK, I'll try again, perhaps:

(1) start transaction (transaction write lock)
(1)                   (all-record read lock)
(2) start search      (all-record read lock)
(1) traverse          (record write lock - blocked)
(2) start transaction (transaction write lock - deadlock)

Clearly I'll need to catch one of the failing autobuilds and check the
lslocks, and/or reproduce in another cmocka test.

The about might happen when the dsdb_get_schema() code writes out a new
@ATTRIBUTES entry.  A workaround might be to only write that out from
the schema_update_now hook and on DB load, not in each and every
process that notices the schema has changed, often during a search.

> But first we have to understand the above problem.

Indeed.

Andrew Bartlett
--
Andrew Bartlett                       http://samba.org/~abartlet/
Authentication Developer, Samba Team  http://samba.org
Samba Developer, Catalyst IT          http://catalyst.net.nz/services/samba


Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Read corruption possible during ldb_search

Samba - samba-technical mailing list
Am 28.04.2017 um 06:15 schrieb Andrew Bartlett:

> On Wed, 2017-04-26 at 08:32 +0200, Stefan Metzmacher wrote:
>> Am 26.04.2017 um 05:23 schrieb Andrew Bartlett:
>>> On Wed, 2017-04-26 at 10:03 +1200, Garming Sam wrote:
>>>> As far as I know, all the known deadlocks and data corruption
>>>> should
>>>> be
>>>> fixed in this patchset.
>>>>
>>>> What I think Andrew was trying to say was that there was no
>>>> direct
>>>> test
>>>> of my patch ([PATCH 07/17] ldb_tdb: Ensure we correctly decrement
>>>> ltdb->read_lock_count) which fixes the original read consistency
>>>> issue
>>>> we were attempting to resolve. Instead he wrote a cmocka test to
>>>> show
>>>> that separately (prior to that we only detected it through
>>>> spurious
>>>> deadlocks during make test).
>>>
>>> Thanks.  That was the case last night, sadly in more autobuilds I'm
>>> still seeing deadlock detected so I'm continuing to investigate.
>>>
>>> ldb: ltdb:
>>> tdb(/home/ubuntu/autobuild/b27129/samba/bin/ab/fl2008r2dc/private/s
>>> am.ldb): tdb_transaction_prepare_commit: failed to upgrade hash
>>> locks: Locking error
>>>
>>> ldb: ltdb:
>>> tdb(/home/ubuntu/autobuild/b27129/samba/bin/ab/fl2008r2dc/private/s
>>> am.ldb): tdb_transaction_cancel: no transaction
>>>
>>> ldb: dsdb_set_schema() failed: 51:Busy: Failure during
>>> tdb_transaction_prepare_commit(): Locking error -> Busy
>>
>> I also got this:
>>
>> [1997(12157)/2099 at 2h5m31s]
>> samba4.urgent_replication.python(ad_dc_ntvfs)(ad_dc_ntvfs:local)
>> WARNING: The "lsa over netlogon" option is deprecated
>> baseDN: DC=samba,DC=example,DC=com
>>
>> ltdb:
>> tdb(/memdisk/metze/W/b51706/samba/bin/ab/ad_dc_ntvfs/private/sam.ldb)
>> :
>> tdb_transaction_prepare_commit: failed to upgrade hash locks: Locking
>> error
>>
>> ltdb:
>> tdb(/memdisk/metze/W/b51706/samba/bin/ab/ad_dc_ntvfs/private/sam.ldb)
>> :
>> tdb_transaction_cancel: no transaction
>>
>> dsdb_set_schema() failed: 51:Busy: Failure during
>> tdb_transaction_prepare_commit(): Locking error -> Busy
>> UNEXPECTED(error):
>> samba4.urgent_replication.python(ad_dc_ntvfs).__main__.UrgentReplicat
>> ionTests.test_attributeSchema_object(ad_dc_ntvfs:local)
>> REASON: Exception: Exception: Traceback (most recent call last):
>>   File
>> "/memdisk/metze/W/b51706/samba/source4/dsdb/tests/python/urgent_repli
>> cation.py",
>> line 176, in test_attributeSchema_object
>>     self.ldb.modify(m)
>> LdbError: (1, 'ldb_wait from
>> ../source4/dsdb/samdb/ldb_modules/util.c:369 with LDB_WAIT_ALL:
>> Operations error (1)')
>>
>> FAILED (0 failures, 1 errors and 0 unexpected successes in 0
>> testsuites)
>>
>>> I think we have this pattern:
>>>
>>> (1) start search (allrecord read lock)
>>> (1) transaction start (transaction read lock)
>>
>> This gets the transaction write lock
>
> OK, I'll try again, perhaps:
>
> (1) start transaction (transaction write lock)
> (1)                   (all-record read lock)
> (2) start search      (all-record read lock)
> (1) traverse          (record write lock - blocked)
A traverse never gets record write locks
only tdb_do_delete() uses tdb_write_lock_record().

tdb_[un]lock_record() are a noop during a transaction
because we have the allrecord lock.

tdb_write_lock_record() will always fail during a transaction
because we only have the allrecord read lock.

> (2) start transaction (transaction write lock - deadlock)

This would already fail the tdb_have_extra_locks() check,
before trying the transaction write lock.

> Clearly I'll need to catch one of the failing autobuilds and check the
> lslocks, and/or reproduce in another cmocka test.

Remember the bug we're after happens during prepare_commit(),
I'm wondering if 3 processes need to be involved for the bug to happen.

> The about might happen when the dsdb_get_schema() code writes out a new
> @ATTRIBUTES entry.  A workaround might be to only write that out from
> the schema_update_now hook and on DB load, not in each and every
> process that notices the schema has changed, often during a search.

Yes, that would be better I guess, or we also store the value from
schemaInfo into a @SCHEMAINFO object, so that we can skip flushing
the loaded schema if it's already the same.

I guess dsdb_schema_set_indices_and_attributes()
would be the place to check that.
dsdb_schema_set_indices_and_attributes() should also
use a transaction for the update in order to consistently
update all objects at the same time.

And even if this schema change helps we need to fix the original
bug...

metze


signature.asc (853 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Read corruption possible during ldb_search

Samba - samba-technical mailing list
On Fri, 2017-04-28 at 10:46 +0200, Stefan Metzmacher wrote:
>
> And even if this schema change helps we need to fix the original
> bug...

Indeed.  I guess I better get it under lslocks.

Andrew Bartlett

--
Andrew Bartlett                       http://samba.org/~abartlet/
Authentication Developer, Samba Team  http://samba.org
Samba Developer, Catalyst IT          http://catalyst.net.nz/services/samba


Loading...