Bugzilla

Updated

•

2 years ago

Severity: -- → S3

Priority: -- → P2

Rosanna

Comment 4

•

2 years ago

hi jstutte. This is currently blocking a test suite. Is there someone that can work on this fix?

Thanks!

Flags: needinfo?(jstutte)

Comment 5

•

2 years ago

•

https://treeherder.mozilla.org/jobs?repo=try&revision=f680a5144a74c177c580601945c79f92523a5fc0

Investigating:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=77f69a5d6f10e892d9ffebf1a215c829c95027fe

Comment 6

•

2 years ago

Hi Michelle, it seems I am not able to reproduce the failures? Am I overlooking something obvious on how to start the tests?

Flags: needinfo?(jstutte) → needinfo?(mgoossens)

Reporter

Comment 7

•

2 years ago

The commit you used already contains a lot of disabled tests. Try using https://hg.mozilla.org/try/rev/87e931bd817ffb6b32b94b1d9dad752f2579a028 instead, this shouldn't contain any additional disabled tests.

Flags: needinfo?(mgoossens)

Comment 8

•

2 years ago

•

(In reply to Michelle Goossens [:masterwayz] from comment #7)

The commit you used already contains a lot of disabled tests. Try using https://hg.mozilla.org/try/rev/87e931bd817ffb6b32b94b1d9dad752f2579a028 instead, this shouldn't contain any additional disabled tests.

I enabled the IndexedDB tests, AFAICS? Isn't that enough? Can you confirm that I triggered a run on GCP with my patch or not?

Flags: needinfo?(mgoossens)

Reporter

Comment 9

•

2 years ago

Odd, just made a new push to see what happened, rebased on latest central.
Will update the findings here, leaving the NI in place.

Comment 10

•

2 years ago

Thanks!

Comment 11

•

2 years ago

(FWIW, I pushed also another try just enabling tests without other changes: https://treeherder.mozilla.org/jobs?repo=try&revision=e7210217a9f9b37f149a995f3eb943773dba9deb - I'll get back on Monday here)

Reporter

Comment 12

•

2 years ago

I see them here, so not fixed on its own: https://treeherder.mozilla.org/jobs?repo=try&revision=ebb4f089edead4b945ec933d5e252f619eceb396

Flags: needinfo?(mgoossens) → needinfo?(jstutte)

Comment 13

•

2 years ago

•

Yeah, also my latest push shows the error. The only difference to my other test is some more logging. There most be some fragile race here?

Flags: needinfo?(jstutte)

Reporter

Comment 14

•

2 years ago

Maybe a race condition? I have no clue but that is very odd that just adding logging fixes it.

Comment 15

•

2 years ago

(In reply to Michelle Goossens [:masterwayz] from comment #14)

Maybe a race condition? I have no clue but that is very odd that just adding logging fixes it.

Likely a race, as I retriggered the test with logs and one of the runs shows the failure, too.

Comment 16

•

2 years ago

•

After two days of printf debugging and try pushes I think I have a trail now. Please note that the order of debug messages is confusing: The printf_stderr messages are printed earlier than the Javascript info messages. Reproposing the log snippet here in the assumed order of events:

[task 2022-06-27T13:40:41.613Z] 13:40:41     INFO - TEST-START | dom/indexedDB/test/test_object_identity.html
...
[task 2022-06-27T13:46:10.535Z] 13:46:10     INFO - Clearing old databases
[task 2022-06-27T13:46:10.536Z] 13:46:10     INFO - clearAllDatabases then nextTestHarnessStep
> [task 2022-06-27T13:40:41.672Z] 13:40:41     INFO - GECKO(10730) | State_Initial (numOps:1)
> [task 2022-06-27T13:40:41.674Z] 13:40:41     INFO - GECKO(10730) | State_DirectoryWorkOpen (numOps:1)
> [task 2022-06-27T13:40:41.675Z] 13:40:41     INFO - GECKO(10730) | State_UnblockingOpen (numOps:1)
> [task 2022-06-27T13:40:41.676Z] 13:40:41     INFO - GECKO(10730) | QuotaRequestBase::SendResults
> [task 2022-06-27T13:40:41.677Z] 13:40:41     INFO - GECKO(10730) | ClearStoragesForPrincipal request initiated
> [task 2022-06-27T13:40:41.679Z] 13:40:41     INFO - GECKO(10730) | Request::FireCallback fired.
[task 2022-06-27T13:46:10.536Z] 13:46:10     INFO - nextTestHarnessStep done: false
[task 2022-06-27T13:46:10.536Z] 13:46:10     INFO - Running test in worker
[task 2022-06-27T13:46:10.536Z] 13:46:10     INFO - Running test in worker as GeneratorFunction
[task 2022-06-27T13:46:10.537Z] 13:46:10     INFO - TEST-PASS | dom/indexedDB/test/test_object_identity.html | Unsupported function type 
[task 2022-06-27T13:46:10.537Z] 13:46:10     INFO - executeWorkerTestAndCleanUp for http://mochi.test:8888/tests/dom/indexedDB/test/unit/test_object_identity.js
[task 2022-06-27T13:46:10.537Z] 13:46:10     INFO - nextTestHarnessStep done: false
[task 2022-06-27T13:46:10.538Z] 13:46:10     INFO - Worker: loading ["http://mochi.test:8888/tests/dom/indexedDB/test/unit/test_object_identity.js"]
[task 2022-06-27T13:46:10.538Z] 13:46:10     INFO - Worker: starting tests
[task 2022-06-27T13:46:10.538Z] 13:46:10     INFO - TEST-PASS | dom/indexedDB/test/test_object_identity.html | Unsupported function type: undefined 
... (some more tests passing) ...
[task 2022-06-27T13:46:10.543Z] 13:46:10     INFO - TEST-PASS | dom/indexedDB/test/test_object_identity.html | Worker finished 
[task 2022-06-27T13:46:10.544Z] 13:46:10     INFO - Terminating the worker
[task 2022-06-27T13:46:10.544Z] 13:46:10     INFO - Cleaning up the databases
[task 2022-06-27T13:46:10.545Z] 13:46:10     INFO - clearAllDatabases then 
> [task 2022-06-27T13:40:41.726Z] 13:40:41     INFO - GECKO(10730) | State_Initial (numOps:1)
> [task 2022-06-27T13:40:41.726Z] 13:40:41     INFO - GECKO(10730) | DirectoryLockImpl::Overlaps -> true
> [task 2022-06-27T13:40:41.727Z] 13:40:41     INFO - GECKO(10730) | ClearStoragesForPrincipal request initiated
[task 2022-06-27T13:46:10.545Z] 13:46:10     INFO - Buffered messages finished
[task 2022-06-27T13:46:10.547Z] 13:46:10     INFO - TEST-UNEXPECTED-FAIL | dom/indexedDB/test/test_object_identity.html | Test timed out. -

At the second call to clearAllDatabases we end up initializing the ClearOrigin operation, finding another directory lock active (DirectoryLockImpl::Overlaps -> true). But according to State_Initial (numOps:1) we are the only existing operation, so nothing will ever unlock us again and we are stuck.

So something breaks our lock bookkeeping, and it can well be some remainder from a previous test run for the same origin. Assuming that the bookkeeping is generally well tested, and given the apparent race, my best bet would be that we need to have a mutex that guards all the locklist manipulations and lookups in DirectoryLockImpl to ensure that we always have a consistent state of those lists.

Jan, wdyt?

Flags: needinfo?(jvarga)

Jan Varga [:janv]

Assignee

Comment 17

•

2 years ago

Jens, thanks a lot for the info, I'll look what's wrong.

Comment 18

•

2 years ago

Just summarizing our conversation on slack:

a Mutex is not supposed to help, all DirectoryLock accesses run on on the same thread (and we trust the relative asserts)
there can be other calls to Acquire that are not coming from operations

We should still add more info to the logs for better understanding, Jan is looking into this.

Jan Varga [:janv]

Assignee

Updated

•

2 years ago

Depends on: 1777914