Closed Bug 1341621 Opened 8 years ago Closed 7 years ago

Running mochitests with --dmd can cause hang on startup

Categories

(Core :: DMD, defect)

Other Branch
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla54
Tracking Status
firefox54 --- fixed

People

(Reporter: kats, Assigned: erahm)

References

Details

Attachments

(4 files)

On a m-c build with the patch from bug 1341450 applied, run:
  DMD="--mode=live" ./mach mochitest --subsuite webgl --dmd
(for completeness, I was running the mach command inside xvfb-run, but I don't think that should make a difference)

This should run the mochitests with DMD enabled (i.e. libdmd.so preloaded) with DMD running in "live mode". And it does. However, it runs into a problem: When running the webgl subsuite, it sometimes hangs on start. Running `ps ef` shows a process tree where firefox has just spawned another firefox process, and when I attach to that child process I get the attached backtrace.

It seems like maybe the firefox process is forking while some thread is holding to a DMD mutex, and that results in a deadlock? I'm not really sure. It's intermittent but happens > 50% of the time for me.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #0)
> It seems like maybe the firefox process is forking while some thread is
> holding to a DMD mutex, and that results in a deadlock? I'm not really sure.
> It's intermittent but happens > 50% of the time for me.

I'd be really surprised we're just straight up forking here, as NUWA showed that's hard to do right. Kats looked a bit further and it appears to be fork/execve [1] so that hopefully isn't the issue.

Looking at the code we're not doing any error checking [2], adding some assertions there might help. My guess is more of a standard deadlock (maybe doing a measurement while allocating memory in another thread), looking at the other threads might give a clue. I'll see if I can repro locally.

[1] http://searchfox.org/mozilla-central/rev/39e4b25a076c59d2e7820297d62319f167871449/ipc/chromium/src/base/process_util_linux.cc#177-199
[2] http://searchfox.org/mozilla-central/rev/39e4b25a076c59d2e7820297d62319f167871449/memory/replace/dmd/DMD.cpp#453-456
I can't repro this on my machine (Ubuntu 16.04, clang 3.9.0). kats noted he's using |--enable-webrender| so it may be specific to that. I tested with both live mode and dark matter mode.

kats, a couple of questions:
#1 - Does this repro w/o |--enable-webrender|?
#2 - Can you look at other threads when the deadlock happens? I'm interested in any other thread with a |bt| that runs through DMD.cpp
Flags: needinfo?(bugmail)
Attached file Console session
(In reply to Eric Rahm [:erahm] from comment #2)
> kats, a couple of questions:
> #1 - Does this repro w/o |--enable-webrender|?

I'll check this tomorrow (leaving needinfo on me for this).

> #2 - Can you look at other threads when the deadlock happens? I'm interested
> in any other thread with a |bt| that runs through DMD.cpp

With my current setup, when this happens, there are no other threads. At least no that gdb can find. See attached console session.
I guess this behaviour might be explained if we fork() while a different thread is holding the DMD lock. The child of the fork will basically inherit the locked object and it will never be unlocked, per [1]. But I would expect the child process stack to look different, not with JS and gtk stuff on the stack. I don't know what that's about.

[1] http://stackoverflow.com/a/1235578
I'm doing a DMD build without webrender now, but I did some more debugging on the webrender-enabled build. I ran the mochitest command with --debugger=gdb and set a breakpoint on fork. The first fork call was at [1]. The second one came from inside [2]. The third one, though, had the same stack as in the previous attachments. Therefore we can conclude that fork call is the one that spins up the deadlocking child process. I also did a `thread apply all bt` to see where the other threads were at this (third) fork call - see attachment. I didn't catch any of them inside the DMD code on that instance but there were 26 threads so it's not surprising that sometimes one of them might be in the middle of an allocation and holding on to the DMD lock when this fork runs.

[1] http://searchfox.org/mozilla-central/rev/b1044cf7c2000c3e75e8181e893236a940c8b6d2/toolkit/xre/glxtest.cpp#313
[2] http://searchfox.org/mozilla-central/rev/b1044cf7c2000c3e75e8181e893236a940c8b6d2/widget/gtk/WakeLockListener.cpp#315
And yes, I can repro without webrender. Mozconfig at [1], exact command I'm running is
  DMD="--mode=live" xr ./mach mochitest --subsuite webgl --dmd

(where "xr" is [2])

[1] https://github.com/staktrace/moz-scripts/blob/master/mozconfig.Linux-host-ndebug-dmd
[2] https://github.com/staktrace/moz-scripts/blob/master/xr
Flags: needinfo?(bugmail)
So, in summary: when firefox starts up, gtk does a fork/exec to start ibus (which snorp says is some sort of input messaging thing). The code to do the fork/exec can be seen in [1]. After the fork, but before the exec, there are various codepaths, some of which can trigger a call to opendir(). That's what we see happening in the child process stack. The opendir() call calls malloc, which goes into DMD and tries to acquire the lock. This is all after the fork, so the DMD lock inherits whatever state it had at the time of the fork in the parent. But it's before the exec, so before the state is cleared.

[1] https://github.com/GNOME/glib/blob/master/glib/gspawn.c
Per glandium's excellent in-code comment it looks like we work around the same issue in |LogAlloc| [1] by acquiring the state lock prior to forking and releasing after forking. I'm just going to copy that and see if it works.

[1] http://searchfox.org/mozilla-central/rev/60ae6514e4c559c0c234f0e7aefccb101b8beb2e/memory/replace/logalloc/LogAlloc.cpp#82-111
In order to avoid the possibility of a deadlock if the DMD state lock is
currently acquired when forking a |pthread_atfork| hook is added to wait for
and acquire the lock prior to forking, then release it after forking.
Attachment #8840625 - Flags: review?(n.nethercote)
Assignee: nobody → erahm
Status: NEW → ASSIGNED
kats would you mind testing attachment 8840625 [details] [diff] [review]?
Flags: needinfo?(bugmail)
This seems to fix it, thanks!
Flags: needinfo?(bugmail)
Comment on attachment 8840625 [details] [diff] [review]
Avoid deadlocks in DMD when forking

Review of attachment 8840625 [details] [diff] [review]:
-----------------------------------------------------------------

Thank you for the fix.
Attachment #8840625 - Flags: review?(n.nethercote) → review+
https://hg.mozilla.org/mozilla-central/rev/8d35b6c817c2
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
See Also: → 1342980
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: