Open Bug 1820832 Opened 1 year ago Updated 2 months ago

Crash in [@ nsIFrame::GetParent]

Categories

(Core :: Layout, defect)

Tracking

REOPENED
Tracking Status
firefox-esr102 --- affected
firefox110 --- wontfix
firefox111 --- wontfix
firefox112 --- affected

People

(Reporter: aryx, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: crash, Whiteboard: [no-nag])

[Tracking Requested - why for this release]:

This bug existed earlier but has become more frequent with Firefox 110: 494 crashes so far, compared to 124 for the whole Firefox 109 cycle. 25% are on Windows 10; many of the others are on Windows 8.1 and 7.

Crash report: https://crash-stats.mozilla.org/report/index/f0913143-3e67-4568-8ab3-97d6f0230307

Reason: EXCEPTION_ACCESS_VIOLATION_WRITE

Top 10 frames of crashing thread:

0  xul.dll  nsIFrame::GetParent const  layout/generic/nsIFrame.h:895
0  xul.dll  mozilla::ViewportUtils::IsZoomedContentRoot  layout/base/ViewportUtils.cpp:246
0  xul.dll  nsIFrame::GetTransformMatrix::<lambda_0>::operator const  layout/generic/nsIFrame.cpp:7417
0  xul.dll  nsIFrame::GetTransformMatrix const  layout/generic/nsIFrame.cpp:7423
1  xul.dll  nsLayoutUtils::GetTransformToAncestor  layout/base/nsLayoutUtils.cpp:2089
1  xul.dll  TransformGfxRectToAncestor  layout/base/nsLayoutUtils.cpp:2343
2  xul.dll  nsLayoutUtils::TransformFrameRectToAncestor  layout/base/nsLayoutUtils.cpp:2581
2  xul.dll  nsLayoutUtils::TransformFrameRectToAncestor  layout/base/nsLayoutUtils.h:888
2  xul.dll  BoxToRect::AddBox  layout/base/nsLayoutUtils.cpp:3692
3  xul.dll  nsLayoutUtils::GetAllInFlowBoxes  layout/base/nsLayoutUtils.cpp:3598

46% of the Firefox 110.0.1 crashes occur within the first 5 minutes, and 88% are on systems with Intel HD Graphics 5500.

Jeff, could you take a look at this signature which started to spike around March 1st?

Flags: needinfo?(jmuizelaar)

I don't have any guesses as to what this would be. The graphics drivers aren't loaded into the parent process, so it doesn't seem likely that they are the cause.

I guess it makes sense to keep this in Layout because that's where the code is crashing.

Blocks: gfx-triage
Flags: needinfo?(jmuizelaar)

92% of the crashes have CPU Info = family 6 model 61 stepping 4. So this seems like it could be a CPU bug.

And a high percentage of the crashes are on Windows 8.1. Perhaps there is a microcode update that only happens if you have Windows 10 or newer for this CPU?
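
For reference, "family 6 model 61 stepping 4" is the decoded form of the CPUID leaf 1 signature. The sketch below shows that decoding in Python. The function name and the sample value 0x306D4 are illustrative assumptions, not taken from the crash reports; the sample was picked only because it decodes to the fields quoted above.

```python
def decode_cpuid_signature(eax: int) -> tuple[int, int, int]:
    """Split a CPUID leaf 1 EAX value into (family, model, stepping)."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model is only used for base families 0x6 and 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping


# Hypothetical sample signature, used only to exercise the decoder.
print(decode_cpuid_signature(0x306D4))  # prints (6, 61, 4)
```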

The bug is marked as tracked for firefox111 (beta). We have limited time to fix this, the soft freeze is in a day. However, the bug still isn't assigned.

:fgriffith, could you please find an assignee for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit auto_nag documentation.

Flags: needinfo?(fgriffith)

Firefox 110.0.1 shipped at 100% on March 1 which aligns with the latest crash volume increase. There was already a crash frequency increase with 110.0 but to a lesser extent.

Flags: needinfo?(fgriffith) → needinfo?(emilio)

Emilio and I took a quick look. Will NI him to redirect.

Yeah, a lot of the crash reasons are also "impossible". E.g. nsIFrame::GetParent only reads memory, but we're crashing with a write-near-null error. Given Tim's observations in comment 3 and those, it seems this might not be actionable as a layout bug... Gabriele, is there any chance you could take a look and sanity-check us?

Flags: needinfo?(emilio) → needinfo?(gsvelto)

Let's first discount the crashes that are reads: the addresses are all over the place and several look like bit-flips, so we can probably chalk them up to flaky hardware.

As for the crashes that are writes, there's an interesting pattern:

  • All crashes are from a very specific version of Broadwell CPUs: family 6 model 61 stepping 4
  • All crashes are on machines running Windows 7 or Windows 8.1; this is important because Microsoft started shipping CPU microcode updates with Windows 10
  • The highest microcode version seen in those crashes for that CPU is 0x19, while the highest version available is 0x2f, which confirms these CPUs did not receive microcode updates
  • And finally the smoking gun: the crashing instruction is mov rcx, qword [r13 + 0x30], which is a read, not a write, so this crash is impossible

This is most definitely a crash caused by a CPU bug.
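
The first three conditions above could be turned into a simple filter over crash reports. What follows is only a minimal sketch in Python: the `CrashReport` class and its field names are hypothetical and do not correspond to the actual crash-stats schema. The constants are the values quoted in the list.

```python
from dataclasses import dataclass

# Values quoted in the list above.
AFFECTED_CPU = (6, 61, 4)      # family, model, stepping
LATEST_MICROCODE = 0x2F        # highest microcode version available
WINDOWS_10 = (10, 0)           # first Windows version shipping microcode updates


@dataclass
class CrashReport:
    """Hypothetical, simplified view of a crash report."""
    cpu_family: int
    cpu_model: int
    cpu_stepping: int
    microcode_version: int
    os_version: tuple[int, int]  # (major, minor), e.g. (6, 3) for Windows 8.1


def matches_cpu_bug_pattern(report: CrashReport) -> bool:
    """True if the report fits the CPU, OS and microcode conditions."""
    cpu = (report.cpu_family, report.cpu_model, report.cpu_stepping)
    return (
        cpu == AFFECTED_CPU
        and report.os_version < WINDOWS_10
        and report.microcode_version < LATEST_MICROCODE
    )


# Example: a Windows 8.1 machine with microcode 0x19 fits the pattern.
sample = CrashReport(6, 61, 4, 0x19, (6, 3))
print(matches_cpu_bug_pattern(sample))  # prints True
```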

CC'ing :afranchuk and :suhaib who are both working on different aspects of crash analysis. This is a very good example of a crash which we'd like to automatically identify as caused by hardware.

Flags: needinfo?(gsvelto)

As mentioned by :gsvelto, this crash was caused by a hardware bug - dropping the topcrash keyword.

:gsvelto, will such a case be classified as a hardware crash in the new information that will be available soon in crash reports? If so, the bot could then ignore such crashes.

Flags: needinfo?(gsvelto)
Keywords: topcrash
Flags: needinfo?(dholbert)

(In reply to Suhaib Mujahid [:suhaib] from comment #10)

:gsvelto, will such a case be classified as a hardware crash in the new information that will be available soon in crash reports? If so, the bot could then ignore such crashes.

Yes, that's the idea. Given that the crash reason and the crashing instruction can be proven to be incompatible, we should be able to catch it automatically in the stack walker.
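
A rough illustration of that kind of check, written in Python. This is not the stack walker's actual logic: the function names are hypothetical, and the sketch only understands two-operand `mov` in Intel syntax, where memory is written only when the destination (first) operand is a memory reference.

```python
def mov_writes_memory(instruction: str) -> bool:
    """Return True if an Intel-syntax `mov dst, src` stores to memory."""
    mnemonic, _, operands = instruction.strip().partition(" ")
    if mnemonic.lower() != "mov":
        raise ValueError("this sketch only handles plain mov instructions")
    dest, _, _src = operands.partition(",")
    # A bracketed destination operand means the instruction writes to memory.
    return "[" in dest


def reason_contradicts_instruction(reason: str, instruction: str) -> bool:
    """True if a write access violation is reported for a non-writing mov."""
    return reason.endswith("_WRITE") and not mov_writes_memory(instruction)


# The reason and instruction reported in this bug.
print(reason_contradicts_instruction(
    "EXCEPTION_ACCESS_VIOLATION_WRITE",
    "mov rcx, qword [r13 + 0x30]",
))  # prints True
```

A real implementation would need a proper disassembler, since many other instructions can read or write memory.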

Flags: needinfo?(gsvelto)

Silly bots.

Flags: needinfo?(dholbert)
Whiteboard: [no-nag]

We still have a handful of crashes in 111 that match all the conditions in comment 8. It's possible that changes in the build made the bug less likely to be triggered, but didn't remove it entirely.

No longer blocks: gfx-triage

Closing this out as WORKSFORME as comment 8 pretty definitively pins this on a CPU microcode bug.

Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → WORKSFORME

Let's keep this open as long as we're getting crash volume (but we can classify it as low-severity and not worry too much about it, given comment 8).

Also: fortunately the recent spike (20-70 crashes/day) in early March seems to have gone away; we're back down to single-digit crashes per day.

Severity: -- → S4
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
See Also: → 1881375
Blocks: cpu-bugs