Closed Bug 1279514 Opened 8 years ago Closed 8 years ago

Crash in IPCError-browser | (msgtype=0xEC0003,name=PTCPSocket::Msg_Data) Processing error: message was deserialized, but the han

Categories

(Core :: WebRTC: Networking, defect, P1)

x86
Windows 7
defect

Tracking

()

RESOLVED FIXED
Tracking Status
e10s + ---
firefox48 --- affected
firefox50 --- affected

People

(Reporter: jimm, Assigned: drno)

References

Details

(Keywords: crash)

Crash Data

https://crash-stats.mozilla.com/report/index/e38e9793-07e3-40b1-a401-eacd72160609

Pretty high up content process crasher in beta 48 build 1.
Blocks: e10s-crashes
tracking-e10s: --- → ?
I don't understand the stacks in any of the reports I looked at. The TCPSocket data message is a message sent from child to parent; all of the stacks show content processes doing very non-TCPSocket things.
https://crash-stats.mozilla.com/report/index/74ab1a0b-da10-44f6-b673-93c3d2160610#allthreads does actually show thread 11 doing TCP-related stuff, which suggests that e10s webrtc is triggering these crashes. I still can't sort out where the IPC error gets reported though, since deserialization should be happening on the parent side (barring bug 1268900 which hasn't landed yet).
This one shows dispatching of a message:
https://crash-stats.mozilla.com/report/index/1674fb96-28c7-4ede-8a50-6462c2160505#allthreads

it is coming from media/mtransport

I will move it to media/.. so that someone from there can take a look and tell us more about what is happening.
Component: Networking → WebRTC: Networking
(In reply to Josh Matthews [:jdm] from comment #3)
> https://crash-stats.mozilla.com/report/index/74ab1a0b-da10-44f6-b673-
> 93c3d2160610#allthreads does actually show thread 11 doing TCP-related
> stuff, which suggests that e10s webrtc is triggering these crashes.

So thread 11 here is apparently trying to write to a TCP connection to the configured HTTP proxy. WebRTC uses (so far) TCP connections to talk to media relays (TURN servers), and if an HTTP proxy is configured it will try to talk to such a media relay through the HTTP proxy.
(In reply to Josh Matthews [:jdm] from comment #3)
> https://crash-stats.mozilla.com/report/index/74ab1a0b-da10-44f6-b673-
> 93c3d2160610#allthreads does actually show thread 11 doing TCP-related
> stuff, which suggests that e10s webrtc is triggering these crashes.

On a second look: yes, the WebRTC HTTP proxy tunnel code just got its callback that the e10s TCP socket to the proxy connected. But it is only trying to write its initial log messages into our internal ICE ring log buffer; it has not sent any data yet.
And the crashing thread 0 in this case does not show anything related to IPC TCP, but some JS garbage collection stuff. So I'm wondering what makes crash-stats believe this particular crash is related to this bug here.
In this case: https://crash-stats.mozilla.com/report/index/627703d2-9aa3-4eb4-bf75-6aba42160618#allthreads
Thread 11 actually tries to log an error message about a failure to write to a TCP connection (this could be a direct connection to the media relay or one through an HTTP proxy, I think).
Also in this case there was clearly a WebRTC call going on with audio and video decoding happening in multiple threads.
After looking at a couple of the crashes it looks like quite a lot of them show the following:
- one thread in webrtc::AudioDeviceWindowsCore::DoCaptureThread()
- another thread doing webrtc::AudioDeviceWindowsCore::DoGetCaptureVolumeThread()
- and a third thread doing webrtc::AudioDeviceWindowsCore::DoSetCaptureVolumeThread()
Example: in https://crash-stats.mozilla.com/report/index/0e342207-2893-40e4-85f3-b13f22160616#allthreads threads 36, 37 and 38 appear to be blocked in the above functions.

@padenot: is it normal/expected that one thread captures, while another appears to change the volume and yet another tries to read the volume?
Flags: needinfo?(padenot)
Rank: 15
Priority: -- → P1
It's weird code, but it's what is intended by the webrtc.org developers. This happens all the time when capturing using the webrtc.org code on Windows. This is going away soon: it's been replaced by better code as part of the full duplex project.
Flags: needinfo?(padenot)
See Also: → 1275216
If Socorro is correctly classifying crashes in the same category (I have no idea how reliable that Socorro feature is), then we appear to have at least one user report, in bug 1275216, claiming that the combination of TCP+TURN+HTTP-Proxy causes this crash.
Assignee: nobody → drno
Rank: 15 → 11
So the answer is: this is the TCP filtering code from bug 1244926, which does its job and prevents any connection where the initial packets are not ICE/STUN.

In other words: having to do an HTTP CONNECT first before exchanging STUN packets was overlooked when we designed the TCP packet filter.
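
To illustrate why an HTTP CONNECT trips a STUN-first filter, here is a minimal sketch in Python. It is not the filter from bug 1244926; the function names and the example host are made up for illustration. It only assumes the STUN header layout from RFC 5389 (two leading zero bits, a 16-bit length that is a multiple of 4, and the magic cookie 0x2112A442).

```python
import struct

STUN_MAGIC_COOKIE = 0x2112A442  # fixed value defined in RFC 5389
STUN_HEADER_LEN = 20            # type (2) + length (2) + cookie (4) + transaction id (12)


def looks_like_stun(packet: bytes) -> bool:
    """Return True if the buffer starts with a plausible STUN header."""
    if len(packet) < STUN_HEADER_LEN:
        return False
    msg_type, msg_len, cookie = struct.unpack_from("!HHI", packet, 0)
    # The two most significant bits of a STUN message are always zero.
    if msg_type & 0xC000:
        return False
    # STUN attributes are padded to 4 bytes, so the length is a multiple of 4.
    if msg_len % 4:
        return False
    return cookie == STUN_MAGIC_COOKIE


def first_packet_allowed(packet: bytes) -> bool:
    """Hypothetical STUN-first policy: only STUN may open a connection."""
    return looks_like_stun(packet)


# A bare STUN Binding Request header with an all-zero transaction id.
binding_request = struct.pack("!HHI", 0x0001, 0, STUN_MAGIC_COOKIE) + bytes(12)

# What a proxy tunnel would send first (hypothetical relay host).
connect_request = (
    b"CONNECT turn.example.org:443 HTTP/1.1\r\n"
    b"Host: turn.example.org:443\r\n\r\n"
)
```

Under these assumptions the Binding Request passes, while the CONNECT request is rejected on the very first check, because its leading bytes "CO" do not have the two top bits cleared.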
Depends on: 1244926
Two options come to mind:

A) Based on the proxy settings of FF, activate a different filter which enforces that the first outgoing message is an HTTP CONNECT. The problem is that unless we also enforce that the destination is the one from the FF configuration, someone could simply connect to any open proxy on the Internet and do whatever they want. And enforcing the destination is tricky, because DNS resolution can give us different results.

B) If I'm not mistaken the Necko HTTP client actually lives in the parent process. So if we remove our HTTP proxy code and use the Necko client instead we could hopefully still enforce the filtering of initial STUN packets, once the Necko HTTP client got a success response from the HTTP proxy.
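
As a rough sketch of the check option A would need, the Python snippet below recognizes an HTTP CONNECT request line in the first outgoing message and extracts its target. The function name and the example host are hypothetical and this is not existing Firefox code. It deliberately does not address the harder part described above, namely verifying that the TCP connection itself goes to the configured proxy.

```python
from typing import Optional, Tuple


def parse_connect_target(first_message: bytes) -> Optional[Tuple[str, int]]:
    """Return (host, port) if the buffer starts with an HTTP CONNECT request line."""
    line, sep, _ = first_message.partition(b"\r\n")
    if not sep:
        return None  # request line is not complete yet
    parts = line.split(b" ")
    if len(parts) != 3:
        return None
    method, target, version = parts
    if method != b"CONNECT" or not version.startswith(b"HTTP/1."):
        return None
    # CONNECT uses the authority form "host:port" as its request target.
    host, colon, port = target.rpartition(b":")
    if not colon or not host or not port.isdigit():
        return None
    return host.decode("ascii", "replace"), int(port.decode("ascii"))
```

A filter built on this would let `CONNECT turn.example.org:443 HTTP/1.1` through as the first message and reject anything else, for example a plain `GET`.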

In either case we probably need to disable the TCP filtering until we have a working solution, so that e10s clients do not crash once e10s gets released :-(
Depends on: 1285318
I'll leave this open with NI to check back on crash stats in a couple of weeks to confirm that we no longer see this crash from 48.0b7 onward.
Flags: needinfo?(drno)
As expected I don't see any crashes from 48.0b7 on forward.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(drno)
Resolution: --- → FIXED
Crash volume for signature 'IPCError-browser | (msgtype=0xEC0003,name=PTCPSocket::Msg_Data) Processing error: message was deserialized, but the han':
 - nightly (version 50): 0 crashes from 2016-06-06.
 - aurora  (version 49): 0 crashes from 2016-06-07.
 - beta    (version 48): 715 crashes from 2016-06-06.
 - release (version 47): 0 crashes from 2016-05-31.
 - esr     (version 45): 0 crashes from 2016-04-07.

Crash volume on the last weeks:
            W. N-1  W. N-2  W. N-3  W. N-4  W. N-5  W. N-6  W. N-7
 - nightly       0       0       0       0       0       0       0
 - aurora        0       0       0       0       0       0       0
 - beta         26      43     103     120     122      99     126
 - release       0       0       0       0       0       0       0
 - esr           0       0       0       0       0       0       0

Affected platforms: Windows, Mac OS X, Linux