Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failure on some Android 11 devices when switching from secure decoder to a different decoder #8696

Closed
itsjamie opened this issue Mar 9, 2021 · 14 comments
Assignees

Comments

@itsjamie
Copy link

itsjamie commented Mar 9, 2021

Reproduction Steps

Set SimpleExoPlayer to play a MediaItem configured to play using Widevine, with M3U8 mimetype.
Email coming with replacements for <snip> values.

Exoplayer version: 2.13.2
Android Version: 11

Devices

  • Samsung S20 FE 5G (Qualcomm and Exynos chipsets both repro)
  • Samsung Notes - (tablet)
  • Samsung Galaxy Note - (phone)
  • Samsung Galaxy Note Ultra OS

Basic Reproduction Case:

val player = SimpleExoPlayer.Builder(applicationContext).build()
val mediaItem = MediaItem.Builder()
            .setUri(<snip>)
            .setMimeType(MimeTypes.APPLICATION_M3U8)
            .setDrmUuid(C.WIDEVINE_UUID)
            .setDrmLicenseUri(<snip>)
            .setDrmLicenseRequestHeaders(mutableMapOf(
                <snip>
            ))
            .build()

player.addMediaItem(mediaItem)
player.prepare()
player.play()

The content piece is a 45-second snippet of a live playlist setup as VOD.
It is 30 seconds of the content which is Widevine CTR encrypted content with a 15-second ad. On the switch to the ad, there is a discontinuity and switch to unencrypted content.

On the failure case:

I/ACodec: [OMX.Exynos.avc.dec] Now Loaded
D/SurfaceUtils: connecting to surface 0x71e621ca70, reason connectToSurface
E/SurfaceUtils: Failed to connect to surface 0x71e621ca70, err -22
E/MediaCodec: nativeWindowConnect returned an error: Invalid argument (-22)

When it does not fail, there is no failure to connect the surface with the code -22.

This does not occur on all Android 11 devices. And even on the Samsung devices we have seen it occur on, not all of them have a 100% reproduction rate. On a device with a high reproduction rate, we've also discovered that setting a logging breakpoint in SynchronousMediaCodecAdapter on the codec.configure line eliminates the issue.

Additionally, calling setDrmSessionForClearPeriods(true) on the MediaItem does resolve the error. But raises the question of what other issues this will potentially cause.

Content to follow in email, along with complete Logcat output from both successful and failure cases.

@itsjamie
Copy link
Author

itsjamie commented Mar 9, 2021

Email with test content has been sent.

@icbaker icbaker self-assigned this Mar 10, 2021
@icbaker
Copy link
Collaborator

icbaker commented Mar 10, 2021

To be honest this doesn't sound like an ExoPlayer issue - it seems more likely related to the device's implementation of secure/insecure video decoders.

Additionally, calling setDrmSessionForClearPeriods(true) on the MediaItem does resolve the error. But raises the question of what other issues this will potentially cause.

This setting was introduced for exactly this use-case - of clear ads in encrypted content leading to decoder switches, which resulted in a black screen being shown for a short time on some devices during the switch between ads & content (this also mostly affects Samsung devices iirc).

The only negative impact of using this setting I can think of is that it will use an additional DRM session on the device (which are a limited device-wide resource), but it's quite unlikely that you will end up constrained by session count.

If that setting resolves the issue for you then from ExoPlayer's perspective that's working-as-intended as a workaround for a device-specific issue. For the underlying issue itself I think you should raise the issue directly with the device OEMs.

@itsjamie
Copy link
Author

itsjamie commented Mar 10, 2021

@icbaker while it does present as a device-specific issue, isn't this highlighting a race condition in Exoplayer's access of system resources that happens to work now with the majority of vendors, but due to some misalignment in the code below, is on the flip side of the condition for Samsung?

I posit this because of the fact that adding the logging breakpoint in Android Studio resolves the issue, and the setup of the unsecured decoder session works as intended and doesn't throw an IllegalArgumentException when calling into the native_configure method.

From my own research yesterday, the error code of -22 seems to indicate that there are attempting to be two producers hooked up to the same Surface. Seeming to indicate a race between the release of the original secure session, and then the configure and attachment of the unsecured context.

I'll see if there is potentially a way we can get this in front of Samsung to cover the bases. It does feel like they changed something in the native MediaCodec timing of release and reattachment given the fact it is only impacting their devices.

@icbaker
Copy link
Collaborator

icbaker commented Mar 10, 2021

I understand how fixing it by setting a breakpoint points towards a race condition somewhere. I agree that indicates a timing-related issue, probably caused by the interaction of multiple threads.

I'm not sure how that could be in ExoPlayer though, because our control flow ensures that we release a decoder before acquiring the next one, and we do these operations from the same thread.

It seems plausible there is a race condition, but it's in Samsung's implementation of MediaCodec or similar APIs. I guess that means MediaCodec#release() is allowed to return before it's completely cleaned up and disconnected from a surface?

@fpitters
Copy link

We are seeing the same (or quite close) on One Plus 8T (11). On our case setDrmSessionForClearTypes doesn't seem to help.

Combining a widevine stream with some of the IMA tags:

      {
        "name": "Test with clear periods",
        "uri": "https://storage.googleapis.com/wvmedia/cenc/h264/tears/tears.mpd",
        "drm_scheme": "widevine",
        "drm_license_uri": "https://proxy.uat.widevine.com/proxy?provider=widevine_test",
        "drm_session_for_clear_content": true,
        "ad_tag_uri": "https://pubads.g.doubleclick.net/gampad/ads?sz=640x480&iu=/124319096/external/ad_rule_samples&ciu_szs=300x250&ad_rule=1&impl=s&gdfp_req=1&env=vp&output=vmap&unviewed_position_start=1&cust_params=deployment%3Ddevsite%26sample_ar%3Dpremidpost&cmsid=496&vid=short_onecue&correlator="
      },
      {
        "name": "Test without clear periods",
        "uri": "https://storage.googleapis.com/wvmedia/cenc/h264/tears/tears.mpd",
        "drm_scheme": "widevine",
        "drm_license_uri": "https://proxy.uat.widevine.com/proxy?provider=widevine_test",
        "ad_tag_uri": "https://pubads.g.doubleclick.net/gampad/ads?sz=640x480&iu=/124319096/external/ad_rule_samples&ciu_szs=300x250&ad_rule=1&impl=s&gdfp_req=1&env=vp&output=vmap&unviewed_position_start=1&cust_params=deployment%3Ddevsite%26sample_ar%3Dpremidpost&cmsid=496&vid=short_onecue&correlator="
      },

We get on both cases the same SurfaceUtils, MediaCodec error.
When using drm_session_for_clear_content it fails right after the preroll, and when not using the flag it failst when starting the first midroll.

I haven't been able to reproduce on any emulator or a Xiaomi Note 5 (Android 9).
I'll share the bug report by email in case it helps.

@icbaker
Copy link
Collaborator

icbaker commented Mar 11, 2021

Thanks for the report @fpitters - especially calling out that drm_session_for_clear_content doesn't resolve the issue.

In the logcat of your bug report it seems the secure decoder is used for the pre-roll (expected) then released at the end of the pre-roll (unexpected) and it's the subsequent reacquire/reconfigure to play the content that throws the error you're seeing:

// Acquire at the start of the pre-roll
D EventLogger: videoDecoderInitialized [eventTime=2.05, mediaPos=0.00, window=0, period=0, adGroup=0, ad=0, OMX.qcom.video.decoder.avc.secure]
// Release at the start of the content (end of the pre-roll)
D EventLogger: videoDecoderReleased [eventTime=12.14, mediaPos=0.00, window=0, period=0, OMX.qcom.video.decoder.avc.secure]
// Then the error is thrown

I tried reproducing this on a Pixel 3a XL and although I don't see any crashes I do see the same pattern of acquires/releases at the transition from pre-roll to content (but not from content to mid-roll or mid-roll to content) (this is with drm_session_for_clear_content: true:

// Pre-roll starts
D/EventLogger: videoDecoderInitialized [eventTime=1.35, mediaPos=0.00, window=0, period=0, adGroup=0, ad=0, c2.qti.avc.decoder.secure]
// Pre-roll ends
D/EventLogger: videoDecoderReleased [eventTime=11.76, mediaPos=0.00, window=0, period=0, c2.qti.avc.decoder.secure]
D/EventLogger: videoDecoderInitialized [eventTime=11.89, mediaPos=0.00, window=0, period=0, c2.qti.avc.decoder.secure]
// Content plays, then mid-roll, then content with no decoder changes (expected)

So I think that's an ExoPlayer bug - we should just keep the same decoder throughout.

As mentioned in the reply above the crashes seem like they might be an artefact of quickly (?) switching the same surface from one decoder to another - and it seems likely that part is device specific. I'll try and get hold of the devices you've both reported to see if I can reproduce the crashes.

@icbaker
Copy link
Collaborator

icbaker commented Mar 11, 2021

Ah I think I've found the cause of the 'spurious' codec re-initialization when switching from pre-roll to content.

The pre-roll resolution and inputSize is smaller than the content. Which means we need to re-initialize the codec, and it's triggered by this logic here (we enter both ifs):

@DecoderDiscardReasons int discardReasons = evaluation.discardReasons;
if (newFormat.width > codecMaxValues.width || newFormat.height > codecMaxValues.height) {
discardReasons |= DISCARD_REASON_VIDEO_MAX_RESOLUTION_EXCEEDED;
}
if (getMaxInputSize(codecInfo, newFormat) > codecMaxValues.inputSize) {
discardReasons |= DISCARD_REASON_MAX_INPUT_SIZE_EXCEEDED;
}

ExoPlayer doesn't know the content resolution in advance, so it starts by configuring the codec only large enough to play the ad - and then needs to reconfigure. This explains why we don't see any re-initialization at mid-roll, because the codec that was large enough to play the content can also play the ad and then be re-used.

So I think I retract my statement above that this part is an ExoPlayer bug - I'm not sure we can do any better than trigger a decoder re-initialization at this transition.

@itsjamie
Copy link
Author

Yeah, for us, we no longer saw the codec reinitialization, because our resolutions matched, and the format profile was within the bandwidth max size limitations. Even though there was a profile switch between the clear and secure content.

So, to that point though, this is one of those gotcha's I was wondering about and will pass along to the ad operations folks. Very important for these devices then that we keep the renditions in lockstep :)

Still looking for the best place to report/provide the reproduction to Samsung.

@icbaker icbaker changed the title Failure on Samsung devices: Secure -> Non-secure content switch (HLS) Failure on Samsung and OnePlus devices: Secure -> Non-secure content switch (HLS) Jun 23, 2021
@icbaker
Copy link
Collaborator

icbaker commented Jun 23, 2021

I got hold of a OnePlus AC2003 because it was reported as repro-ing this problem reliably in #9095.

On Android 10 I saw no problems, I switched from clear -> secure -> clear decoders every second for several minutes with no problems.

I updated the device to Android 11 and immediately saw the problem (on the first secure -> clear transition).

@icbaker
Copy link
Collaborator

icbaker commented Jun 24, 2021

This is also tracked by [internal b/191966399].

@icbaker icbaker changed the title Failure on Samsung and OnePlus devices: Secure -> Non-secure content switch (HLS) Failure on some Android 11 devices when switching from secure decoder to a different decoder Aug 4, 2021
@icbaker
Copy link
Collaborator

icbaker commented Aug 9, 2021

Adding some more info from some brief investigations, and responding to @cdongieux's request for additional workaround suggestions in #9250 (comment):

I experimented with a couple of hacks in MediaCodecVideoRenderer and found they both resolve this issue. We'd like to understand exactly what's going wrong inside the framework before submitting any workaround like this to the library, but you're free to make the changes to a local copy of the library (or subclass MediaCodecVideoRenderer). In both cases I overrode MediaCodecRenderer#releaseCodec in MediaCodecVideoRenderer:

  1. Based on @itsjamie's observations above that it's a timing issue I tried just adding naive Thread.sleep() calls after MediaCodec#release() and before we try and set the surface onto the new codec. In my overriding implementation of MediaCodecVideoRenderer#releaseCodec() I called super.releaseCodec() and then Thread.sleep(60). For the content I was using I found that 60ms was long enough, but 45ms wasn't. Obviously this is really hacky, and hard-coding sleep times like this isn't really a reliable or production-ready approach. I include it here mainly for completeness.
  2. Instead of sleeping after the release, I tried setting a dummy surface before the release. MediaCodecVideoRenderer already has a dummySurface field we can use (it might be null and need initializing). Calling getCodec().setOutputSurface(dummySurface) before super.releaseCodec() resolved the issue for me.

@stevemayhew
Copy link
Contributor

@icbaker We have seen something similar to this issue on both Broadcom and AmLogic streamer platforms. Happy to help if we can. For us the issue happens when moving from trick-play tracks (which, in this case, are not encrypted) to regular widevine encrypted tracks, the screen flashes to black for a frame before resuming playback. The hack we added that "fixes" it is to comment out the reset in onDisabled:

@Override
  protected void onDisabled() {
     inputFormat = null;
     if (sourceDrmSession != null || codecDrmSession != null) {
       // TODO: Do something better with this case.
       //onReset();
       flushOrReleaseCodec();

I think @ojw28 was looking at a way to be less drastic in shutting down the CODEC on track changes, but not sure where that work is.

@icbaker
Copy link
Collaborator

icbaker commented Sep 13, 2021

@stevemayhew Just to clarify: Do you just see a 'black flash' or a complete playback failure?

I agree the 'black flash' issue is similar, but this issue is specifically tracking a playback failure when switching a surface between codecs. The 'black flash' is tracked by #8842.

The hack we added that "fixes" it is to comment out the reset in onDisabled:

@Override
  protected void onDisabled() {
     inputFormat = null;
     if (sourceDrmSession != null || codecDrmSession != null) {
       // TODO: Do something better with this case.
       //onReset();
       flushOrReleaseCodec();

I think @ojw28 was looking at a way to be less drastic in shutting down the CODEC on track changes, but not sure where that work is.

We ended up basically doing the same in 2ab0e1d which is included in 2.14.1. Note that if you want to cherry-pick that change you'll want to take 037f24a too (which is needed for complicated reasons related to ExoPlayer#setForegroundMode).

icbaker added a commit to androidx/media that referenced this issue Dec 7, 2021
It's been observed that some devices fail when releasing a secure codec
attached to a surface and immediately trying to create a new codec
(secure or insecure) attached to the same surface. This change catches
all exceptions thrown during codec creation, sleeps for a short time,
and then retries the codec creation. This is observed to fix the problem
(we believe this is because it allows enough time for some background
part of the previous codec release operation to complete).

This change should have no effect on the control flow when codec
creation succeeds first time. It will introduce a slight delay when
creating the preferred codec fails (while we sleep and retry), which
will either delay propagating a permanent error or attempting to
initialize a fallback decoder. We can't avoid the extra delay to
instantiating the fallback decoder because we can't know whether we
expect the second attempt to create the preferred decoder to succeed or
fail. The benefit to always retrying the preferred decoder creation
(fixing playback failures) outweighs the unfortunate additional delay
to instantiating fallback decoders.

Issue: google/ExoPlayer#8696
#minor-release
PiperOrigin-RevId: 414671743
icbaker added a commit that referenced this issue Dec 8, 2021
It's been observed that some devices fail when releasing a secure codec
attached to a surface and immediately trying to create a new codec
(secure or insecure) attached to the same surface. This change catches
all exceptions thrown during codec creation, sleeps for a short time,
and then retries the codec creation. This is observed to fix the problem
(we believe this is because it allows enough time for some background
part of the previous codec release operation to complete).

This change should have no effect on the control flow when codec
creation succeeds first time. It will introduce a slight delay when
creating the preferred codec fails (while we sleep and retry), which
will either delay propagating a permanent error or attempting to
initialize a fallback decoder. We can't avoid the extra delay to
instantiating the fallback decoder because we can't know whether we
expect the second attempt to create the preferred decoder to succeed or
fail. The benefit to always retrying the preferred decoder creation
(fixing playback failures) outweighs the unfortunate additional delay
to instantiating fallback decoders.

Issue: #8696
#minor-release
PiperOrigin-RevId: 414671743
@icbaker
Copy link
Collaborator

icbaker commented Dec 8, 2021

The commit above works around this issue by retrying a failed decoder creation after a short delay.

@icbaker icbaker closed this as completed Dec 8, 2021
fpitters pushed a commit to fpitters/ExoPlayer that referenced this issue Jan 12, 2022
It's been observed that some devices fail when releasing a secure codec
attached to a surface and immediately trying to create a new codec
(secure or insecure) attached to the same surface. This change catches
all exceptions thrown during codec creation, sleeps for a short time,
and then retries the codec creation. This is observed to fix the problem
(we believe this is because it allows enough time for some background
part of the previous codec release operation to complete).

This change should have no effect on the control flow when codec
creation succeeds first time. It will introduce a slight delay when
creating the preferred codec fails (while we sleep and retry), which
will either delay propagating a permanent error or attempting to
initialize a fallback decoder. We can't avoid the extra delay to
instantiating the fallback decoder because we can't know whether we
expect the second attempt to create the preferred decoder to succeed or
fail. The benefit to always retrying the preferred decoder creation
(fixing playback failures) outweighs the unfortunate additional delay
to instantiating fallback decoders.

Issue: google#8696
PiperOrigin-RevId: 414671743
@google google locked and limited conversation to collaborators Feb 7, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants