Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-21444

Fetch failure due to node reboot causes job failure

Log workAgile BoardRank to TopRank to BottomAttach filesAttach ScreenshotBulk Copy AttachmentsBulk Move AttachmentsVotersWatch issueWatchersCreate sub-taskConvert to sub-taskMoveLinkCloneLabelsUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete CommentsDelete
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.3.0
    • 2.3.0
    • Scheduler, Spark Core
    • None

    Description

      We started seeing this issue after merging the PR - https://github.com/apache/spark/pull/17955.

      This PR introduced a change to keep the map-output statuses only in the map output tracker and whenever there is a fetch failure, the scheduler tries to invalidate the map-output statuses by talking all the the block manager slave end point. However, in case the fetch failure is because node reboot, the block manager slave end point would not be reachable and so the driver fails the job as a result. See exception below -

      2017-07-15 05:20:50,255 WARN  storage.BlockManagerMaster (Logging.scala:logWarning(87)) - Failed to remove broadcast 10 with removeFromMaster = true - Connection reset by peer
      java.io.IOException: Connection reset by peer
              at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
              at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
              at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
              at sun.nio.ch.IOUtil.read(IOUtil.java:192)
              at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
              at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
              at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
              at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
              at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
              at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
              at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
              at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
              at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
              at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
              at java.lang.Thread.run(Thread.java:745)
      2017-07-15 05:20:50,275 ERROR scheduler.DAGSchedulerEventProcessLoop (Logging.scala:logError(91)) - DAGSchedulerEventProcessLoop failed; shutting down SparkContext
      org.apache.spark.SparkException: Exception thrown in awaitResult
              at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
              at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
              at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
              at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
              at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
              at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
              at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
              at org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:143)
              at org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:271)
              at org.apache.spark.broadcast.TorrentBroadcast.doDestroy(TorrentBroadcast.scala:165)
              at org.apache.spark.broadcast.Broadcast.destroy(Broadcast.scala:111)
              at org.apache.spark.broadcast.Broadcast.destroy(Broadcast.scala:98)
              at org.apache.spark.ShuffleStatus.invalidateSerializedMapOutputStatusCache(MapOutputTracker.scala:197)
              at org.apache.spark.ShuffleStatus.removeMapOutput(MapOutputTracker.scala:105)
              at org.apache.spark.MapOutputTrackerMaster.unregisterMapOutput(MapOutputTracker.scala:420)
              at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1324)
              at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1715)
              at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1673)
              at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1662)
              at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
      Caused by: java.io.IOException: Connection reset by peer
              at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
              at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
              at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
              at sun.nio.ch.IOUtil.read(IOUtil.java:192)
              at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
              at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
              at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
              at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
              at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
              at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
              at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
              at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
              at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
              at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
              at java.lang.Thread.run(Thread.java:745)
      

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            joshrosen Josh Rosen Assign to me
            sitalkedia@gmail.com Sital Kedia
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment