spark io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s)

Posted by godster on Fri, 17 Dec 2021 02:33:23 +0100

1. Error log:

 WARN TaskSetManager: Lost task 29.0 in stage 22.0 (TID 1851, wn108-cdlcns.bjduloineequ3adfbkrpgi4p2c.shax.internal.chinacloudapp.cn, executor 25): FetchFailed(BlockManagerId(2, wn107-cdlcns.bjduloineequ3adfbkrpgi4p2c.shax.internal.chinacloudapp.cn, 7447, None), shuffleId=12, mapId=4, reduceId=29, message=
org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 byte(s) of direct memory (used: 5721030656, max: 5726797824)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:528)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:459)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:62)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:380)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1415)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 5721030656, max: 5726797824)
	at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:640)
	at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:594)
	at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:764)
	at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:740)
	at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244)
	at io.netty.buffer.PoolArena.allocate(PoolArena.java:226)
	at io.netty.buffer.PoolArena.allocate(PoolArena.java:146)
	at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:324)
	at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:185)
	at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:176)
	at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:137)
	at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:80)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:122)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
	... 1 more

)
21/06/08 13:32:08 WARN TaskSetManager: Lost task 43.0 in stage 22.0 (TID 1865, wn108-cdlcns.bjduloineequ3adfbkrpgi4p2c.shax.internal.chinacloudapp.cn, executor 26): FetchFailed(BlockManagerId(2, wn107-cdlcns.bjduloineequ3adfbkrpgi4p2c.shax.internal.chinacloudapp.cn, 7447, None), shuffleId=12, mapId=4, reduceId=43, message=
org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 byte(s) of direct memory (used: 5721030656, max: 5726797824)
	... (same stack trace as above)
)

2. Error analysis:

A shuffle consists of two phases: shuffle write and shuffle read.

Both phases transfer data over the network using the Netty framework, and the log above shows Netty running out of direct (off-heap) memory.
The number of partitions on the shuffle write side is determined by the number of partitions of the RDD in the previous stage, while the number of partitions on the shuffle read side is controlled by configuration parameters provided by Spark.

Shuffle write can be loosely understood as an operation similar to saveAsLocalDiskFile: it temporarily writes the intermediate results of the computation to the local disk of each executor according to certain rules.

During shuffle read, the number of partitions is controlled by Spark configuration parameters. If that value is set too low while the amount of shuffled data is large, each task has to process a very large amount of data. This can crash the executor JVM, which makes the shuffle data on that executor unfetchable and causes the executor to be marked as lost; that is when you see errors such as "Failed to connect to host". Even when the JVM does not crash, it can end up spending a very long time in GC.
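As a rough illustration of where these partition counts show up, here is a minimal sketch (assuming a running SparkSession named spark; the numbers are arbitrary). For a DataFrame, the reduce side of a wide operation such as groupBy defaults to spark.sql.shuffle.partitions:

```scala
import spark.implicits._

// A wide transformation forces a shuffle: each upstream partition writes
// shuffle files (shuffle write) and the downstream tasks fetch them
// (shuffle read).
val df = spark.range(0L, 1000000L).withColumn("key", $"id" % 100)
val grouped = df.groupBy("key").count()

// The number of reduce-side partitions comes from configuration,
// not from the input data (200 by default for Spark SQL).
println(grouped.rdd.getNumPartitions)
```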
 

3. Solution:

Once the cause is understood, the problem can be tackled mainly from two angles: reducing the amount of shuffled data and increasing the number of partitions that process it.

1. Reduce shuffle data

Consider whether a map-side join or a broadcast join can be used to avoid the shuffle altogether.

Filter unnecessary data before the shuffle. For example, if the original data has 20 fields, select only the fields that are actually needed; this reduces the amount of data that has to be shuffled.
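For example, a minimal sketch combining column pruning with a broadcast join (the DataFrames orders and users and their column names are only illustrative):

```scala
import org.apache.spark.sql.functions.broadcast

// Keep only the columns that are actually needed, so less data is shuffled.
val ordersSlim = orders.select("order_id", "user_id", "amount")

// If the dimension table is small enough to fit in memory, broadcast it:
// the join is then performed map-side and the large table is not shuffled.
val joined = ordersSlim.join(broadcast(users), Seq("user_id"))
```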

2. join, groupBy and other Spark SQL / DataFrame operations

The number of partitions is controlled by spark.sql.shuffle.partitions, which defaults to 200. Increase this value according to the amount of shuffled data and the complexity of the computation.
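A minimal sketch of raising this setting (salesDf and the value 800 are only illustrative):

```scala
// Raise the number of SQL/DataFrame shuffle partitions from the default 200.
spark.conf.set("spark.sql.shuffle.partitions", "800")

// Wide operations executed afterwards use the new setting.
val perStore = salesDf.groupBy("store_id").sum("amount")
```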

3. join, groupBy, reduceByKey and other RDD operations

The number of partitions used on the shuffle read and reduce side is controlled by spark.default.parallelism. By default it is the total number of cores running tasks (8 in Mesos fine-grained mode, and the number of local cores in local mode). The official recommendation is to set it to 2-3 times the number of cores running tasks.
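A minimal sketch of both options, assuming a pair RDD named pairRdd; the value 400 is only illustrative:

```scala
import org.apache.spark.SparkConf

// spark.default.parallelism must be set before the SparkContext is created,
// for example on the SparkConf (or via spark-submit --conf).
val conf = new SparkConf()
  .setAppName("shuffle-tuning")
  .set("spark.default.parallelism", "400")
// Pass this conf when creating the SparkContext / SparkSession.

// RDD shuffle operations also accept an explicit partition count directly.
val counts = pairRdd.reduceByKey(_ + _, 400)
```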

4. Increase the memory of the executor

Increase the executor memory appropriately via spark.executor.memory.
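A minimal sketch, assuming the application is configured programmatically (the 8g value is only illustrative; in practice this is usually passed at submit time with spark-submit --executor-memory):

```scala
import org.apache.spark.SparkConf

// Give each executor more memory so there is more headroom for shuffle fetches.
val conf = new SparkConf()
  .setAppName("bigger-executors")
  .set("spark.executor.memory", "8g")
```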

5. Is there data skew?

Have null keys been filtered out? Can abnormal data (a key with a very large amount of data) be handled separately? Consider changing the partitioning rules.
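A minimal sketch of the first two ideas, assuming a SparkSession named spark (eventsDf, user_id and the salt range of 10 are only illustrative):

```scala
import spark.implicits._
import org.apache.spark.sql.functions.{concat_ws, floor, rand}

// Null keys all hash to the same partition and can overload a single task,
// so drop them (or handle them separately) before the shuffle.
val cleaned = eventsDf.filter($"user_id".isNotNull)

// One way to change the partitioning rule for a hot key is to salt it,
// spreading a single heavy key across several partitions.
val salted = cleaned.withColumn(
  "salted_key",
  concat_ws("_", $"user_id".cast("string"), floor(rand() * 10).cast("string"))
)
```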
 

Topics: Scala Big Data Spark