I tried upgrading to Apache Spark 1.6.0 RC3. My application now sees these errors on almost every task:
Managed memory leak detected; size = 15735058 bytes, TID = 830
I set the logging level for org.apache.spark.memory.TaskMemoryManager to DEBUG.
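For anyone who wants to reproduce this: one way to do it (assuming the stock Log4j 1.x setup that Spark 1.6 ships with; adjust the path for your deployment) is a single line in conf/log4j.properties:

    log4j.logger.org.apache.spark.memory.TaskMemoryManager=DEBUG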
With that in place, this is what I see in the logs:
I2015-12-18 16:54:41,125 TaskSetManager: Starting task 0.0 in stage 7.0 (TID 6, localhost, partition 0,NODE_LOCAL, 3026 bytes)
I2015-12-18 16:54:41,125 Executor: Running task 0.0 in stage 7.0 (TID 6)
I2015-12-18 16:54:41,130 ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
I2015-12-18 16:54:41,130 ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
D2015-12-18 16:54:41,188 TaskMemoryManager: Task 6 acquire 5.0 MB for null
I2015-12-18 16:54:41,199 ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
I2015-12-18 16:54:41,199 ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
D2015-12-18 16:54:41,262 TaskMemoryManager: Task 6 acquire 5.0 MB for null
D2015-12-18 16:54:41,397 TaskMemoryManager: Task 6 release 5.0 MB from null
E2015-12-18 16:54:41,398 Executor: Managed memory leak detected; size = 5245464 bytes, TID = 6
How do I debug errors like these? Is there a way to log a stack trace for each allocation and deallocation, so I can find what is leaking?
I don't know much about the new unified memory manager (SPARK-10000). Is the leak likely my fault, or is it likely a Spark bug?
The short answer is that users are not supposed to see this message. Users are not supposed to be able to create memory leaks in the unified memory manager.
When such a leak happens, it is a Spark bug: SPARK-11293
But if you want to understand the cause of a memory leak, this is how I did it.
Download the Spark source code, make sure you can build it, and verify that your build works.
In TaskMemoryManager.java, add extra logging in acquireExecutionMemory and releaseExecutionMemory:

    logger.error("stack trace:", new Exception());
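That line builds a throwaway Exception purely to capture the current stack trace; nothing is thrown. Here is a minimal self-contained sketch of the technique (the class and the simplified method are mine for illustration, not Spark's actual code):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Demonstrates the trick: passing "new Exception()" to the logger records
    // the complete stack trace of whoever called this method, without throwing.
    public class StackTraceLogging {
      private static final Logger logger = LoggerFactory.getLogger(StackTraceLogging.class);

      // Simplified stand-in for TaskMemoryManager.acquireExecutionMemory.
      static long acquireExecutionMemory(long required) {
        logger.error("stack trace:", new Exception());
        return required; // the real method does the actual memory bookkeeping
      }

      public static void main(String[] args) {
        acquireExecutionMemory(5L * 1024 * 1024);
      }
    }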
Also change all the other debug logging in TaskMemoryManager.java to error logging. (That's easier than figuring out the logging configuration...)
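The DEBUG lines in the log excerpt above ("Task 6 acquire 5.0 MB for null") come from logger.debug calls of roughly this shape, and the swap is just debug to error. A self-contained illustration (class and variable names are mine, and the exact message text in Spark 1.6 may differ slightly):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Illustrates the debug -> error swap: the message format stays the same,
    // only the level changes, so the lines show up without any config changes.
    public class LevelSwapExample {
      private static final Logger logger = LoggerFactory.getLogger(LevelSwapExample.class);

      public static void main(String[] args) {
        long taskAttemptId = 6;
        String amount = "5.0 MB";
        Object consumer = null;
        // Before: suppressed unless DEBUG is enabled for this class.
        logger.debug("Task {} acquire {} for {}", taskAttemptId, amount, consumer);
        // After: visible under the default log configuration.
        logger.error("Task {} acquire {} for {}", taskAttemptId, amount, consumer);
      }
    }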
Now you will see the full stack trace for every allocation and deallocation. Try to match them up and find the allocations that have no matching deallocation. You then have the stack trace for the source of the leak.
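If the log gets long, a small throwaway tool can help narrow down which task to look at first. The sketch below is a hypothetical helper of my own (not part of Spark); it counts acquire/release lines of the shape shown in the excerpt above, per task. It's a rough heuristic rather than exact accounting, since one release can legitimately cover several acquires, but a large imbalance points at the task worth inspecting:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    // Tallies "Task <id> acquire"/"Task <id> release" lines from a captured
    // executor log, to spot tasks whose acquire/release counts don't balance.
    public class LeakTally {
      public static void main(String[] args) throws IOException {
        Map<String, Integer> balance = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
          if (line.contains("TaskMemoryManager: Task ")) {
            String[] parts = line.split("TaskMemoryManager: Task ", 2)[1].split(" ");
            String taskId = parts[0];
            if (parts[1].equals("acquire")) {
              balance.merge(taskId, 1, Integer::sum);
            } else if (parts[1].equals("release")) {
              balance.merge(taskId, -1, Integer::sum);
            }
          }
        }
        // A non-zero count suggests acquires that were never released.
        balance.forEach((task, count) -> {
          if (count != 0) System.out.println("Task " + task + ": unbalanced by " + count);
        });
      }
    }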