I'm using Sparkling Water and reading data from Parquet files.
Part of my spark-defaults.conf:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max 1g
spark.driver.memory 40g
spark.executor.memory 40g
spark.driver.maxResultSize 0
spark.python.worker.memory 30g
spark.executor.extraJavaOptions -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution
spark.storage.safetyFraction 0.9
spark.storage.memoryFraction 0.0
15/11/26 11:44:46 WARN MemoryStore: Not enough space to cache rdd_7_472 in memory! (computed 3.2 MB so far)
15/11/26 11:44:46 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
15/11/26 11:44:46 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
15/11/26 11:44:46 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KB for computing block rdd_7_474 in memory.
15/11/26 11:44:46 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KB for computing block rdd_7_475 in memory.
In fact, Spark only uses a fraction of the memory it could use, and there are a lot of errors about allocating memory. Instead of using RAM, Spark starts writing data to the hard drive. Why is that? Should I perhaps change something in the conf file? And how can I change the directory that Java uses as "tmp"?
Thanks!
Answer 0 (score: 0)
Instead of using RAM, Spark starts writing data to the hard drive. Why does it do that?
This should be because your persistence setting is configured to use the MEMORY_AND_DISK option.
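As a minimal sketch of where that storage level comes in (assuming the Spark 2.x SparkSession API; the app name and Parquet path are placeholders, not taken from the question): with MEMORY_AND_DISK, partitions that do not fit in memory are written to local disk instead of being dropped and recomputed.

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Placeholder job: adjust the app name and the Parquet path to your own setup.
val spark = SparkSession.builder().appName("persist-sketch").getOrCreate()
val rdd = spark.read.parquet("/path/to/data").rdd

// MEMORY_AND_DISK keeps what fits in memory and spills the rest to local disk;
// MEMORY_ONLY would drop those blocks and recompute them on the next access.
rdd.persist(StorageLevel.MEMORY_AND_DISK)
rdd.count() // an action, so the persisted blocks are actually materialised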
From the documentation -> https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence
From the source code -> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala
private case class DeserializedMemoryEntry[T](
    value: Array[T],
    size: Long,
    classTag: ClassTag[T]) extends MemoryEntry[T] {
  val memoryMode: MemoryMode = MemoryMode.ON_HEAP
}
And this:
// Initial memory to request before unrolling any block
private val unrollMemoryThreshold: Long =
  conf.get(STORAGE_UNROLL_MEMORY_THRESHOLD)
Further down you will find this bit:
// Whether there is still enough memory for us to continue unrolling this block
var keepUnrolling = true
// Initial per-task memory to request for unrolling blocks (bytes).
val initialMemoryThreshold = unrollMemoryThreshold
// How often to check whether we need to request more memory
val memoryCheckPeriod = conf.get(UNROLL_MEMORY_CHECK_PERIOD)
// Memory currently reserved by this task for this particular unrolling operation
var memoryThreshold = initialMemoryThreshold
// Memory to request as a multiple of current vector size
val memoryGrowthFactor = conf.get(UNROLL_MEMORY_GROWTH_FACTOR)
// Keep track of unroll memory used by this particular block / putIterator() operation
var unrollMemoryUsedByThisBlock = 0L
And this is where the warning you are seeing comes from:
// Request enough memory to begin unrolling
keepUnrolling =
  reserveUnrollMemoryForThisTask(blockId, initialMemoryThreshold, memoryMode)

if (!keepUnrolling) {
  logWarning(s"Failed to reserve initial memory threshold of " +
    s"${Utils.bytesToString(initialMemoryThreshold)} for computing block $blockId in memory.")
} else {
  unrollMemoryUsedByThisBlock += initialMemoryThreshold
}
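As a side note, the 1024.0 KB in your warnings is simply the default of that initial threshold (the STORAGE_UNROLL_MEMORY_THRESHOLD entry read above; as far as I can tell its key is spark.storage.unrollMemoryThreshold, defaulting to 1 MB). A sketch of raising it, with the caveat that this only changes the initial reservation and does not create more storage memory:

import org.apache.spark.SparkConf

// Assumed key name, taken from the MemoryStore config shown above; version-dependent.
val conf = new SparkConf()
  .set("spark.storage.unrollMemoryThreshold", (4L * 1024 * 1024).toString) // 4 MB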
So, you can either enable OFF_HEAP at the application level, as done in this blog post -> https://www.waitingforcode.com/apache-spark/apache-spark-off-heap-memory/read, or you can tune your cluster/machine configuration and enable it there, following -> https://spark.apache.org/docs/latest/configuration.html#memory-management
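For the application-level route, a minimal sketch (the size is a placeholder; both properties are described on the memory-management page linked above):

import org.apache.spark.SparkConf

// Off-heap storage memory is disabled by default; the two settings go together.
val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "8g") // placeholder, size it for your executors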
Finally, if none of the above helps: in my case, restarting the nodes got rid of the warnings.
Answer 1 (score: 0)
Just in case you land on this post and are still wondering what is going on, refer to the answer above for how and why you run into this error.
As for me, I would really look at (computed 3.2 MB so far) and start worrying!
However, to resolve it:
Set the spark.storage.memoryFraction flag to 1 when creating the sparkContext, to utilise up to XX GB of memory; by default it is 0.6 of the total memory provided (see the sketch just below).
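A minimal sketch of that (the app name is a placeholder; note that spark.storage.memoryFraction belongs to the legacy memory manager, so on Spark 1.6+ it is only honoured together with spark.memory.useLegacyMode=true):

import org.apache.spark.{SparkConf, SparkContext}

// Give (almost) all of the managed memory to storage, as suggested above.
val conf = new SparkConf()
  .setAppName("storage-fraction-sketch") // placeholder
  .set("spark.storage.memoryFraction", "1")
val sc = new SparkContext(conf)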
Also consider setting rdd.compression to true and, if your data is larger than the available memory, setting the StorageLevel to MEMORY_ONLY_SER (you can also try MEMORY_AND_DISK_SER); a sketch follows below.
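A sketch of those last two suggestions together (the path and app name are placeholders; I believe the actual property behind "rdd.compression" is spark.rdd.compress, and it only affects serialized cached blocks):

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("serialized-cache-sketch") // placeholder
  .config("spark.rdd.compress", "true") // compress serialized cached blocks
  .getOrCreate()
val rdd = spark.read.parquet("/path/to/data").rdd

// Serialized storage keeps each partition as a single byte buffer: slower to access,
// but much more compact than deserialized Java objects.
rdd.persist(StorageLevel.MEMORY_ONLY_SER)
// If it still does not fit, MEMORY_AND_DISK_SER spills the remainder to disk:
// rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)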
Just going through some old mail, I stumbled upon the following property:
**spark.shuffle.spill.numElementsForceSpillThreshold**
We set it as --conf spark.shuffle.spill.numElementsForceSpillThreshold=50000 and the issue was resolved, but the value needs to be tuned for the specific use case (try lowering it to 40000 or 30000).
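Equivalently, a sketch of setting the same property programmatically (this is an internal, largely undocumented knob, so treat its exact name and behaviour as version-dependent; 50000 is the value from above and still needs tuning):

import org.apache.spark.SparkConf

// Force the shuffle sorter to spill after this many records.
val conf = new SparkConf()
  .set("spark.shuffle.spill.numElementsForceSpillThreshold", "50000")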
As of now, Spark has two new parameters:
- spark.shuffle.spill.map.maxRecordsSizeForSpillThreshold
- spark.shuffle.spill.reduce.maxRecordsSizeForSpillThreshold
Reference:
Hope this helps! Cheers!