Question

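I'm using PySpark on a low-end machine (4 GB RAM, 2 CPU cores) to group and aggregate a fairly large CSV. The goal is to check the memory limits for a prototype. After the aggregation, I need to store the RDD in Cassandra, which runs on another machine.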

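I'm using the DataStax Cassandra Python driver. At first I used rdd.toLocalIterator, iterated over the RDD, and inserted with the driver's synchronous API, session.execute. I managed to insert about 100,000 records in 5 minutes, which is very slow. Looking into this, I found, as explained here, that the Python driver is CPU bound: when I ran the nload network monitor on the Cassandra node, data from the Python driver was arriving at a very low rate, which explains the slowness.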
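To put the figure above in perspective, here is a quick back-of-the-envelope calculation. It only uses the numbers from the question (100,000 records, 5 minutes); the function name is made up for illustration.

```python
def sync_throughput(records, elapsed_s):
    """Return (inserts per second, milliseconds per insert) for a blocking insert loop."""
    rate = records / elapsed_s
    ms_per_insert = 1000 * elapsed_s / records
    return rate, ms_per_insert


rate, ms_per_insert = sync_throughput(100_000, 5 * 60)
print(f"{rate:.0f} inserts/s, {ms_per_insert:.1f} ms per insert")
# prints "333 inserts/s, 3.0 ms per insert"
```

With session.execute each insert waits for the previous one to finish, so roughly 3 ms per round trip caps the loop at a few hundred inserts per second no matter how fast Cassandra itself can write.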

So I tried session.execute_async, and I could see the network transfer running at a very high rate; the inserts were also very fast.

This would have been a happy story, but with session.execute_async I now run out of memory when inserting into a few more tables (with different primary keys).
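One possible cause, offered as a guess rather than a diagnosis: if execute_async is called in a tight loop without ever waiting on the returned futures, every pending request (and its row data) stays in memory at once. A minimal sketch of capping the number of in-flight requests follows. The name insert_bounded and the max_in_flight value are hypothetical; the only driver calls assumed are session.execute_async(statement, parameters) and the .result() method of the future it returns.

```python
from collections import deque


def insert_bounded(session, statement, rows, max_in_flight=128):
    """Insert rows via execute_async, keeping at most max_in_flight requests pending.

    `session` is expected to behave like a cassandra.cluster.Session.
    `rows` can be any iterable of parameter tuples, e.g. rdd.toLocalIterator().
    Returns the number of completed writes.
    """
    pending = deque()
    written = 0
    for row in rows:
        if len(pending) >= max_in_flight:
            # Block on the oldest request before issuing a new one.
            # .result() raises if that write failed.
            pending.popleft().result()
            written += 1
        pending.append(session.execute_async(statement, row))
    # Drain whatever is still in flight.
    while pending:
        pending.popleft().result()
        written += 1
    return written
```

This keeps the network busy like plain execute_async, but memory use is bounded by max_in_flight rows instead of the whole data set. The driver also ships cassandra.concurrent.execute_concurrent_with_args, which takes a concurrency argument and may be a better fit than hand-rolling this.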

Since rdd.toLocalIterator needs memory equal to one partition, I moved the writes to the Spark workers using rdd.foreachPartition(x), but I still run out of memory.
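For reference, a rough sketch of what a per-partition writer could look like when it consumes the partition iterator in fixed-size chunks, so that neither the rows nor the pending futures of a whole partition are held at once. Everything named here (make_partition_writer, connect, chunk_size) is hypothetical; connect stands for whatever zero-argument function opens a driver session on the worker. This is not a claim about what the code in the question does.

```python
from itertools import islice


def chunks(iterator, size):
    """Yield lists of up to `size` items without materializing the whole iterator."""
    iterator = iter(iterator)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def make_partition_writer(connect, cql, chunk_size=500):
    """Build a function suitable for rdd.foreachPartition.

    `connect` is a hypothetical zero-argument factory that returns an object
    behaving like cassandra.cluster.Session; it is called on the worker,
    once per partition.
    """
    def write_partition(rows):
        session = connect()
        try:
            for chunk in chunks(rows, chunk_size):
                # Issue one chunk asynchronously, then wait for all of it
                # before reading more rows from the partition.
                futures = [session.execute_async(cql, row) for row in chunk]
                for future in futures:
                    future.result()  # raises if the write failed
        finally:
            session.shutdown()
    return write_partition


# Possible usage (hypothetical connect function, table and columns):
# rdd.foreachPartition(
#     make_partition_writer(connect, "INSERT INTO ks.tbl (id, total) VALUES (%s, %s)"))
```

Note that the Python workers and the JVM share the same 4 GB on this machine, so even with bounded writes the JVM heap settings may still need to be lowered to leave room for the Python processes.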

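I suspect it is not the RDD iteration that causes this, but the fast serialization (?) in the Python driver's execute_async (which uses Cython).

Of course I could move to a node with more RAM and try there, but it would be good to solve the problem on this node. Maybe I'll try multiprocessing next, but if there are better suggestions, please reply.

The memory error I get comes from the JVM or the OS running out of memory: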

6/05/27 05:58:45 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 183 bytes
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007fdea10cc000, 12288, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/ec2-user/hs_err_pid3208.log

Answer 1

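I tried running this on a machine with more RAM, 16 GB; this time I was able to avoid the out-of-memory scenario above.

However, this time I changed the job to insert into multiple tables.

So even with session.execute_async I found that the Python driver is CPU bound (I guess that because of the GIL it cannot use all the CPU cores), and what goes out over the network is a trickle.

So I could not reach case 2; the plan now is to switch to Scala.

Case 1: Very little output to the network. Cassandra writes fast, but it has nothing to write.

Case 2: The ideal case, where inserts are I/O bound: Cassandra writes very fast.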

Spark: PySpark slowness, memory issue when writing to Cassandra
