Question

I am trying to run some Spark jobs, but the executors often run out of memory:

17/02/06 19:12:02 WARN TaskSetManager: Lost task 10.0 in stage 476.3 (TID 133250, 10.0.0.10): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Container marked as failed: container_1486378087852_0006_01_000019 on host: 10.0.0.10. Exit status: 52. Diagnostics: Exception from container-launch.
Container id: container_1486378087852_0006_01_000019
Exit code: 52
Stack trace: ExitCodeException exitCode=52:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:933)

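Since I have already set spark.executor.memory=20480m, I don't think the job should really need more RAM to run, so the other option I see is to increase the number of partitions.

I tried: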

>>> sqlContext.setConf("spark.sql.shuffle.partitions", u"2001")
>>> sqlContext.getConf("spark.sql.shuffle.partitions")
u'2001'

and

>>> all_users.repartition(2001)

However, when I start the job I still see the default 200 partitions:

>>> all_users.repartition(2001).show()
[Stage 526:(0 + 30) / 200][Stage 527:>(0 + 0) / 126][Stage 528:>(0 + 0) / 128]0]

I am using PySpark 2.0.2 on Azure HDInsight. Can anyone point out what I am doing wrong?

Edit

Following the answer below, I tried:

sqlContext.setConf('spark.sql.shuffle.partitions', 2001)

at the very beginning, but it did not work. However, this did work:

sqlContext.setConf('spark.sql.files.maxPartitionBytes', 100000000)

all_users is a SQL DataFrame. A concrete example is:

from pyspark.sql import functions as F

# user_window is a window specification defined elsewhere (not shown here)
all_users = sqlContext.table('RoamPositions')\
    .withColumn('prev_district_id', F.lag('district_id', 1).over(user_window))\
    .withColumn('prev_district_name', F.lag('district_name', 1).over(user_window))\
    .filter('prev_district_id IS NOT NULL AND prev_district_id != district_id')\
    .select('timetag', 'imsi', 'prev_district_id', 'prev_district_name', 'district_id', 'district_name')

Answer 1

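Based on your comments, it looks like you read data from an external source and apply window functions before you call repartition. Window functions:

Repartition the data to a single partition if no partitionBy clause is provided.
Use the standard shuffle mechanism if you provide a partitionBy clause.

The latter seems to be the case here. Since the default value of spark.sql.shuffle.partitions is 200, your data is shuffled into 200 partitions before it is repartitioned. If you want 2001 partitions all the way through, you should set the property before you load the data: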

sqlContext.setConf("spark.sql.shuffle.partitions", u"2001")
all_users = ...
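The reasoning above can be traced with a toy model in plain Python. It does not call Spark: the function name `stage_task_counts`, its parameters, and the figure of 126 input partitions are made up for illustration. It only encodes the two window-function rules listed above, plus the fact that `repartition(n)` adds one more shuffle with `n` output partitions.

```python
def stage_task_counts(input_partitions, has_partition_by,
                      shuffle_partitions=200, repartition_to=None):
    """Return the expected task count per stage for
    read -> window function -> optional repartition.

    Toy model only: it encodes the rules described above,
    not the behaviour of Spark's query planner.
    """
    counts = [input_partitions]  # stage that reads the source table
    # Window stage: a single partition without partitionBy, otherwise
    # as many partitions as spark.sql.shuffle.partitions.
    counts.append(shuffle_partitions if has_partition_by else 1)
    if repartition_to is not None:
        counts.append(repartition_to)  # extra stage created by repartition(n)
    return counts


# Default setting: the window stage still runs with 200 tasks,
# and only the final stage uses 2001.
default_plan = stage_task_counts(126, True, repartition_to=2001)

# spark.sql.shuffle.partitions set to 2001 before all_users is built:
# the window stage itself now has 2001 tasks.
tuned_plan = stage_task_counts(126, True, shuffle_partitions=2001)

print(default_plan)  # [126, 200, 2001]
print(tuned_plan)    # [126, 2001]
```

This would explain why the progress bar still shows a stage with 200 tasks even though `repartition(2001)` is applied afterwards.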

Note also that spark.sql.shuffle.partitions does not affect the initial number of partitions. Those can be controlled with other properties: How to increase partitions of the sql result from HiveContext in spark sql
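One such property is spark.sql.files.maxPartitionBytes, which is the setting the asker reports as working. The sketch below is a rough plain-Python estimate of how it bounds the number of input splits, assuming the table is read through Spark's file-based data source and the files are splittable. The formula follows my understanding of the Spark 2.x file scan and may differ in other versions. The function names, the ten 1 GiB files, and the default parallelism of 30 are hypothetical.

```python
import math

DEFAULT_MAX_PARTITION_BYTES = 128 * 1024 * 1024  # spark.sql.files.maxPartitionBytes
DEFAULT_OPEN_COST_BYTES = 4 * 1024 * 1024        # spark.sql.files.openCostInBytes


def max_split_bytes(file_sizes, default_parallelism,
                    max_partition_bytes=DEFAULT_MAX_PARTITION_BYTES,
                    open_cost_bytes=DEFAULT_OPEN_COST_BYTES):
    """Estimate the largest chunk of a file that goes into one input partition."""
    # Each file is padded with the cost of opening it.
    total_bytes = sum(size + open_cost_bytes for size in file_sizes)
    bytes_per_core = total_bytes // default_parallelism
    return min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))


def estimate_splits(file_sizes, split_bytes):
    """Count the chunks the files are cut into.

    This is a rough upper bound on the number of input partitions,
    because small chunks can be packed together into one partition.
    """
    return sum(math.ceil(size / split_bytes) for size in file_sizes)


GIB = 1024 ** 3
files = [GIB] * 10  # hypothetical input: ten files of 1 GiB each

default_splits = estimate_splits(files, max_split_bytes(files, 30))
tuned_splits = estimate_splits(
    files, max_split_bytes(files, 30, max_partition_bytes=100000000))

print(default_splits)  # 80
print(tuned_splits)    # 110
```

Lowering the limit from the default 128 MiB to 100000000 bytes cuts each file into smaller pieces, so the first stage starts with more, smaller partitions.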

How to spread out the work so as not to run out of memory
