Question

I have a main DataFrame called main_DF that contains all the measurements:

main_DF
group    index   width    height
--------------------------------
1        1       21.3       15.2
1        2       11.3       45.1
2        3       23.2       25.2
2        4       26.1       85.3
...
23       986453  26.1       85.3

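Another table called selected_DF, derived from main_DF, contains the start and end index of the important rows in main_DF, along with the length (end_index - start_index). The fields start_index and end_index correspond to the index field in main_DF.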

selected_DF
group    start_index   end_index    length
--------------------------------
1        1             154          153
2        236           312          76
3        487           624          137
...
238      17487         18624        1137

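Now, for each row in selected_DF, I need to apply a filter to all the measurements between start_index and end_index. For example, say row 1 covers index = 1 through 154. After some filtering, the DataFrame derived from that row is: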

peak_DF
peak_start   peak_end
--------------------------------
1            12
15           21
27           54
86           91
...
143          150

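peak_start and peak_end mark the regions where width exceeds a threshold. They are obtained by selecting all rows with width > threshold and then examining the positions of their index values (sorry, this is hard to explain even with code).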
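
To illustrate the idea on plain Scala collections, here is a simplified sketch (not the Spark code used in this question). It keeps the index values whose width exceeds the threshold and merges neighbouring indices into one region while the gap between them stays within a tolerance. The names findPeaks, threshold and maxGap are hypothetical, and treating a gap of more than 9 as a region boundary is an assumption taken from the gap test in the Spark code.

```scala
// Hypothetical helper: groups indices with width > threshold into (peak_start, peak_end) regions.
// Assumption: two qualifying indices belong to the same region when they are at most maxGap apart.
def findPeaks(rows: Seq[(Int, Double)], threshold: Double, maxGap: Int): Vector[(Int, Int)] = {
  // Keep only the indices whose width exceeds the threshold, in ascending order.
  val hits = rows.collect { case (index, width) if width > threshold => index }.sorted
  hits.foldLeft(Vector.empty[(Int, Int)]) { (regions, index) =>
    regions.lastOption match {
      // Close enough to the current region: extend its end.
      case Some((start, end)) if index - end <= maxGap => regions.init :+ ((start, index))
      // Otherwise open a new region that starts and ends at this index.
      case _ => regions :+ ((index, index))
    }
  }
}

// (index, width) pairs; widths above 25 occur at indices 1, 2, 3, 20 and 21.
val sample = Seq((1, 26.0), (2, 30.0), (3, 27.5), (4, 10.0), (20, 40.0), (21, 41.0), (22, 5.0))
val peaks = findPeaks(sample, threshold = 25.0, maxGap = 9)
// peaks == Vector((1, 3), (20, 21))
```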

Then I need to take the width measurements that fall inside each region of peak_DF and compute their average, giving something like:

peak_DF_summary
peak_start   peak_end    avg_width
--------------------------------
1            12          25.6
15           21          35.7
27           54          24.2
86           91          76.6
...
143          150         13.1

Finally, I compute the average of the avg_width values and save the result.
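
The two averaging steps can be sketched on plain Scala collections as follows. The numbers are made up for illustration and meanOfPeakMeans is a hypothetical name. Note that a mean of per-peak means weights every peak equally, so it generally differs from the mean over all rows that fall inside any peak.

```scala
// Hypothetical helper: average width inside each peak, then the average of those averages.
def meanOfPeakMeans(rows: Seq[(Int, Double)], peaks: Seq[(Int, Int)]): Option[Double] = {
  val perPeak = peaks.flatMap { case (start, end) =>
    // Widths of all rows whose index lies in [start, end], inclusive like SQL BETWEEN.
    val widths = rows.collect { case (index, width) if index >= start && index <= end => width }
    if (widths.isEmpty) None else Some(widths.sum / widths.size)
  }
  // None when no peak contains any row, which avoids dividing by zero.
  if (perPeak.isEmpty) None else Some(perPeak.sum / perPeak.size)
}

val rows = Seq((1, 20.0), (2, 30.0), (3, 10.0), (4, 40.0), (5, 60.0), (6, 80.0))
val result = meanOfPeakMeans(rows, Seq((1, 2), (4, 6)))
// Peak (1, 2) averages 25.0 and peak (4, 6) averages 60.0, so result is Some(42.5).
```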

After that, processing moves on to the next row in selected_DF, and so on.

So far, I have somehow managed to get what I want with this code:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, lead, mean}
import spark.implicits._ //needed for .as[Double]

val main_DF = spark.read.parquet("hdfs_path_here")
main_DF.createOrReplaceTempView("main_DF")
val selected_DF = spark.read.parquet("hdfs_path_here").collect.par //parallelized array
val final_result_array = scala.collection.mutable.ArrayBuffer.empty[Double] //for final result
val window_spec = Window.orderBy("index") //assumed definition; window_spec is not shown in the original code

selected_DF.foreach{x =>
    //x is a Row: read start_index and end_index by position
    val start_index = x.get(1)
    val end_index = x.get(2)

    //obtain peak_start and peak_end (START)
    //rows above the width threshold inside the selected range
    val temp_df_1 = spark.sql("SELECT index, width, height FROM main_DF WHERE width > 25 AND index BETWEEN " + start_index + " AND " + end_index)

    val temp_df_2 = temp_df_1.withColumn("next_index", lead(temp_df_1("index"), 1).over(window_spec) ).withColumn("previous_index", lag(temp_df_1("index"), 1).over(window_spec) )

    //distance to the previous and next row that passed the threshold
    val temp_df_3 = temp_df_2.withColumn("rear_gap", temp_df_2.col("index") - temp_df_2.col("previous_index") ).withColumn("front_gap", temp_df_2.col("next_index") - temp_df_2.col("index") )

    //keep only rows that border a gap larger than 9
    val temp_df_4 = temp_df_3.filter("front_gap > 9 or rear_gap > 9")

    val temp_df_5 = temp_df_4.withColumn("next_front_gap", lead(temp_df_4("front_gap"), 1).over(window_spec) ).withColumn("next_front_gap_index", lead(temp_df_4("index"), 1).over(window_spec) )

    val temp_df_6 = temp_df_5.filter("rear_gap > 9 and next_front_gap > 9").sort("index")
    //obtain peak_start and peak_end (END)

    val peak_DF = temp_df_6.select("index" , "next_front_gap_index").toDF("peak_start", "peak_end").collect

    //one query per peak, run from the driver
    val peak_DF_temp = peak_DF.map { y =>
        spark.sql("SELECT avg(width) as avg_width FROM main_DF WHERE index BETWEEN " + y(0) + " AND " + y(1))
    }

    val peak_DF_summary = peak_DF_temp.reduceLeft( (dfa, dfb) => dfa.union(dfb) )

    val avg_width = peak_DF_summary.agg(mean("avg_width")).as[Double].first

    //ArrayBuffer is not thread-safe and foreach runs in parallel here
    final_result_array.synchronized { final_result_array += avg_width }
}
spark.catalog.dropTempView("main_DF")

(reference)

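The problem is that the code only runs partway (about 20-30 iterations) before it crashes with java.lang.OutOfMemoryError: Java heap space. However, when I run the iterations one at a time, it works fine.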

So my questions are:

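Why is the memory running out? I assumed the cause was accumulated memory usage, so I added .unpersist() to every DataFrame inside the foreach loop (even though I never call .persist()), to no avail. But shouldn't memory consumption reset each time a new iteration of the foreach loop starts and the variables are re-initialized?
Is there an efficient way to do this kind of calculation? I am doing nested loops in Spark, which I feel is a very inefficient approach, but so far it is the only way I have been able to get results.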
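
For reference, one set-based formulation of the inner loop might look like the sketch below. Instead of issuing one query per peak from the driver, it joins the peaks to main_DF on the index range and aggregates once. It assumes the peaks for all selected rows are already available as a single DataFrame with the hypothetical columns selected_id, peak_start and peak_end, and meanOfPeakMeansDF is a made-up name. It has not been run against the data described here, and a range (non-equi) join can itself be expensive unless one side is small enough to broadcast.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, broadcast, col}

// Hypothetical helper. Assumes `peaks` has columns (selected_id, peak_start, peak_end)
// and `main` has at least the columns (index, width).
def meanOfPeakMeansDF(main: DataFrame, peaks: DataFrame): DataFrame = {
  // Range join: each main row is matched to every peak whose [peak_start, peak_end] contains its index.
  // broadcast() is only a hint and assumes the peaks table is small enough to ship to every executor.
  val joined = main.join(
    broadcast(peaks),
    col("index").between(col("peak_start"), col("peak_end"))
  )

  joined
    .groupBy("selected_id", "peak_start", "peak_end")
    .agg(avg("width").as("avg_width"))          // one average per peak
    .groupBy("selected_id")
    .agg(avg("avg_width").as("mean_avg_width")) // one value per selected_DF row
}
```

Calling meanOfPeakMeansDF(main_DF, all_peaks), where all_peaks is a hypothetical DataFrame holding every peak, would return one row per selected_id, so the result can be written out with a single action instead of being collected iteration by iteration.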

I am using CDH 5.7 and Spark 2.1.0. My cluster has 6 nodes, with 32 GB of memory each and 40 cores in total. main_DF is based on a 30 GB Parquet file.

How do I efficiently perform nested loops in Spark / Scala?

0 answers: