DataFrame size keeps growing although its count does not

Time: 2016-11-15 17:13:42

Tags: apache-spark spark-dataframe

I need some help. I ran into a problem with apache-spark when I use a loop to update a DataFrame: its size keeps growing even though its row count does not.

Could you suggest how to fix it, or explain why my DataFrame's size keeps growing? (T ^ T)//
My program runs on local[6] using Spark 2.0.1.

@Here is my code

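import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel
import org.apache.spark.util.SizeEstimator

def main(args: Array[String]): Unit = {
    var df1: DataFrame = ??? // initial DataFrame, read from the database (details omitted)
    while (???) { // loop condition omitted in the original post
        val word_count_df = processAndCountText() // query data from the database and do a word count
        val temp_df1 = update(df1, word_count_df)
        temp_df1.persist(StorageLevel.MEMORY_AND_DISK)
        df1.unpersist()
        df1 = temp_df1

        println(temp_df1.count())
        println(s"${SizeEstimator.estimate(temp_df1) / 1073741824.0} GB")
    }
}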

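// Edited
This is the update function, which updates the rows whose key appears in word_count_df. I tried splitting it into two DataFrames, computing each one separately, and returning the union of the two, but that takes too long because it requires enabling "spark.sql.crossJoin.enabled".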

def update(u_stateful_df : DataFrame, word_count_df : DataFrame) : DataFrame = {
    val run_time = current_end_time_m - start_time_ms / 60000
    val calPenalty = udf { (last_update_duration: Long, run_time: Long) => calculatePenalty(last_update_duration, run_time) }
    //calculatePenalty is a simple math function that uses a for loop and returns a Double
    val calVold = udf { (v_old: Double, penalty_power: Double) => v_old * Math.exp(penalty_power) }


    //(word_new,count_new)
    val word_count_temp_df = word_count_df
            .withColumnRenamed("word", "word_new")
            .withColumnRenamed("count", "count_new")

    //u_stateful_df  (word,u,v,a,last_update,count)
    val state_df = u_stateful_df
            .join(word_count_temp_df, u_stateful_df("word") === word_count_temp_df("word_new"), "outer")
            .na.fill(Map("last_update" -> start_time_ms / 60000))
            .na.fill(0.0)
            .withColumn("word", when(col("word").isNotNull, col("word")).otherwise(col("word_new")))
            .withColumn("count", when(col("word_new").isNotNull, col("count_new")).otherwise(-1))
            .drop("count_new")
            .withColumn("current_end_time_m", lit(current_end_time_m))
            .withColumn("last_update_duration", col("current_end_time_m") - col("last_update"))
            .filter(col("last_update_duration") < ResourceUtility.one_hour_duration_ms / 60000)
            .withColumn("run_time", when(col("word_new").isNotNull, lit(run_time)))
            .withColumn("penalty_power", when(col("word_new").isNotNull, calPenalty(col("last_update_duration"), col("run_time"))))
            .withColumn("v_old_penalty", when(col("word_new").isNotNull, calVold(col("v"), col("penalty_power"))))
            .withColumn("v_new", when(col("word_new").isNotNull, col("count") / run_time))
            .withColumn("v_sum", when(col("word_new").isNotNull, col("v_old_penalty") + col("v_new")))
            .withColumn("a", when(col("word_new").isNotNull, (col("v_sum") - col("v")) / col("last_update_duration")).otherwise(col("a")))
            .withColumn("last_update", when(col("word_new").isNotNull, lit(current_end_time_m)).otherwise(col("last_update")))
            .withColumn("u", when(col("word_new").isNotNull, col("v")).otherwise(col("u")))
            .withColumn("v", when(col("word_new").isNotNull, col("v_sum")).otherwise(col("v")))

    state_df.select("word", "u", "v", "a", "last_update", "count")
}

@Here is my log

u_stateful_df : 1408665
size of dataframe size : 0.8601360470056534 GB

u_stateful_df : 1408665
size of dataframe size : 1.3347024470567703 GB

u_stateful_df : 268498
size of dataframe size : 1.5012029185891151 GB

u_stateful_df : 147232
size of dataframe size : 3.287795402109623 GB

u_stateful_df : 111950
size of dataframe size : 4.761911824345589 GB

....
....

u_stateful_df : 72067
size of dataframe size : 14.510709017515182 GB

@Here is my log when I write it to a file

I save df1 as CSV to the file system. Below are the CSV size on the file system, the row count, and the size as tracked by org.apache.spark.util.SizeEstimator.



csv size 84.2 MB     
u_stateful_df : 1408665     
size of dataframe size : 0.4460855945944786 GB     



csv size 15.2 MB     
u_stateful_df : 183315     
size of dataframe size : 0.522 GB     



csv size 9.96 MB     
u_stateful_df : 123381     
size of dataframe size : 0.630 GB



csv size 4.63 MB     
u_stateful_df : 56896     
size of dataframe size : 0.999 GB

...
...
...

csv size 3.47 MB
u_stateful_df : 43104
size of dataframe size : 3.1956922858953476 GB

1 answer:

Answer 0: (score: 0)

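It looks like there is some kind of leak inside Spark. Normally, when you call persist or cache on a DataFrame and then count, Spark computes the result and stores it in distributed memory or on disk, but it also keeps the whole execution plan so that it can rebuild that DataFrame if an executor is lost or something similar happens. That should not take up this much space, though...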
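
One way to look at the plan this answer refers to is to print how long the DataFrame's analyzed logical plan is on each iteration. The following is a minimal sketch, not the asker's code: the object name `PlanGrowth`, the helper `planLines`, and the toy `withColumn`/`filter` step standing in for `update` are all assumptions made for illustration. `queryExecution` is a developer-facing Spark API, so the exact plan text may differ between versions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object PlanGrowth {
  // Hypothetical helper: number of lines in the printed analyzed logical plan,
  // used here only as a rough proxy for how much plan the DataFrame carries.
  def planLines(df: DataFrame): Int =
    df.queryExecution.analyzed.treeString.split("\n").length

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("plan-growth-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy state table; the real one would come from the database.
    var state: DataFrame = Seq(("spark", 1.0), ("scala", 2.0)).toDF("word", "v")

    for (i <- 1 to 5) {
      // Stand-in for update(...): each pass stacks more nodes on top of the old plan.
      state = state.withColumn("v", col("v") + 1.0).filter(col("v") > 0.0)
      state.persist()
      println(s"iteration $i: rows = ${state.count()}, plan lines = ${planLines(state)}")
    }

    spark.stop()
  }
}
```

If the row count stays flat while the plan line count climbs, the growth is in the plan rather than in the data. As far as I can tell, that would also be consistent with the `SizeEstimator.estimate` numbers in the question, since that utility estimates the in-heap size of the object graph reachable from the `DataFrame` object rather than the size of the stored rows.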

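As far as I know, there is no option to "collapse" a DataFrame (to tell Spark to forget the whole execution plan) other than writing it to storage and then reading it back from that storage.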
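
The write-then-read workaround described above could look roughly like the sketch below. It is only an illustration under stated assumptions: `LineageTruncation`, `truncateLineage`, and `statePath` are hypothetical names, Parquet is an arbitrary choice of format, and the `withColumn` step stands in for the asker's `update` function. Two directories are alternated so that an iteration never overwrites the files its own input is still reading.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object LineageTruncation {
  // Hypothetical helper: materialize `df` as Parquet under `path`, then read it back.
  // The returned DataFrame's plan starts from a file scan, so the chain of joins and
  // withColumn calls that produced `df` is no longer attached to it.
  def truncateLineage(spark: SparkSession, df: DataFrame, path: String): DataFrame = {
    df.write.mode("overwrite").parquet(path)
    spark.read.parquet(path)
  }

  // Hypothetical helper: alternate between two directories per iteration.
  def statePath(baseDir: String, iteration: Int): String =
    s"$baseDir/state_${iteration % 2}"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("lineage-truncation-sketch")
      .getOrCreate()
    import spark.implicits._

    val baseDir = java.nio.file.Files.createTempDirectory("u_stateful").toString
    var state: DataFrame = Seq(("spark", 1.0), ("scala", 2.0)).toDF("word", "v")

    for (i <- 0 until 5) {
      val updated = state.withColumn("v", col("v") * 2) // stand-in for update(...)
      state = truncateLineage(spark, updated, statePath(baseDir, i))
    }

    println(state.count())
    spark.stop()
  }
}
```

The trade-off is an extra write and read per iteration, which may or may not be acceptable depending on how large the state table is and how often the loop runs.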