Question

I am trying to use the unionAll function in PySpark to combine multiple DataFrames.

This is what I am doing:

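from functools import reduce

from pyspark.ml.feature import Normalizer
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix
from pyspark.sql import DataFrame

df_list = []

for i in range(something):
    # L1-normalize the feature vectors
    normalizer = Normalizer(inputCol="features", outputCol="norm", p=1)
    norm_df = normalizer.transform(some_df)
    norm_df = norm_df.repartition(320)
    data = index_df(norm_df)
    data.persist()
    # Build a distributed block matrix from the (id, norm) rows
    mat = IndexedRowMatrix(
        data.select("id", "norm")\
            .rdd.map(lambda row: IndexedRow(row.id, row.norm.toArray()))).toBlockMatrix()
    # Pairwise dot products between all rows
    dot = mat.multiply(mat.transpose())
    df = dot.toIndexedRowMatrix().rows.toDF()
    df_list.append(df)

# unionAll is a DataFrame method, so it is passed to reduce as DataFrame.unionAll
big_df = reduce(DataFrame.unionAll, df_list)
big_df.write.mode('append').parquet('some_path')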

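I want to do this because the write step takes time, so in my case writing one big file is much faster than writing n small files.

The problem is that when I write big_df and check the Spark UI, there are far too many tasks for writing the Parquet output. My goal is to write one big DataFrame, but it actually writes all of the sub-DataFrames.

Any idea why?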

Answer 1

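Spark is lazily evaluated. The write operation is the action that triggers all of the preceding transformations. The tasks you see therefore belong to those transformations as well, not only to writing the Parquet files.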
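
To make this concrete, here is a minimal sketch, not the asker's code. The helper names union_all and write_compacted are hypothetical, and the target number of files is an assumption you would have to tune. A union typically just keeps the partitions of every input, so with 320 partitions per sub-DataFrame the final stage has roughly 320 tasks per input. Calling coalesce before the write reduces the number of output files, but it does not remove the upstream work.

```python
from functools import reduce


def union_all(dfs):
    """Fold a non-empty list of DataFrames into one with pairwise union().

    union() is a lazy transformation: nothing is computed here, and the
    result keeps the partitions of every input. Columns are matched by
    position, so all inputs are assumed to share the same schema.
    """
    return reduce(lambda left, right: left.union(right), dfs)


def write_compacted(df, path, num_files):
    """Hypothetical helper: shrink to num_files partitions, then append.

    coalesce() is lazy as well; the parquet write is the action that
    finally runs the whole lineage behind df.
    """
    df.coalesce(num_files).write.mode("append").parquet(path)
```

With the code from the question this could be called as write_compacted(union_all(df_list), 'some_path', 8), where 8 is an arbitrary value. You can check how many partitions a DataFrame has with big_df.rdd.getNumPartitions(). Be aware that a drastic coalesce can also reduce the parallelism of the upstream computation; repartition avoids that at the cost of a shuffle.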
