Question

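I need to load multiple Parquet files into a Spark DataFrame and record which Parquet file each row was loaded from. Is there a way to add such a column while loading the data?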

Answer 1

You can use input_file_name together with reduce and union:

from pyspark.sql import functions as F
from functools import reduce

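paths = ['first', 'second', 'third']  # your paths here

# Tag each row with its source file. A fixed column name ('source_file',
# chosen here for illustration) keeps the schema identical across DataFrames.
dataframes = [spark.read.parquet(path).withColumn('source_file', F.input_file_name()) for path in paths]

# union() matches columns by position, so all DataFrames need the same layout.
result = reduce(lambda x, y: x.union(y), dataframes)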
