Loading multiple Parquet files into a Spark DataFrame

Time: 2019-05-14 07:10:32

Tags: pyspark parquet

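I need to load multiple Parquet files into a Spark DataFrame and record which Parquet file each row was loaded from. Is there a way to add such a column while loading the data?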

1 answer:

Answer 0: (score: 1)

You can use input_file_name together with reduce and union:

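from functools import reduce
from pyspark.sql import functions as F

paths = ['first', 'second', 'third']  # your paths here

# Tag every row with the file it was read from. 'source_file' is an
# illustrative column name; it is the same for every path so that the
# unioned result has a single, consistently named column.
dataframes = [spark.read.parquet(path).withColumn('source_file', F.input_file_name())
              for path in paths]

# union matches columns by position, so every DataFrame must share one schema.
result = reduce(lambda x, y: x.union(y), dataframes)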