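I need to load multiple Parquet files into a Spark DataFrame and be able to tell which Parquet file each row was loaded from. Is there a way to add a column with that information while loading the data?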
Answer 0 (score: 1)
You can use input_file_name together with reduce and union:
from pyspark.sql import functions as F
from functools import reduce
paths = ['first', 'second', 'third'] # your paths here
# Use one fixed column name so every DataFrame has the same schema before the union
dataframes = [spark.read.parquet(path).withColumn("source_file", F.input_file_name()) for path in paths]
result = reduce(lambda x, y: x.union(y), dataframes)