I am trying to create a DataFrame from a directory containing multiple files. Of these files, only one has a header, and I want to use the inferSchema option to build the schema from that header.
When I create the DF from that single file, it infers the schema correctly.
flights = spark.read.csv("/sample/flight/flight_delays1.csv", header=True, inferSchema=True)
But when I read all the files in the directory, it throws this error.
flights = spark.read.csv("/sample/flight/", header=True, inferSchema=True)
18/04/21 23:49:18 WARN SchemaUtils: Found duplicate column(s) in the data schema and the partition schema: `11`. You might need to assign different column names.
flights.take(5)
18/04/21 23:49:27 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 476, in take
return self.limit(num).collect()
File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 438, in collect
port = self._jdf.collectToPython()
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"Reference '11' is ambiguous, could be: 11#13, 11#32.;"
I know one workaround is to remove the header rows and define the schema manually. Is there another strategy: infer the schema from the one file with a header, then add the remaining files to the DF?
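For reference, the manual-schema workaround mentioned above would look roughly like this; the column names and types below are placeholders, since the real header isn't shown here:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical column definitions -- replace with the actual header fields
manual_schema = StructType([
    StructField("flight_id", IntegerType(), True),
    StructField("carrier", StringType(), True),
    StructField("delay", IntegerType(), True),
])

# Read the whole directory with the hand-built schema; no header inference needed
flights = spark.read.csv("/sample/flight/", schema=manual_schema, header=False)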
Answer 0 (score: 0)
I suggest you do it like this:
# First, infer the schema from the file you know has a header
schm_file = spark.read.csv("/sample/flight/file_with_header.csv", header=True, inferSchema=True)
# Then use that schema to read all the files in the directory
flights = spark.read.csv("/sample/flight/", header=False, mode='DROPMALFORMED', schema=schm_file.schema)
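Note that with header=False, the header line of the file that does have one is read as an ordinary data row; mode='DROPMALFORMED' should then discard it, assuming at least one column in the inferred schema is numeric so that the header text fails type conversion. A quick sanity check (sketch):

# The count should cover all data rows, with the stray header line dropped
flights.count()
flights.printSchema()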
Answer 1 (score: 0)
I came up with another way, but it's not dynamic enough for a large number of files. I prefer the approach @Steven suggested.
df1 = spark.read.csv("/sample/flight/flight_delays1.csv", header=True, inferSchema=True)
df2 = spark.read.schema(df1.schema).csv("/sample/flight/flight_delays2.csv")
df3 = spark.read.schema(df1.schema).csv("/sample/flight/flight_delays3.csv")
complete_df = df1.union(df2).union(df3)
complete_df.count()
complete_df.printSchema()
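To make this approach dynamic for many files, one option is a sketch along these lines (the file list here is hypothetical; it could also come from a filesystem listing):

from functools import reduce

# Infer the schema once from the file that has a header
first = spark.read.csv("/sample/flight/flight_delays1.csv", header=True, inferSchema=True)

# Remaining files, enumerated by hand here for illustration
other_paths = ["/sample/flight/flight_delays2.csv",
               "/sample/flight/flight_delays3.csv"]

# Read each file with the inferred schema, then fold everything into one DataFrame
others = [spark.read.schema(first.schema).csv(p) for p in other_paths]
complete_df = reduce(lambda a, b: a.union(b), others, first)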