How do I write a dataframe that has duplicate column names to a CSV file after a join operation? Currently I am using the following code:
dfFinal.coalesce(1).write.format('com.databricks.spark.csv').save('/home/user/output/', header='true')
This writes the dataframe dfFinal to /home/user/output, but it does not work when the dataframe contains duplicate columns. Below is the dfFinal dataframe:
+----------+---+------+---+------+
|    NUMBER| ID|AMOUNT| ID|AMOUNT|
+----------+---+------+---+------+
|9090909092|  1|    30|  1|    40|
|9090909093|  2|    30|  2|    50|
|9090909090|  3|    30|  3|    60|
|9090909094|  4|    30|  4|    70|
+----------+---+------+---+------+
The dataframe above was produced by a join operation. When writing it to a CSV file, I get the following error:
pyspark.sql.utils.AnalysisException: u'Found duplicate column(s) when inserting into file:/home/user/output: `amount`, `id`;'
Answer 0 (score: 0)
When you specify the join column as a string or a list of strings, it produces only a single column for the join key [1]. PySpark example:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data reproducing the two sides of the join.
l = [('9090909092', 1, 30), ('9090909093', 2, 30), ('9090909090', 3, 30), ('9090909094', 4, 30)]
r = [(1, 40), (2, 50), (3, 60), (4, 70)]
left = spark.createDataFrame(l, ['NUMBER', 'ID', 'AMOUNT'])
right = spark.createDataFrame(r, ['ID', 'AMOUNT'])

# Joining on the column name as a string keeps a single ID column.
df = left.join(right, 'ID')
df.show()
+---+----------+------+------+
| ID|    NUMBER|AMOUNT|AMOUNT|
+---+----------+------+------+
|  1|9090909092|    30|    40|
|  3|9090909090|    30|    60|
|  2|9090909093|    30|    50|
|  4|9090909094|    30|    70|
+---+----------+------+------+
However, this still leaves duplicate names for every column that is not a join column (the AMOUNT columns in this example). For those columns you should assign a new name, either before the join (see the sketch after the second table below) or after it, using the toDF dataframe function [2]:
# Rename all four columns at once; toDF takes the new names positionally.
newNames = ['ID', 'NUMBER', 'LAMOUNT', 'RAMOUNT']
df = df.toDF(*newNames)
df.show()
+---+----------+-------+-------+
| ID|    NUMBER|LAMOUNT|RAMOUNT|
+---+----------+-------+-------+
|  1|9090909092|     30|     40|
|  3|9090909090|     30|     60|
|  2|9090909093|     30|     50|
|  4|9090909094|     30|     70|
+---+----------+-------+-------+
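For completeness, here is a minimal sketch of the "rename before the join" variant, tied back to the CSV write from the question. The name right_renamed is my own, not from the question; with unique column names the AnalysisException no longer occurs, and Spark 2.x's built-in CSV writer can be used in place of com.databricks.spark.csv:

# Rename the right-hand AMOUNT column before joining, so every column
# name in the joined result is unique (right_renamed is an illustrative name).
right_renamed = right.withColumnRenamed('AMOUNT', 'RAMOUNT')
out = left.join(right_renamed, 'ID')

# With unique names the write succeeds; header=True writes a header row.
out.coalesce(1).write.csv('/home/user/output/', header=True)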
[1] https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html
[2] http://spark.apache.org/docs/2.2.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.toDF