Avoiding duplicate column names when joining two DataFrames in PySpark

Date: 2017-02-03 13:47:34

Tags: apache-spark pyspark spark-dataframe

I have the following code:

from pyspark.sql import SQLContext
ctx = SQLContext(sc)
a = ctx.createDataFrame([("1","a",1),("2","a",1),("3","a",0),("4","a",0),("5","b",1),("6","b",0),("7","b",1)],["id","group","value1"])
b = ctx.createDataFrame([("1","a",8),("2","a",1),("3","a",1),("4","a",2),("5","b",1),("6","b",3),("7","b",4)],["id","group","value2"])
c = a.join(b,"id")
c.select("group")

It returns this error:

pyspark.sql.utils.AnalysisException: Reference 'group' is ambiguous, could be: group#1406, group#1409.;

The problem is that c now contains the "group" column twice:

>>> c.columns
['id', 'group', 'value1', 'group', 'value2']

I would like to be able to write, for example:

c.select("a.group")

but I don't know how to automatically adjust the column names when performing the join.

1 answer:

Answer 0 (score: 2)

Just drop the quotes: c.select(a.group). The attribute reference a.group is tied to DataFrame a, so it unambiguously selects the group column coming from a.