How to add a new column to a dataframe in pyspark

Asked: 2017-07-28 14:23:51

Tags: python apache-spark pyspark

I am trying to do this, and I get a very long error:

df=df.withColumn('NewColumnName', someother_df['Time'])

It doesn't work. Doing this instead:

df=df.withColumn('NewColumnName', someother_df.select('Time'))

gives me this error: AssertionError: col should be Column
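
For context: the second argument to withColumn must be a Column expression derived from the same dataframe (or a literal). Passing someother_df['Time'] hands it a Column that belongs to a different dataframe, which fails analysis (the long error), while select('Time') returns a whole DataFrame rather than a Column, hence the AssertionError. A minimal sketch of valid usage, assuming df has a numeric column named value (a hypothetical name):

import pyspark.sql.functions as func

# valid: the expression is built from df itself
df = df.withColumn('doubled', df['value'] * 2)

# valid: a constant wrapped into a Column with lit()
df = df.withColumn('source', func.lit('a'))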

1 Answer:

Answer 0 (score: 2)

It looks like you are trying to merge two dataframes that have no common key, so the code below should work for you.

import pyspark.sql.functions as func

df1 = sc.parallelize([('1234','13'),('6789','68')]).toDF(['col1','col2'])
df2 = sc.parallelize([('7777','66'),('8888','22')]).toDF(['col3','col4'])

# there is no common column between these two dataframes, so add a row_index to join on
df1 = df1.withColumn('row_index', func.monotonically_increasing_id())
df2 = df2.withColumn('row_index', func.monotonically_increasing_id())

# 'col3' from the second dataframe (df2) is added to the first dataframe (df1)
df1 = df1.join(df2.select('row_index', 'col3'), on=['row_index']).drop('row_index')
df1.show()
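
One caveat worth noting (not in the original answer): monotonically_increasing_id() only guarantees IDs that are increasing and unique within each dataframe, so the two dataframes line up only when their partitioning is identical. When that assumption may not hold, a sketch based on zipWithIndex, which assigns consecutive 0-based indices in row order, is safer; with_row_index is a hypothetical helper name:

def with_row_index(df):
    # pair each row with a consecutive index, then rebuild the dataframe
    rdd = df.rdd.zipWithIndex().map(lambda pair: pair[0] + (pair[1],))
    return rdd.toDF(df.columns + ['row_index'])

df1 = with_row_index(df1)
df2 = with_row_index(df2)
df1 = df1.join(df2.select('row_index', 'col3'), on=['row_index']).drop('row_index')
df1.show()

Either way, the intent is the same: row 0 of df1 is paired with row 0 of df2, row 1 with row 1, and so on.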


Don't forget to let us know if it solves your problem. :)