How to add a new column to a dataframe in pyspark

Asked: 2017-07-28 14:23:51

Tags: python apache-spark pyspark

I am trying to do this, and I get a very long error:

df=df.withColumn('NewColumnName', someother_df['Time'])

It doesn't work. Doing this instead:

df=df.withColumn('NewColumnName', someother_df.select('Time'))

gives me this error: AssertionError: col should be Column
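
For context: the second argument to withColumn must be a Column expression derived from the same dataframe (or a literal). Passing someother_df['Time'] hands it a Column that belongs to a different dataframe, which fails analysis (the long error), while select('Time') returns a whole DataFrame rather than a Column, hence the AssertionError. A minimal sketch of valid usage, assuming df has a numeric column named value (a hypothetical name):

import pyspark.sql.functions as func

# valid: the expression is built from df itself
df = df.withColumn('doubled', df['value'] * 2)

# valid: a constant wrapped into a Column with lit()
df = df.withColumn('source', func.lit('a'))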

1 Answer:

Answer 0 (score: 2)

It looks like you are trying to merge two dataframes that have no common key, so the code below should work for you.

import pyspark.sql.functions as func

df1 = sc.parallelize([('1234','13'),('6789','68')]).toDF(['col1','col2'])
df2 = sc.parallelize([('7777','66'),('8888','22')]).toDF(['col3','col4'])

# there is no common column between these two dataframes, so add a row_index to join on
df1 = df1.withColumn('row_index', func.monotonically_increasing_id())
df2 = df2.withColumn('row_index', func.monotonically_increasing_id())

# 'col3' from the second dataframe (df2) is added to the first dataframe (df1)
df1 = df1.join(df2.select('row_index', 'col3'), on=['row_index']).drop('row_index')
df1.show()
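
One caveat worth noting (not in the original answer): monotonically_increasing_id() only guarantees IDs that are increasing and unique within each dataframe, so the two dataframes line up only when their partitioning is identical. When that assumption may not hold, a sketch based on zipWithIndex, which assigns consecutive 0-based indices in row order, is safer; with_row_index is a hypothetical helper name:

def with_row_index(df):
    # pair each row with a consecutive index, then rebuild the dataframe
    rdd = df.rdd.zipWithIndex().map(lambda pair: pair[0] + (pair[1],))
    return rdd.toDF(df.columns + ['row_index'])

df1 = with_row_index(df1)
df2 = with_row_index(df2)
df1 = df1.join(df2.select('row_index', 'col3'), on=['row_index']).drop('row_index')
df1.show()

Either way, the intent is the same: row 0 of df1 is paired with row 0 of df2, row 1 with row 1, and so on.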


Don't forget to let us know if it solves your problem. :)