I have a PySpark DataFrame with this structure:
+----+----+----+----+----+
|user| A/B|   C| A/B|   C|
+----+----+----+----+----+
|   1|   0|   1|   1|   2|
|   2|   0|   2|   4|   0|
+----+----+----+----+----+
I originally had two dataframes, but I outer-joined them using user as the key, so there can also be null values. I can't find a way to sum the columns with equal names in order to get a dataframe like this:
+----+----+----+
|user| A/B|   C|
+----+----+----+
|   1|   1|   3|
|   2|   4|   2|
+----+----+----+
Also note that there could be many duplicated columns, so selecting each column literally is not an option. In pandas this was possible by using "user" as the index and then adding the two dataframes. How can I do this in Spark?
Answer 0 (score: 1)
I have an approach to solve this. First, rename the columns of one dataframe so the names no longer clash:
// rename every column except the join key by adding a "_1" suffix
val dataFrameOneColumns = df1.columns.map(a => if (a.equals("user")) a else a + "_1")
val updatedDF = df1.toDF(dataFrameOneColumns: _*)
Now perform the join; the output will then contain the two sets of values under distinct column names.
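As a sketch of that step (df2 is an assumed name for the second dataframe, which the answer never gives):

val joinedDF = updatedDF.join(df2, Seq("user"), "outer")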
Then zip up the two lists of column names into tuples:
val newlist = df1.columns.filterNot(_.equals("user")).zip(dataFrameOneColumns.filterNot(_.equals("user")))
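After this, newlist holds pairs such as ("A/B", "A/B_1") and ("C", "C_1"), pairing each original column with its renamed counterpart.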
Then, for each tuple, combine the values of the two columns to get the desired output!
PS: I guess you can write the combining logic yourself, so I'm not spoon-feeding!
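For reference, a minimal sketch of that combining logic, assuming the joinedDF and newlist from above; coalesce treats the nulls introduced by the outer join as 0 before summing, and the suffixed duplicate is dropped afterwards:

import org.apache.spark.sql.functions.{coalesce, col, lit}

// for each (original, renamed) pair: sum the two columns, treating null as 0,
// then drop the suffixed duplicate column
val result = newlist.foldLeft(joinedDF) { case (df, (orig, renamed)) =>
  df.withColumn(orig, coalesce(col(orig), lit(0)) + coalesce(col(renamed), lit(0)))
    .drop(renamed)
}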