pyspark - Merge 2 columns of sets

Time: 2017-10-06 14:10:40

Tags: apache-spark pyspark pyspark-sql

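I have a Spark dataframe with two columns that were produced by the function collect_set. I would like to combine these two columns of sets into a single column of sets. How should I do that? Both are sets of strings.

For instance, I have two columns formed by calling collect_set:
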
Fruits                 | Meat
[Apple, Orange, Pear]  | [Beef, Chicken, Pork]

How do I turn this into:

Food

[Apple,Orange,Pear, Beef, Chicken, Pork]

Any help is much appreciated.

3 answers:

Answer 0: (score: 4)

I was also working through this in Python, so here is a Python port of Ramesh's solution:

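from pyspark.sql.functions import col, udf

# Build a one-row DataFrame with two array<string> columns (assumes an active SparkSession named spark).
df = spark.createDataFrame([(['Pear','Orange','Apple'], ['Chicken','Pork','Beef'])],
                           ("Fruits", "Meat"))
df.show(1,False)

# Concatenate the two lists row by row.
mergeCols = udf(lambda fruits, meat: fruits + meat)
df.withColumn("Food", mergeCols(col("Fruits"), col("Meat"))).show(1,False)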

Output:

+---------------------+---------------------+
|Fruits               |Meat                 |
+---------------------+---------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|
+---------------------+---------------------+
+---------------------+---------------------+------------------------------------------+
|Fruits               |Meat                 |Food                                      |
+---------------------+---------------------+------------------------------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|[Pear, Orange, Apple, Chicken, Pork, Beef]|
+---------------------+---------------------+------------------------------------------+

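Thanks, Ramesh!

Edit: Note that you may have to specify the column type manually. I am not sure why it worked for me without an explicit type specification only in some cases; in other cases I got a string-typed column.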

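from pyspark.sql.types import ArrayType, StringType
# Declare the return type so the new column is array<string> rather than string.
mergeCols = udf(lambda fruits, meat: fruits + meat, ArrayType(StringType()))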
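
The explicit return type matters because `udf` falls back to `StringType` when no return type is passed, which likely accounts for the string-typed column mentioned in the edit. Since the question asks for a set, plain concatenation can also leave duplicates when both columns share a value. Below is a minimal sketch of a variant that drops duplicates and tolerates null columns; `merge_unique` and `make_merge_udf` are hypothetical names introduced here and are not part of the original answer.

```python
def merge_unique(left, right):
    """Concatenate two lists, keeping only the first occurrence of each item."""
    merged = []
    seen = set()
    # Treat a null (None) column value as an empty list.
    for item in (left or []) + (right or []):
        if item not in seen:
            seen.add(item)
            merged.append(item)
    return merged


def make_merge_udf():
    """Wrap merge_unique as a Spark UDF that returns array<string>."""
    # Imported here so merge_unique can be tried without a Spark installation.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType
    return udf(merge_unique, ArrayType(StringType()))


print(merge_unique(['Pear', 'Orange', 'Apple'], ['Chicken', 'Apple', 'Beef']))
# ['Pear', 'Orange', 'Apple', 'Chicken', 'Beef']
```

With a DataFrame like the one above, this would be applied as `df.withColumn("Food", make_merge_udf()(col("Fruits"), col("Meat")))`.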

Answer 1: (score: 2)

Given that you have a dataframe such as

+---------------------+---------------------+
|Fruits               |Meat                 |
+---------------------+---------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|
+---------------------+---------------------+

you can write a udf function that merges the sets of the two columns into one.

import org.apache.spark.sql.functions._
import scala.collection.mutable
// Concatenate the two array<string> columns row by row.
def mergeCols = udf((fruits: mutable.WrappedArray[String], meat: mutable.WrappedArray[String]) => fruits ++ meat)

Then call the udf function as

df.withColumn("Food", mergeCols(col("Fruits"), col("Meat"))).show(false)

You should then have your desired final dataframe

+---------------------+---------------------+------------------------------------------+
|Fruits               |Meat                 |Food                                      |
+---------------------+---------------------+------------------------------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|[Pear, Orange, Apple, Chicken, Pork, Beef]|
+---------------------+---------------------+------------------------------------------+
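
For PySpark readers, a similar result can be had without a UDF on newer Spark versions. The sketch below assumes Spark 2.4 or later, where `pyspark.sql.functions.array_union` is available; it returns the union of two array columns without duplicates. The helper name `add_food_column` and its default column names are illustrative only.

```python
def add_food_column(df, left="Fruits", right="Meat", out="Food"):
    """Return df with a new column holding the duplicate-free union of two array columns."""
    # Assumption: Spark 2.4+; array_union does not exist in earlier releases.
    from pyspark.sql import functions as F
    return df.withColumn(out, F.array_union(F.col(left), F.col(right)))
```

Usage would look like `add_food_column(df).show(truncate=False)`. Staying with built-in functions keeps the work inside Spark's own execution engine instead of passing every row through a Python function.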

Answer 2: (score: 0)

Let's say df is

+--------------------+--------------------+
|              Fruits|                Meat|
+--------------------+--------------------+
|[Pear, Orange, Ap...|[Chicken, Pork, B...|
+--------------------+--------------------+

then

import itertools
df.rdd.map(lambda x: [item for item in itertools.chain(x.Fruits, x.Meat)]).collect()

combines Fruits and Meat into a single collection per row, i.e.

[[u'Pear', u'Orange', u'Apple', u'Chicken', u'Pork', u'Beef']]
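
What the `map` step does to each row can be tried without a cluster. In this sketch a `namedtuple` stands in for Spark's `Row` (an assumption for illustration; the real code receives `pyspark.sql.Row` objects), and `merge_row` is a hypothetical name for the lambda above.

```python
import itertools
from collections import namedtuple

# Stand-in for pyspark.sql.Row with the same two fields as df.
FoodRow = namedtuple("FoodRow", ["Fruits", "Meat"])


def merge_row(row):
    """Chain the two list fields of a row into a single list."""
    return list(itertools.chain(row.Fruits, row.Meat))


rows = [FoodRow(["Pear", "Orange", "Apple"], ["Chicken", "Pork", "Beef"])]
merged = [merge_row(r) for r in rows]
print(merged)
```

Note that `collect()` brings the merged lists back to the driver as plain Python lists rather than adding a column to the DataFrame.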


Hope this helps!