How to concatenate identical column values into a new column with a comma separator in Spark

Asked: 2016-10-13 13:44:49

Tags: scala apache-spark apache-spark-sql

The input data has the following format:

+------------+------+----------+
|       date | user |  product |
+------------+------+----------+
| 2016-10-01 |  Tom | computer |
| 2016-10-01 |  Tom |   iphone |
| 2016-10-01 | Jhon |     book |
| 2016-10-02 |  Tom |      pen |
| 2016-10-02 | Jhon |     milk |
+------------+------+----------+

The desired output is:

+------+---------------------+
| user |            products |
+------+---------------------+
|  Tom | computer,iphone,pen |
| Jhon |           book,milk |
+------+---------------------+

The output lists, for each user, all the products they ordered, in date order.

I would like to process this data with Spark. Could you help me? Thanks.

3 Answers:

Answer 0 (score: 2)

It is better to use a map + reduceByKey() combination rather than groupBy. Also assuming the data has no …

// Read the data with val ordersRDD = sc.textFile("/file/path");
// for this example the RDD is built inline instead:
val ordersRDD = sc.parallelize(List(("2016-10-01", "Tom", "computer"),
    ("2016-10-01", "Tom", "iphone"),
    ("2016-10-01", "Jhon", "book"),
    ("2016-10-02", "Tom", "pen"),
    ("2016-10-02", "Jhon", "milk")))

// Key by (user, date), sort by key, drop the date,
// then reduce by user, concatenating the products with commas
val dtusrGrpRDD = ordersRDD.map(rec => ((rec._2, rec._1), rec._3))
   .sortByKey().map(x => (x._1._1, x._2))
   .reduceByKey((acc, v) => acc + "," + v)

// If needed, convert it to a DataFrame
scala> dtusrGrpRDD.toDF("user", "product").show()
+----+-------------------+
|user|            product|
+----+-------------------+
| Tom|computer,iphone,pen|
|Jhon|          book,milk|
+----+-------------------+
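Note: the toDF call above works as-is in spark-shell because the SQL implicits are already imported there; in a standalone application you would need import sqlContext.implicits._ (Spark 1.x) or import spark.implicits._ (Spark 2.x+) in scope first.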

Answer 1 (score: 1)

If you are using a HiveContext (which you should be):

Example using Python:

from pyspark.sql.functions import collect_set

df = ... load your df ...
new_df = df.groupBy("user").agg(collect_set("product").alias("products"))

If you do want repeated products kept in the resulting list, use collect_list instead.
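For reference, a rough Scala equivalent (just a sketch, assuming a DataFrame named df with user and product columns) that also joins the collected values into the comma-separated string shown in the desired output, using concat_ws:

import org.apache.spark.sql.functions.{collect_list, concat_ws}

// Group by user, collect each user's products, and join them with commas.
// Note: collect_list does not guarantee any particular ordering of the values.
val result = df.groupBy("user")
  .agg(concat_ws(",", collect_list("product")).alias("products"))

result.show(false)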

Answer 2 (score: 0)

For a DataFrame, it is a two-liner:


groupBy gives you the data grouped per user, on which you can then apply collect_list or collect_set, as sketched below.
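A minimal sketch of that two-liner (assuming a DataFrame named df with user and product columns):

import org.apache.spark.sql.functions.collect_list

// Group by user, then collect each user's products into an array column.
val grouped = df.groupBy("user").agg(collect_list("product").alias("products"))
grouped.show(false)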