Question

I am trying to combine multiple rows in a Spark DataFrame based on a condition.

This is the DataFrame I have (df):

|username | qid | row_no | text  |
 ---------------------------------
|  a      | 1   |  1     | this  |
|  a      | 1   |  2     |  is   |
|  d      | 2   |  1     |  the  |
|  a      | 1   |  3     | text  |
|  d      | 2   |  2     |  ball |

I want it to look like this:

|username | qid | row_no | text        |
 ---------------------------------------
|   a     | 1   |  1,2,3 | This is text|
|   d     | 2   |  1,2   | The ball    |

I am using Spark 1.5.2, which does not have the collect_list function.

Answer 1

collect_list only appeared in Spark 1.6.

I would go through the underlying RDD. Here is how, starting from this DataFrame:

data_df.show()
+--------+---+------+----+
|username|qid|row_no|text|
+--------+---+------+----+
|       d|  2|     2|ball|
|       a|  1|     1|this|
|       a|  1|     3|text|
|       a|  1|     2|  is|
|       d|  2|     1| the|
+--------+---+------+----+

Key each row by (username, qid), reduce by key, sort each group by row_no, and rebuild the DataFrame:

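import pyspark.sql.types as typ

# 1. Key each row by (username, qid); wrap (row_no, text) in a one-element list
# 2. Concatenate the lists that share a key
# 3. Sort each group's list by row_no
# 4. Join the row numbers with ',' and the text fragments with ' '
reduced = data_df\
    .rdd\
    .map(lambda row: ((row[0], row[1]), [(row[2], row[3])]))\
    .reduceByKey(lambda x, y: x + y)\
    .map(lambda row: (row[0], sorted(row[1], key=lambda text: text[0])))\
    .map(lambda row: (
            row[0][0],
            row[0][1],
            ','.join([str(e[0]) for e in row[1]]),
            ' '.join([str(e[1]) for e in row[1]])
        )
    )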

schema_red = typ.StructType([
        typ.StructField('username', typ.StringType(), False),
        typ.StructField('qid', typ.IntegerType(), False),
        typ.StructField('row_no', typ.StringType(), False),
        typ.StructField('text', typ.StringType(), False)
    ])

df_red = sqlContext.createDataFrame(reduced, schema_red)
df_red.show()

The above produces the following:

+--------+---+------+------------+
|username|qid|row_no|        text|
+--------+---+------+------------+
|       d|  2|   1,2|    the ball|
|       a|  1| 1,2,3|this is text|
+--------+---+------+------------+
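
To see what each step of the RDD pipeline does without starting Spark, the same group, sort, and join logic can be traced in plain Python. This is only an illustrative sketch: the sample rows are copied from the question, and combine_rows is a hypothetical helper name, not part of any Spark API.

```python
from collections import defaultdict

# Sample rows copied from the question: (username, qid, row_no, text)
rows = [
    ('a', 1, 1, 'this'),
    ('a', 1, 2, 'is'),
    ('d', 2, 1, 'the'),
    ('a', 1, 3, 'text'),
    ('d', 2, 2, 'ball'),
]


def combine_rows(rows):
    """Group by (username, qid), sort by row_no, then join row_no and text."""
    grouped = defaultdict(list)
    for username, qid, row_no, text in rows:
        # Plays the role of map + reduceByKey: collect (row_no, text) per key
        grouped[(username, qid)].append((row_no, text))

    combined = []
    for (username, qid), pairs in sorted(grouped.items()):
        pairs.sort(key=lambda pair: pair[0])  # order each group by row_no
        combined.append((
            username,
            qid,
            ','.join(str(row_no) for row_no, _ in pairs),
            ' '.join(text for _, text in pairs),
        ))
    return combined


for row in combine_rows(rows):
    print(row)
```

Unlike the RDD version, this sketch sorts the groups by key so the output order is stable; Spark makes no such guarantee about the order of the reduced rows.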

Answer 2

You can apply groupBy on the username and qid columns, then call the agg() method, where you can use a function such as collect_list():

import pyspark.sql.functions as func

That gives you collect_list() and other useful functions.

For details about groupBy and agg, see the linked page.

Hope this solves your problem.

Thanks
