How to do the opposite of explode in PySpark?

Asked: 2017-04-11 23:17:54

Tags: apache-spark pyspark apache-spark-sql

Say I have a DataFrame with one column of users and another column of words they have written:

Row(user='Bob', word='hello')
Row(user='Bob', word='world')
Row(user='Mary', word='Have')
Row(user='Mary', word='a')
Row(user='Mary', word='nice')
Row(user='Mary', word='day')

I would like to aggregate the word column into a vector:

Row(user='Bob', words=['hello','world'])
Row(user='Mary', words=['Have','a','nice','day'])

I can't seem to use any of Spark's grouping functions because they expect a subsequent aggregation step. My use case is that I want to feed this data into Word2Vec, not use other Spark aggregations.

5 Answers:

Answer 0 (score: 12)

Thanks to @titipat for the RDD solution. Shortly after posting, I realized there is actually a DataFrame solution using collect_set (or collect_list):

from pyspark.sql import Row
from pyspark.sql.functions import collect_set

# Build the example DataFrame of (user, word) rows
rdd = spark.sparkContext.parallelize([Row(user='Bob', word='hello'),
                                      Row(user='Bob', word='world'),
                                      Row(user='Mary', word='Have'),
                                      Row(user='Mary', word='a'),
                                      Row(user='Mary', word='nice'),
                                      Row(user='Mary', word='day')])
df = spark.createDataFrame(rdd)

# Group by user and collect each user's words into an array column
group_user = df.groupBy('user').agg(collect_set('word').alias('words'))
print(group_user.collect())

[Row(user='Mary', words=['Have', 'nice', 'day', 'a']), Row(user='Bob', words=['world', 'hello'])]
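
A caveat worth noting here (my addition, not part of the original answer): collect_set drops duplicate words and makes no ordering guarantee, as the reshuffled output above shows for Mary. If duplicates or word order matter for the downstream Word2Vec step, collect_list is the closer fit:

from pyspark.sql.functions import collect_list

# collect_list keeps duplicates; ordering within a group is still not guaranteed
group_user = df.groupBy('user').agg(collect_list('word').alias('words'))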

Answer 1 (score: 6)

from pyspark.sql import functions as F

df.groupby("user").agg(F.collect_list("word"))

Answer 2 (score: 4)

Here is a solution using RDDs.

from pyspark.sql import Row
rdd = spark.sparkContext.parallelize([Row(user='Bob', word='hello'),
                                      Row(user='Bob', word='world'),
                                      Row(user='Mary', word='Have'),
                                      Row(user='Mary', word='a'),
                                      Row(user='Mary', word='nice'),
                                      Row(user='Mary', word='day')])
# Group rows by user, then collapse each group's words into a list
group_user = rdd.groupBy(lambda x: x.user)
group_agg = group_user.map(lambda x: Row(**{'user': x[0], 'word': [t.word for t in x[1]]}))
group_agg.collect()

Output:

[Row(user='Bob', word=['hello', 'world']),
Row(user='Mary', word=['Have', 'a', 'nice', 'day'])]
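
An alternative RDD sketch (my addition, assuming the same rdd as above) that avoids materializing groupBy iterables by concatenating per-key lists with reduceByKey:

# Map each row to (user, [word]), then concatenate the lists per key
pairs = rdd.map(lambda row: (row.user, [row.word]))
grouped = pairs.reduceByKey(lambda a, b: a + b)
print(grouped.collect())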

Answer 3 (score: 2)

As of Spark version 2.3, we now have Pandas UDFs (a.k.a. vectorized UDFs). The function below accomplishes the OP's task... A benefit of using this function is that the order is guaranteed to be preserved. Order is essential in many cases, such as time series analysis.

import pandas as pd
import findspark

findspark.init()
import pyspark
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, ArrayType

spark = SparkSession.builder.appName('test_collect_array_grouped').getOrCreate()

def collect_array_grouped(df, groupbyCols, aggregateCol, outputCol):
    """
    Aggregate function: returns a new :class:`DataFrame` such that for a given column, aggregateCol,
    in a DataFrame, df, collect into an array the elements for each grouping defined by the groupbyCols list.
    The new DataFrame will have, for each row, the grouping columns and an array of the grouped
    values from aggregateCol in the outputCol.

    :param groupbyCols: list of columns to group by.
            Each element should be a column name (string) or an expression (:class:`Column`).
    :param aggregateCol: the column name of the column of values to aggregate into an array
            for each grouping.
    :param outputCol: the column name of the column to output the aggregated array to.
    """
    groupbyCols = [] if groupbyCols is None else groupbyCols
    df = df.select(groupbyCols + [aggregateCol])
    schema = df.select(groupbyCols).schema
    aggSchema = df.select(aggregateCol).schema
    arrayField = StructField(name=outputCol, dataType=ArrayType(aggSchema[0].dataType, False))
    schema = schema.add(arrayField)
    @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
    def _get_array(pd_df):
        # The group-by keys are constant within each group, so take them from
        # the first row, then append the whole column of values as one array
        vals = pd_df[groupbyCols].iloc[0].tolist()
        vals.append(pd_df[aggregateCol].values)
        return pd.DataFrame([vals])
    return df.groupby(groupbyCols).apply(_get_array)

rdd = spark.sparkContext.parallelize([Row(user='Bob', word='hello'),
                                      Row(user='Bob', word='world'),
                                      Row(user='Mary', word='Have'),
                                      Row(user='Mary', word='a'),
                                      Row(user='Mary', word='nice'),
                                      Row(user='Mary', word='day')])
df = spark.createDataFrame(rdd)

collect_array_grouped(df, ['user'], 'word', 'users_words').show()

+----+--------------------+
|user|         users_words|
+----+--------------------+
|Mary|[Have, a, nice, day]|
| Bob|      [hello, world]|
+----+--------------------+
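
On Spark 3.x, the same grouped-map pattern is usually written with DataFrame.groupby(...).applyInPandas instead of the GROUPED_MAP pandas_udf decorator. A minimal sketch under that assumption, using the column names from the example above:

import pandas as pd

def collect_words(pdf):
    # One output row per group: the user plus the ordered list of words
    return pd.DataFrame([[pdf['user'].iloc[0], pdf['word'].tolist()]],
                        columns=['user', 'words'])

df.groupby('user').applyInPandas(
    collect_words, schema='user string, words array<string>').show()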

Answer 4 (score: 1)

You have a native aggregate function for this: collect_set (docs here).

Then, you could use:

from pyspark.sql import functions as F
df.groupby("user").agg(F.collect_set("word"))