Applying a function to groupBy data with pyspark

Asked: 2016-12-05 20:50:31

Tags: apache-spark pyspark

I'm trying to get word counts from a CSV while grouping on another column. My CSV has three columns: id, message, and user_id. I read it into a DataFrame df, then split each message and store a list of unigrams.

Next, I want to group by user_id and get a count of each unigram. As a simple first pass, I tried grouping by user_id and taking the length of the grouped message field:

from collections import Counter
from pyspark.sql.types import ArrayType, StringType, IntegerType
from pyspark.sql.functions import udf

df = self.session.read.csv(self.corptable, header=True, mode="DROPMALFORMED",)
# split my messages ....
# message is now ArrayType(StringType())

grouped = df.groupBy(df["user_id"])
counter = udf(lambda l: len(l), ArrayType(StringType()))
grouped.agg(counter(df["message"]))
print(grouped.collect())

I get the following error:

pyspark.sql.utils.AnalysisException: "expression '`message`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;"

I'm not sure how to get around this error. In general, how do you apply a function to one column while grouping on another? Do I always have to create a user-defined function? I'm very new to Spark.

Edit: Here is how I ended up solving it, given a tokenizer in a separate Python file:

from collections import Counter
from operator import add

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

group_field = "user_id"
message_field = "message"

context = SparkContext()
session = SparkSession\
        .builder\
        .appName("dlastk")\
        .getOrCreate()

# add tokenizer
context.addPyFile(tokenizer_path)
from tokenizer import Tokenizer
tokenizer = Tokenizer()
spark_tokenizer = udf(tokenizer.tokenize, ArrayType(StringType()))

df = session.read.csv("myFile.csv", header=True,)
df = df[group_field, message_field]

# tokenize the message field
df = df.withColumn(message_field, spark_tokenizer(df[message_field]))

# create ngrams from tokenized messages
n = 1
grouped = df.rdd.map(lambda row: (row[0], Counter([" ".join(x) for x in zip(*[row[1][i:] for i in range(n)])]))).reduceByKey(add)

# flatten the rdd so that each row contains (group_id, ngram, count, relative frequency)
flat = grouped.flatMap(lambda row: [[row[0], x,y, y/sum(row[1].values())] for x,y in row[1].items()])

# rdd -> DF
flat = flat.toDF()
flat.write.csv("myNewCSV.csv")
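
For reference, the zip-of-shifted-slices expression above is a standard way to build n-grams from a token list. A small pure-Python sketch (with a hypothetical token list) shows what it produces for n = 2:

from collections import Counter

tokens = ["at", "the", "oecd", "with", "the"]  # hypothetical tokenized message
n = 2
ngrams = [" ".join(gram) for gram in zip(*[tokens[i:] for i in range(n)])]
print(ngrams)           # ['at the', 'the oecd', 'oecd with', 'with the']
print(Counter(ngrams))  # counts per 2-gram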

The data looks like:

# after read
+--------------------+--------------------+
|             user_id|             message|
+--------------------+--------------------+
|00035fb0dcfbeaa8b...|To the douchebag ...|
|00035fb0dcfbeaa8b...|   T minus 1 week...|
|00035fb0dcfbeaa8b...|Last full day of ...|
+--------------------+--------------------+

# after tokenize
+--------------------+--------------------+
|             user_id|             message|
+--------------------+--------------------+
|00035fb0dcfbeaa8b...|[to, the, doucheb...|
|00035fb0dcfbeaa8b...|[t, minus, 1, wee...|
|00035fb0dcfbeaa8b...|[last, full, day,...|
+--------------------+--------------------+

# grouped: after 1grams extracted and Counters added
[('00035fb0dcfbeaa8bb70ffe24d614d4dcee446b803eb4063dccf14dd2a474611', Counter({'!': 545, '.': 373, 'the': 306, '"': 225, ...

# flat: after calculating sum and relative frequency for each 1gram
[['00035fb0dcfbeaa8bb70ffe24d614d4dcee446b803eb4063dccf14dd2a474611', 'face', 3, 0.000320547066994337], ['00035fb0dcfbeaa8bb70ffe24d614d4dcee446b803eb4063dccf14dd2a474611', 'was', 26, 0.002778074580617587] ....

# after flat RDD to DF
+--------------------+---------+---+--------------------+
|                  _1|       _2| _3|                  _4|
+--------------------+---------+---+--------------------+
|00035fb0dcfbeaa8b...|     face|  3| 3.20547066994337E-4|
|00035fb0dcfbeaa8b...|      was| 26|0.002778074580617587|
|00035fb0dcfbeaa8b...|      how| 22|0.002350678491291...|
+--------------------+---------+---+--------------------+
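
As a side note, the default _1 ... _4 column names in the last table can be replaced by passing names to toDF. A small sketch, assuming the same flat RDD as above (the chosen names are only illustrative):

# name the columns instead of relying on _1.._4
flat_df = flat.toDF(["user_id", "ngram", "count", "rel_freq"])
flat_df.write.csv("myNewCSV.csv", header=True)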


2 Answers:

Answer 0 (score: 13)

A natural approach could be to group the words into one list, and then use the Python function Counter() to generate word counts. For both steps we'll use udfs. First, the one that flattens the nested list resulting from collect_list() over multiple arrays:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

unpack_udf = udf(
    lambda l: [item for sublist in l for item in sublist],
    ArrayType(StringType())
)
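
On plain Python lists, the flattening lambda behaves like this (hypothetical values, shaped like what collect_list() produces over an array column):

nested = [["at", "the"], ["oecd", "with"]]  # one tokenized message per inner list
print([item for sublist in nested for item in sublist])
# ['at', 'the', 'oecd', 'with']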

Second, the udf that generates the word count tuples, or in our case structs:

from pyspark.sql.types import *
from collections import Counter

# We need to specify the schema of the return object
schema_count = ArrayType(StructType([
    StructField("word", StringType(), False),
    StructField("count", IntegerType(), False)
]))

count_udf = udf(
    lambda s: Counter(s).most_common(), 
    schema_count
)
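
To see what this returns per row, .most_common() on a plain Python token list yields (word, count) pairs, matching the array-of-structs schema above:

from collections import Counter

print(Counter(["what", "what", "does", "the"]).most_common())
# [('what', 2), ('does', 1), ('the', 1)]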

Putting it all together:

from pyspark.sql.functions import collect_list

(df.groupBy("id")
 .agg(collect_list("message").alias("message"))
 .withColumn("message", unpack_udf("message"))
 .withColumn("message", count_udf("message"))).show(truncate = False)
+-----------------+------------------------------------------------------+
|id               |message                                               |
+-----------------+------------------------------------------------------+
|10100718890699676|[[oecd,1], [the,1], [with,1], [at,1]]                 |
|10100720363468236|[[what,3], [me,1], [sad,1], [to,1], [does,1], [the,1]]|
+-----------------+------------------------------------------------------+
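
If a long format is preferred, the resulting array of structs can be exploded into one row per word. A sketch building on the same udfs (column names as above):

from pyspark.sql.functions import collect_list, explode

(df.groupBy("id")
 .agg(collect_list("message").alias("message"))
 .withColumn("message", unpack_udf("message"))
 .withColumn("message", count_udf("message"))
 .withColumn("wc", explode("message"))        # one (word, count) struct per row
 .select("id", "wc.word", "wc.count")
 .show(truncate = False))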

Data:

df = sc.parallelize([(10100720363468236,["what", "sad", "to", "me"]),
                     (10100720363468236,["what", "what", "does", "the"]),
                     (10100718890699676,["at", "the", "oecd", "with"])]).toDF(["id", "message"])

Answer 1 (score: 1)

Try:

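A minimal sketch of an explode-based word count per user, assuming message already holds an array of tokens as in the question:

from pyspark.sql.functions import explode

# one row per (user_id, word), then count occurrences of each word per user
(df.withColumn("word", explode("message"))
   .groupBy("user_id", "word")
   .count()
   .show(truncate = False))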