Preserving the original DataFrame structure after groupBy in PySpark

Date: 2017-07-07 15:58:41

Tags: apache-spark dataframe pyspark

I have a large CSV with some social media data:

message_id, user_id, message, date
"1", "123", "some message blah blah", "Sun May 12 15:08:58 +0000 2013"
"2", "123", "another message blah", "Sun June 12 15:08:58 +0000 2013"
"3", "123", "i want this message removed", "Sun June 12 15:08:58 +0000 2013"
"4", "321", "more blah", "Mon June 12 15:08:58 +0000 2013"

and I want to remove messages based on some condition within a group (for example, the group could be user_id).

Here is what I did: I wrote a plain function for my exclusion criteria, defined a UDF from it, and applied that function to the grouped data:

from pyspark.sql.functions import collect_list, udf
from pyspark.sql.types import ArrayType, StringType

def exclusion_criteria(data_list):
    # Keep only the messages that satisfy some_condition (placeholder).
    keep = []
    for d in data_list:
        if some_condition:
            keep.append(d)
    return keep

myUdf = udf(exclusion_criteria, ArrayType(StringType()))

# The file has a header row, so header=True gives the columns their names.
msgsDF = session.read.csv("data.csv", header=True)
filterMsgsDF = (msgsDF.groupBy("user_id")
    .agg(collect_list("message").alias("message"))
    .withColumn("message", myUdf("message")))

In the end I get something that looks like this:

filterMsgsDF.take(1)
[Row(user_id='123', message=['some message blah blah', 'another message blah'])]

But the problem is that I have thrown away the information associated with each message (message_id and date). What I ultimately want is:

["1", "123", "some message blah blah", "Sun May 12 15:08:58 +0000 2013"]
["2", "123", "another message blah", "Sun June 12 15:08:58 +0000 2013"]
["4", "321", "more blah", "Mon June 12 15:08:58 +0000 2013"]

Is there a way to join that other information back in, or to preserve it through the groupBy / agg step? Or maybe groupBy isn't the best approach here?

1 answer:

Answer 0 (score: 1)

Something like this:



    from pyspark.sql import Window
    from pyspark.sql.functions import collect_list

    # Collect each user's messages into a list over a window instead of a
    # groupBy, so every row (with its message_id and date) is preserved.
    filterMsgsDF = msgsDF.withColumn(
        'message_list',
        collect_list(msgsDF['message']).over(Window.partitionBy('user_id'))
    )
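
Building on that, here is a minimal sketch of how the filtering could then be completed. keep_message is a hypothetical boolean version of the exclusion criteria that decides per row, given the message and its group's full list, whether the row stays; the placeholder condition stands in for the real one:

    from pyspark.sql import Window
    from pyspark.sql.functions import collect_list, udf
    from pyspark.sql.types import BooleanType

    def keep_message(message, message_list):
        # Hypothetical per-row criterion: return True to keep this row.
        # Replace with the real condition; message_list provides the
        # group context if the decision depends on the other messages.
        return "removed" not in message

    keepUdf = udf(keep_message, BooleanType())

    result = (msgsDF
        .withColumn('message_list',
                    collect_list(msgsDF['message'])
                        .over(Window.partitionBy('user_id')))
        .filter(keepUdf('message', 'message_list'))
        .drop('message_list'))
    # All original columns (message_id, user_id, message, date) remain.

Because the window only adds a column rather than collapsing the group, no join back to the original data is needed.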