I have a large CSV with some social media data:
message_id, user_id, message, date
"1", "123", "some message blah blah", "Sun May 12 15:08:58 +0000 2013"
"2", "123", "another message blah", "Sun June 12 15:08:58 +0000 2013"
"3", "123", "i want this message removed", "Sun June 12 15:08:58 +0000 2013"
"4", "321", "more blah", "Mon June 12 15:08:58 +0000 2013"
and I want to remove messages based on some condition within a group (for example, the group could be user_id).
Here is what I did: I wrote a function for my exclusion criteria, defined a udf from it, and then applied that udf to the grouped data:
from pyspark.sql.functions import collect_list, udf
from pyspark.sql.types import ArrayType, StringType

def exclusion_criteria(data_list):
    # Keep only the messages that satisfy the (unspecified) condition.
    keep = []
    for d in data_list:
        if some_condition:
            keep.append(d)
    return keep

myUdf = udf(exclusion_criteria, ArrayType(StringType()))
msgsDF = session.read.csv("data.csv", header=False)
filterMsgsDF = msgsDF.groupBy("user_id").agg(collect_list("message")
    .alias("message")).withColumn("message", myUdf("message"))
In the end I get something that looks like this:
filterMsgsDF.take(1)
[Row(user_id='123', _c2=['some message blah blah', 'another message blah'])]
But the problem is that I have lost the information associated with each message (message_id and date). What I ultimately want is
["1", "123", "some message blah blah", "Sun May 12 15:08:58 +0000 2013"]
["2", "123", "another message blah", "Sun June 12 15:08:58 +0000 2013"]
["4", "321", "more blah", "Mon June 12 15:08:58 +0000 2013"]
Is there a way to join the other information back in, or to keep it through the groupBy / agg step? Or is groupBy perhaps not the best approach?
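To pin down the expected result independently of Spark, here is a minimal plain-Python sketch of the same group-wise filter over the sample rows. The `should_keep` predicate and the `filter_within_groups` helper are hypothetical; `should_keep` only stands in for the unspecified `some_condition` and happens to drop the one message marked for removal.

```python
from collections import defaultdict

# Sample rows from the question: (message_id, user_id, message, date).
rows = [
    ("1", "123", "some message blah blah", "Sun May 12 15:08:58 +0000 2013"),
    ("2", "123", "another message blah", "Sun June 12 15:08:58 +0000 2013"),
    ("3", "123", "i want this message removed", "Sun June 12 15:08:58 +0000 2013"),
    ("4", "321", "more blah", "Mon June 12 15:08:58 +0000 2013"),
]


def should_keep(message, group_messages):
    # Hypothetical stand-in for `some_condition`. `group_messages` is the
    # list of all messages from the same user, available for group-level rules.
    return "removed" not in message


def filter_within_groups(rows):
    # Collect each user's messages first, so the predicate can see the group.
    groups = defaultdict(list)
    for _message_id, user_id, message, _date in rows:
        groups[user_id].append(message)
    # Then filter whole rows, so message_id and date are never dropped.
    return [row for row in rows if should_keep(row[2], groups[row[1]])]


kept = filter_within_groups(rows)
```

The point of the sketch is that the decision is made per group but applied per row, which is why the extra columns survive.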
Answer 0: (score: 1)
Something like this:
from pyspark.sql import Window  # collect_list comes from pyspark.sql.functions
filterMsgsDF = msgsDF.withColumn('message_list', collect_list(msgsDF['message']).over(Window.partitionBy('user_id')))