Here's the idea behind my code:
I have a large RDD of email data named email, containing roughly 700 million emails. Each record looks like this:
[['value1','value2','value3','value4'],['recipient1','recipient2','recipient3'],['sender']]
There are more than 40,000 distinct recipient and sender email addresses in email. I have a list of 600 email addresses I'm interested in, which looks like this:
relevant_emails = ['rel_email1','rel_email2','rel_email3',...,'rel_email600']
I want to go through my large RDD email and keep only the emails whose sender and recipients both belong to the relevant_emails list. So I broadcast relevant_emails so that each worker node has a copy: broadcast_emails = sc.broadcast(relevant_emails).
Here is the function I want to apply to each row of email:
def get_relevant_emails(row):
    r_bool = False
    s_bool = False
    recipients = row[1]
    sender = row[2]
    if sender[0] in broadcast_emails.value:
        s_bool = True
    for x in range(0, len(recipients)):
        if recipients[x] in broadcast_emails.value:
            r_bool = True
            break
    if (r_bool is True and s_bool is True):
        return row
The problem I'm running into is that when I run email.map(lambda row: get_relevant_emails(row)) and then follow it with something that forces execution (e.g. saveAsTextFile()), it starts running and then emits:
WARN: Stage 5 contains a task of very large size (xxxx KB). The maximum recommended task size is 100 KB
Then it stops running. FYI: I'm running this in the Spark shell with 20 executors, 10 GB of memory per executor, and 3 cores per executor. In terms of block storage consumed on HDFS, email is 76.7 GB, and I have it in 600 partitions (76.7 GB / 128 MB).
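For completeness, the pipeline I run looks roughly like this (a minimal sketch; the output path is just a placeholder):

relevant = email.map(lambda row: get_relevant_emails(row))
# Forces execution; this is the point at which the task-size warning appears.
relevant.saveAsTextFile("hdfs:///tmp/relevant_emails_out")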
Answer 0 (score: 1)
The task size the warning refers to is likely due to the number of variables assigned inside the get_relevant_emails() function. Another way to exceed the recommended maximum task size is to reference other variables from outside the function's scope.
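To illustrate that second point (a hypothetical sketch, not taken from your code): capturing a driver-side list directly in the closure ships it with every task, whereas going through the broadcast variable keeps each task's closure small:

# Captures relevant_emails from the driver program; the list is
# serialized into every task's closure and inflates the task size.
bad = email.filter(lambda row: row[2][0] in relevant_emails)

# References the broadcast variable instead; each executor fetches
# the value once and the per-task closure stays small.
good = email.filter(lambda row: row[2][0] in broadcast_emails.value)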
In any case, I'd suggest using the DataFrame API, because it makes this operation simpler and it will perform better. It's faster because it does all the heavy lifting on the JVM side and avoids marshalling data back and forth between the Python and Java VMs. My team and I ported most of our existing Python logic to SparkSQL and DataFrames and saw a massive performance improvement.
Here's how it could work for your case:
from pyspark import SparkContext, SQLContext
from pyspark.sql.functions import broadcast, expr

sc = SparkContext()
sql_ctx = SQLContext(sc)

email = [
    [['value1','value2','value3','value4'], ['recipient1','recipient2','recipient3'], ['sender1']],
    [['value1','value2','value3','value4'], ['recipient1','recipient2','recipient3'], ['sender2']],
    [['value1','value2','value3','value4'], ['recipient1','recipient4','recipient5'], ['sender3']]
]

relevant_addresses = [
    ["sender2"],
    ["sender3"],
    ["recipient3"]
]

email_df = sql_ctx.createDataFrame(email, ["values", "recipients", "sender"])
relevant_df = sql_ctx.createDataFrame(relevant_addresses, ["address"])

broadcasted_relevant = broadcast(relevant_df)

result = email_df.join(
    broadcasted_relevant,
    on=expr("array_contains(recipients, address) OR array_contains(sender, address)"),
    how="leftsemi"
)

result.collect()
The left semi join here acts like a filter, selecting only the matching rows from email_df. It's the same kind of join that happens when you use a "WHERE ... IN" clause in SQL.
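Note that the join condition above uses OR, so it keeps an email if either the sender or any recipient is relevant. If you need the stricter condition from your original function (the sender AND at least one recipient must both be in the list), one way to express that, assuming the same DataFrames as above, is to chain two left semi joins:

# The first semi join keeps rows whose sender is a relevant address;
# the second keeps only those that also have a relevant recipient.
result_strict = (
    email_df
    .join(broadcasted_relevant,
          on=expr("array_contains(sender, address)"),
          how="leftsemi")
    .join(broadcasted_relevant,
          on=expr("array_contains(recipients, address)"),
          how="leftsemi")
)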