Question

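I have a DataFrame (about 50 GB) and 3 lists (a few hundred elements each, 445005 in total).

I need to check whether the value in the url column matches any combination from the 3 lists, and return that combination. This is how I do it: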

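// Note: `url` is not defined in this snippet; presumably it is the value of
// the url column for the current row.
def checkMatch(query1: List[String], query2: List[String], model: List[String]): List[(String, String, String)] = {
  for {
    x <- query1
    y <- query2
    z <- model
    if url.contains(x) && url.contains(y) && url.contains(z)
  } yield (x, y, z)
}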

This fails. The application does not stop, but every task fails with

ExecutorLostFailure (executor 26 exited unrelated to the running tasks) Reason: Container container on host: host was preempted.

The application kept running until I killed it, and no task ever completed.

All the errors I found point to a lack of memory. My configuration is

spark-submit \
--class Main \
--master yarn \
--deploy-mode client \
--num-executors 200 \
--executor-cores 10 \
--driver-memory 4G \
--executor-memory 8G \
--files hive-site.xml#hive-site.xml \
--conf spark.task.maxFailures=10 \
--conf spark.executor.memory=8G \
--conf spark.app.name=spark-job \
--conf spark.yarn.executor.memoryOverhead=4096 \
--conf spark.yarn.driver.memoryOverhead=2048 \
--conf spark.shuffle.service.enabled=true \
--conf spark.shuffle.consolidateFiles=true \
--conf spark.broadcast.compress=true \
--conf spark.shuffle.compress=true \
--conf spark.shuffle.spill.compress=true \
--conf spark.network.timeout=10000000 \
--conf spark.executor.heartbeatInterval=10000000 \

I tried allocating 2-3 times more memory. It did not help.

What other solutions are there? And is a lack of memory actually the cause?

Answer 1

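This code runs on the driver, because you have not used any distributed operation such as map or reduce, and that is why it takes so long. Since these are all just strings, could the lists be flattened into a single list? That would make the map-reduce step easier.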
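
Separately from where the code runs, the per-URL work itself can probably be reduced. The sketch below is one possible way to do that, not necessarily what this answer has in mind: instead of testing every (x, y, z) combination, it filters each list once against the URL and only combines the survivors, which yields the same tuples in the same order. The explicit `url` parameter and the object name `UrlMatch` are assumptions added for illustration; the question's snippet does not show where `url` comes from.

```scala
object UrlMatch {
  // The question's approach, with `url` passed in explicitly (an assumption):
  // every (x, y, z) combination is tested against the URL.
  def checkMatchNaive(url: String,
                      query1: List[String],
                      query2: List[String],
                      model: List[String]): List[(String, String, String)] =
    for {
      x <- query1
      y <- query2
      z <- model
      if url.contains(x) && url.contains(y) && url.contains(z)
    } yield (x, y, z)

  // Alternative sketch: keep only the elements of each list that occur in
  // the URL, then build combinations from those (usually very short) lists.
  // The number of `contains` calls is the sum of the list sizes rather than
  // a multiple of their product.
  def checkMatch(url: String,
                 query1: List[String],
                 query2: List[String],
                 model: List[String]): List[(String, String, String)] = {
    val xs = query1.filter(s => url.contains(s))
    val ys = query2.filter(s => url.contains(s))
    val zs = model.filter(s => url.contains(s))
    for {
      x <- xs
      y <- ys
      z <- zs
    } yield (x, y, z)
  }
}
```

In a Spark job a function like this would typically be applied per row on the executors, for example through a UDF with the three lists broadcast; that wiring is left out here because the question does not show how `checkMatch` is called.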
