Spark count() takes a very long time

Date: 2017-07-06 14:40:56

Tags: apache-spark pyspark

I am trying to do some computation with PySpark. As a first step I am simply counting the rows of my data, which is already split across several partitions (the output of a sqoop query).

On the first attempt I had the data in 25 partitions and ran the computation on 5 executors, each with 3 cores and 15 GB of memory. The data comes to about 2 GB (more or less 92M rows).

With those settings, however, a simple sc.textFile() followed by a count() took 3 minutes to run, which seemed excessive. So I did some reading, changed a few settings and tested several combinations.

I am now using 10 executors, 5 cores, 20 GB of memory and 150 partitions. That seems like overkill to me, but it is the setup that gives me the best performance, finishing everything in about 40 seconds. Even so, for the amount of resources I am throwing at this and the amount of data I have, it feels like it should still be faster...
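For reference, a minimal sketch of how that configuration could be set up from PySpark (the app name is a placeholder of mine; on YARN these values are often passed as spark-submit flags instead):

from pyspark import SparkConf, SparkContext

# Sketch of the configuration described above; "count-train-rows" is a
# hypothetical app name, and on YARN these are usually spark-submit flags.
conf = (SparkConf()
        .setAppName("count-train-rows")
        .set("spark.executor.instances", "10")   # 10 executors
        .set("spark.executor.cores", "5")        # 5 cores per executor
        .set("spark.executor.memory", "20g"))    # 20 GB per executor
sc = SparkContext(conf=conf)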

I also tried adding a cache() after the sc.textFile() to see whether it would improve performance (since, I told myself, reading the file was probably what was hurting it), but it actually slowed things down, and when I ran the cache() some executors even failed with the following message:

Container killed by YARN for exceeding memory limits. 22.4 GB of 22 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead
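For what it is worth, the overhead the message refers to is just a Spark setting that can be raised when the context is created; a minimal sketch (the 4096 MiB value is an arbitrary example of mine, not a tuned recommendation):

from pyspark import SparkConf, SparkContext

# Sketch: boost spark.yarn.executor.memoryOverhead as the error suggests.
# The value is in MiB and purely illustrative, not a tuned recommendation.
conf = SparkConf().set("spark.yarn.executor.memoryOverhead", "4096")
sc = SparkContext(conf=conf)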

Now, I would also really like to understand where those 22.4 GB come from, considering that my files add up to 2 GB of data (unless sqoop applies some sort of compression that gets undone when the data is loaded, which I am not aware of; at this point I am just grasping at straws).

What else can I do to try to improve performance? Or is there some kind of diagnostic I could run to better pinpoint the cause of the problem?

Edit:

As requested: the input files are the .deflate files generated by sqoop after running the query (in other words, their contents are gibberish to us). The most important details about the input are that it has roughly 92M rows in total, each row has 204 columns, and it is very sparse. However, when it is first read in, each row comes in as a single string rather than directly as an array of 204 values. An example is:

u'20,200,0,1,0,1,0,0,36.228189024369268,1,0,1,1,0,0,1,0,0,0,0,1,1,1,1,1,18.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.333333,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.333333,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.333333,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0,0,22,0,0,0'
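(Just to illustrate what that string representation implies: turning one of those lines into an actual array of values would take an extra step along the lines of the sketch below. The counting code shown further down never does this; it only counts the raw lines.)

# Illustration only: parsing one comma-separated line into 204 float values.
# The counting job below never does this; it counts the raw string lines as-is.
def parse_line(line):
    return [float(x) for x in line.split(",")]

parsed = sc.textFile("/user/bc/input/proj1/train").map(parse_line)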

As for the code, there is really nothing to it; it just reads the file and counts it:

t = sc.textFile("/user/bc/input/proj1/train")#.cache()
t.count()

where /user/bc/input/proj1/train is the directory containing all 150 .deflate files (and hence all 150 partitions).
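As a quick sanity check on the partitioning (as far as I know a .deflate file is not splittable, so each file should map to exactly one partition), the partition count can be inspected directly:

# Quick diagnostic: how many partitions did textFile() actually create?
# With non-splittable .deflate files this should be one per input file (150 here).
t = sc.textFile("/user/bc/input/proj1/train")
print(t.getNumPartitions())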

Update:

Also, for some reason, re-running the code with cache() today did not give me the memory error. It seems .deflate really does compress the files, since the UI tells me that 73% of the RDD was cached, taking up 5.2 GB of memory. Still a long way from yesterday's 22.4 GB, though. Even so, running the count on the cached RDD was still slower than before (52 seconds).
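Since only part of the RDD fits in memory, one variation worth sketching (my own idea, not something I have benchmarked yet) is persisting with a storage level that spills the remainder to local disk instead of dropping it:

from pyspark import StorageLevel

# Sketch: cache() defaults to MEMORY_ONLY, so partitions that do not fit are
# dropped and recomputed; MEMORY_AND_DISK spills them to local disk instead.
t = sc.textFile("/user/bc/input/proj1/train").persist(StorageLevel.MEMORY_AND_DISK)
t.count()  # the first count materializes the cache; later actions reuse it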

Update 2:

So, I still have not managed to solve this, but I can provide a few more details... not sure how much they will help, but here goes anyway:

  • Changed the configuration (again). Now using 8 executors with 4 cores each, 20 GB of executor memory and 5 GB of driver memory. The dataset is split into 32 partitions (so, one per core). I decided to reduce the number of executors because it is a shared cluster and I was basically hogging all the resources for myself, lol. So, yeah.
  • Reduced the skew across partitions. As of now each file is roughly 80 MB. This helped the tasks run a bit faster (the skew really was not that big to begin with), but overall it is still slow. (A sketch of doing this rebalancing in Spark itself follows this list.)
  • Looking at the stage details in the Spark UI, all of the reported times (scheduler delay, task deserialization time, GC time, result serialization time, getting result time) look fine. Deserialization takes about 0.5 s and GC about 0.3 s for almost every task, with a few exceptions (one or two tasks take 1 s for GC). Everything else is on the order of milliseconds or less.
  • The only thing that takes long is the "Duration" (median of 35 s, maximum of 1 minute), which, if I am interpreting it correctly, means that Spark is spending the time actually executing the count rather than managing the information being passed around, deserializing, and so on.
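As mentioned in the second point above, the rebalancing can also be done inside Spark rather than by regenerating the files; a rough sketch (32 matches the one-partition-per-core layout above, and repartition() does trigger a shuffle):

# Sketch: rebalance the input into 32 evenly sized partitions
# (one per core in the 8-executor x 4-core setup). repartition() shuffles the data.
t = sc.textFile("/user/bc/input/proj1/train").repartition(32)
t.count()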

Since I have tried a lot of different configurations and nothing seems to work, I am running out of ideas about what could be causing this slowness... Maybe it has something to do with the cluster itself and not necessarily with Spark? Any suggestions and/or requests for further clarification are very welcome :-)
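One cheap diagnostic I could still run (sketched below, not yet tried) is counting an RDD that involves no file I/O at all, to separate "the cluster is slow" from "reading this data is slow":

# Baseline sketch: count 92M synthetic rows with no reading or decompression.
# If this is also slow, the cluster/scheduling is suspect; if it is fast,
# the cost is likely in reading and inflating the .deflate files.
baseline = sc.range(0, 92 * 10**6, numSlices=32)
print(baseline.count())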

0 Answers:

No answers yet