I'm trying to do some computation with PySpark. As a first step, I'm just counting the rows of a dataset that has already been split into several partitions (after running a query with sqoop).
On my first attempt I had 25 partitions of data and ran the computation on 5 executors, each with 3 cores and 15 GB of memory. The data amounts to about 2 GB (roughly 92M rows).
However, with those settings a simple sc.textFile() followed by a count() took 3 minutes to run, which seemed excessive. So I did some searching, changed a few settings and tested several combinations.
Right now I'm using 10 executors with 5 cores and 20 GB of memory each, and 150 partitions. That feels like overkill to me, but it gives the best performance, finishing everything in about 40 seconds. Still, given the amount of resources I'm dedicating to this and the amount of data I have, it seems like it should run even faster...
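For completeness, those settings map onto the standard Spark configuration properties below (just a sketch; the app name is a placeholder, and in practice the same values can also be passed as --conf flags to spark-submit):

from pyspark import SparkConf, SparkContext

# Second configuration described above: 10 executors, 5 cores and
# 20 GB of memory per executor. The keys are standard Spark/YARN
# properties; "row_count" is just a placeholder app name.
conf = (SparkConf()
        .setAppName("row_count")
        .set("spark.executor.instances", "10")
        .set("spark.executor.cores", "5")
        .set("spark.executor.memory", "20g"))
sc = SparkContext(conf=conf)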
I also tried putting a cache() after the sc.textFile() to see whether it would improve performance (since, I told myself, reading the files was probably what was hurting performance), but it actually slowed things down, and when I ran the cache() some executors even failed with the following message:
Container killed by YARN for exceeding memory limits. 22.4 GB of 22 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead
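Per the message's own suggestion, that overhead can be raised through the spark.yarn.executor.memoryOverhead property (sketch only; the value is in MB, and the 4096 below is an illustrative guess rather than something I have tuned):

from pyspark import SparkConf

# Sketch: raise the per-executor YARN overhead (in MB) as the error
# message suggests; 4096 is just an example value.
conf = SparkConf().set("spark.yarn.executor.memoryOverhead", "4096")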
Now, I would also really like to understand where those 22.4 GB come from, considering my files add up to 2 GB of data (unless sqoop applies some kind of compression that gets undone when the data is loaded, which I'm not aware of at this point; I just took the files as sqoop produced them).
Is there anything else I can do to try to improve performance? Or is there some kind of diagnostics I can run to better pinpoint the cause of the problem?
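The only diagnostic I've thought of so far is something along these lines (just a sketch), checking how evenly the rows end up spread across partitions, since a few oversized partitions would dominate the count:

t = sc.textFile("/user/bc/input/proj1/train")

# How many partitions the RDD actually ends up with after reading
print(t.getNumPartitions())

# Rows per partition; glom() turns each partition into a list, so a very
# uneven distribution here would mean a few tasks do most of the work
print(t.glom().map(len).collect())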
Edit:
As requested, the input consists of the .deflate files sqoop generates after running the query (in other words, the raw file contents are gibberish to us). The most important details about the input are that it has roughly 92M rows in total, each row has 204 columns, and it is very sparse. However, when it is first read in, each row comes back as a single string rather than directly as an array of 204 values. An example row is:
u'20,200,0,1,0,1,0,0,36.228189024369268,1,0,1,1,0,0,1,0,0,0,0,1,1,1,1,1,18.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.333333,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.333333,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.333333,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0,0,22,0,0,0'
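(For context only: downstream I would split each line into its 204 values with something like the helper below; that parsing is not part of the count I'm timing.)

def parse_line(line):
    # Hypothetical helper, not part of the timed job: split one
    # comma-separated row into its 204 float values.
    return [float(x) for x in line.split(",")]

# e.g.: parsed = t.map(parse_line)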
As for the code, there's really not much to it; it just reads the files and counts the rows:
t = sc.textFile("/user/bc/input/proj1/train")#.cache()
t.count()
where /user/bc/input/proj1/train is the directory where all 150 .deflate files live (and hence all 150 partitions).
Update:
Also, for some reason, re-running the code with cache() today did not give me a memory error. It seems .deflate really does compress the files, since the UI tells me 73% of the RDD has been cached, using 5.2 GB of memory. Still, that is a long way from yesterday's 22.4 GB.
Even so, running the count on the cached RDD still gave a worse result than before (52 seconds).
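One variation I could still try (sketch only, I haven't verified that it actually helps) is persisting with a storage level that can spill to disk, so partitions that don't fit in memory don't push the executors over the YARN limit the way the plain cache() did:

from pyspark import StorageLevel

t = sc.textFile("/user/bc/input/proj1/train")

# cache() is MEMORY_ONLY; MEMORY_AND_DISK instead spills partitions
# that don't fit in memory to local disk rather than dropping them.
t.persist(StorageLevel.MEMORY_AND_DISK)
t.count()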
UPDATE2:
So, I still haven't been able to solve this, but I can provide some more details... not sure how much they will help, but it's worth a try anyway:
Since I've tried a lot of different configurations and nothing seems to work, I've really run out of ideas about what might be causing this slowness... Maybe it has something to do with the cluster itself, and not necessarily with Spark? Any suggestions and/or requests for further clarification are very welcome :-)