I have a Python Spark program that behaves inconsistently and errors out in some cases.
I usually run it on a small EMR cluster with a c3.2xlarge master and two m1.large workers, and there it runs fine and completes successfully.
However, when I run the exact same program on a larger cluster - I tried a c3.2xlarge master with 4 m1.large workers - it ends with errors. I'll paste the errors below, but even those are not consistent: neither the stack trace itself nor the stage at which the error occurs is the same from run to run.
For example, in one case it happened after about 26 minutes, inside a .count() call, while in another run it actually got past the .count() successfully but then errored out about an hour later, at a different stage, in a call to .write.jdbc().
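For context, the out() function from the tracebacks below (output.py) boils down to something like this - a simplified sketch reconstructed from the tracebacks; args.jdbc_url is a hypothetical placeholder, not the real code:

    # Simplified sketch of out(), reconstructed from the tracebacks below;
    # args.jdbc_url is a hypothetical placeholder.
    def out(sc, args, table_name, sql_create_table, data):
        if data.count() > 0:                    # the .count() that failed in one run
            data.write.jdbc(url=args.jdbc_url,
                            table=table_name,
                            mode='append')      # the .write.jdbc() that failed in the other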
So I suspect some kind of race condition, but I'm not even sure whether it's caused by me using Spark incorrectly or by a bug in Spark itself.
Most of the functionality I use here comes from spark.sql.
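To give a sense of the kind of spark.sql usage involved, the queries are in the style of the following sketch (the column names and the PARTITION BY / ORDER BY / * 100.0 pattern are taken from the physical plans below; the source table name companies_raw is just illustrative):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext()
    sqlContext = HiveContext(sc)  # window functions require HiveContext on Spark 1.5

    # percent_rank() partitioned by company size, scaled to 0-100, as in the
    # plans below; 'companies_raw' is a hypothetical table name.
    companies = sqlContext.sql("""
        SELECT id, href, name, emp_count, ref_company_size_id,
               percent_rank() OVER (
                   PARTITION BY ref_company_size_id
                   ORDER BY emp_count
               ) * 100.0 AS emp_count_percentile
        FROM companies_raw
    """)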
Environment: Spark 1.5.2 on EMR (Elastic MapReduce on AWS).
The stack traces are long, so I can't paste them here in full, but hopefully there's enough to get the context.
As for the code itself - well, there's a lot of it, and I haven't managed to boil it down to a simple repro case I could easily post here... (race conditions, you know...)
As mentioned, this is only part of the stack trace; it gets very long (example one below):
Note that the errors in the two cases occur at different places in the code.
Any help or pointers on how to work around this would be appreciated.
Cheers
Traceback (most recent call last):
  File "/home/hadoop/rantav.spark_normalize_data.py.134231/spark_normalize_data.py", line 102, in <module>
    run_spark(args)
  File "/home/hadoop/rantav.spark_normalize_data.py.134231/spark_normalize_data.py", line 62, in run_spark
    company_aliases_broadcast, experiences, args)
  File "/home/hadoop/rantav.spark_normalize_data.py.134231/companies.py", line 50, in get_companies
    out(sc, args, 'companies', sql_create_table, companies)
  File "/home/hadoop/rantav.spark_normalize_data.py.134231/output.py", line 48, in out
    mode='append')
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 455, in jdbc
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 36, in deco
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o464.jdbc.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
TungstenExchange hashpartitioning(ref_company_size_id#96)
TungstenProject [id#85,href#86,name#87,emp_count#94L,emp_count_percentile#175,ref_company_size_id#96,growth_rate#191,growth_rate_percentile#193,avg_tenure#196,avg_tenure_percentile#197,churn_rate#200,churn_rate_percentile#202,retention_rate_2y#210]
SortMergeOuterJoin [id#85], [company_id#180], LeftOuter, None
TungstenSort [id#85 ASC], false, 0
TungstenExchange hashpartitioning(id#85)
TungstenProject [id#85,href#86,name#87,emp_count#94L,emp_count_percentile#175,ref_company_size_id#96,growth_rate#191,growth_rate_percentile#193,avg_tenure#196,avg_tenure_percentile#197,churn_rate#200,(_we0#203 * 100.0) AS churn_rate_percentile#202]
Window [id#85,href#86,name#87,emp_count#94L,emp_count_percentile#175,ref_company_size_id#96,growth_rate#191,growth_rate_percentile#193,avg_tenure#196,avg_tenure_percentile#197,churn_rate#200], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentRank(churn_rate#200) WindowSpecDefinition ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS _we0#203], [ref_company_size_id#96], [churn_rate#200 ASC]
TungstenSort [ref_company_size_id#96 ASC,churn_rate#200 ASC], false, 0
TungstenExchange hashpartitioning(ref_company_size_id#96)
TungstenProject [id#85,href#86,name#87,emp_count#94L,emp_count_percentile#175,ref_company_size_id#96,growth_rate#191,growth_rate_percentile#193,avg_tenure#196,avg_tenure_percentile#197,(100.0 * cast(pythonUDF#201 as double)) AS churn_rate#200]
!BatchPythonEvaluation PythonUDF#divide(count#199L,emp_count#94L), [id#85,href#86,name#87,emp_count#94L,emp_count_percentile#175,ref_company_size_id#96,growth_rate#191,growth_rate_percentile#193,avg_tenure#196,avg_tenure_percentile#197,count#199L,pythonUDF#201]
ConvertToSafe
TungstenProject [id#85,href#86,name#87,emp_count#94L,emp_count_percentile#175,ref_company_size_id#96,growth_rate#191,growth_rate_percentile#193,avg_tenure#196,avg_tenure_percentile#197,count#199L]
SortMergeOuterJoin [id#85], [company_id#180], LeftOuter, None
TungstenSort [id#85 ASC], false, 0
TungstenExchange hashpartitioning(id#85)
TungstenProject [id#85,href#86,name#87,emp_count#94L,emp_count_percentile#175,ref_company_size_id#96,growth_rate#191,growth_rate_percentile#193,avg_tenure#196,(_we0#198 * 100.0) AS avg_tenure_percentile#197]
Window [id#85,href#86,name#87,emp_count#94L,emp_count_percentile#175,ref_company_size_id#96,growth_rate#191,growth_rate_percentile#193,avg_tenure#196], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentRank(avg_tenure#196) WindowSpecDefinition ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS _we0#198], [ref_company_size_id#96], [avg_tenure#196 ASC]
TungstenSort [ref_company_size_id#96 ASC,avg_tenure#196 ASC], false, 0
TungstenExchange hashpartitioning(ref_company_size_id#96)
TungstenProject [id#85,href#86,name#87,emp_count#94L,emp_count_percentile#175,ref_company_size_id#96,growth_rate#191,growth_rate_percentile#193,avg(duration_days)#195 AS avg_tenure#196]
SortMergeOuterJoin [id#85], [company_id#180], LeftOuter, None
TungstenSort [id#85 ASC], false, 0
TungstenExchange hashpartitioning(id#85)
TungstenProject [id#85,href#86,name#87,emp_count#94L,emp_count_percentile#175,ref_company_size_id#96,growth_rate#191,(_we0#194 * 100.0) AS growth_rate_percentile#193]
Window [id#85,href#86,name#87,emp_count#94L,emp_count_percentile#175,ref_company_size_id#96,growth_rate#191], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentRank(growth_rate#191) WindowSpecDefinition ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS _we0#194], [ref_company_size_id#96], [growth_rate#191 ASC]
TungstenSort [ref_company_size_id#96 ASC,growth_rate#191 ASC], false, 0
TungstenExchange hashpartitioning(ref_company_size_id#96)
TungstenProject [id#85,href#86,name#87,emp_count#94L,emp_count_percentile#175,ref_company_size_id#96,(100.0 * cast(pythonUDF#192 as double)) AS growth_rate#191]
!BatchPythonEvaluation PythonUDF#divide(count#190L,emp_count#94L), [id#85,href#86,name#87,emp_count#94L,emp_count_percentile#175,ref_company_size_id#96,count#190L,pythonUDF#192]
ConvertToSafe
TungstenProject [id#85,href#86,name#87,emp_count#94L,emp_count_percentile#175,ref_company_size_id#96,count#190L]
SortMergeOuterJoin [id#85], [company_id#180], LeftOuter, None
TungstenSort [id#85 ASC], false, 0
TungstenExchange hashpartitioning(id#85)
TungstenProject [id#85,href#86,name#87,emp_count#94L,(_we0#176 * 100.0) AS emp_count_percentile#175,ref_company_size_id#96]
Window [id#85,href#86,name#87,emp_count#94L,ref_company_size_id#96], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentRank(emp_count#94L) WindowSpecDefinition ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS _we0#176], [ref_company_size_id#96], [emp_count#94L ASC]
TungstenSort [ref_company_size_id#96 ASC,emp_count#94L ASC], false, 0
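For reference, the !BatchPythonEvaluation PythonUDF#divide(...) node in the plan above comes from a small Python UDF, registered roughly like this (a sketch; the actual implementation may differ):

    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    # Sketch of the UDF referenced as PythonUDF#divide in the plan above;
    # the real function is slightly longer.
    def divide(numerator, denominator):
        # the denominator can be NULL or zero after the outer join
        if numerator is None or not denominator:
            return None
        return float(numerator) / denominator

    divide_udf = udf(divide, DoubleType())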
And here's another example, with a different stack trace (same program, same number of workers):
/home/hadoop/rantav.spark_normalize_data.py.093920/pymysql/cursors.py:146: Warning: Can't create database 'v2'; database exists
  result = self._query(query)
/home/hadoop/rantav.spark_normalize_data.py.093920/pymysql/cursors.py:146: Warning: Table 'oplog' already exists
  result = self._query(query)
Traceback (most recent call last):
  File "/home/hadoop/rantav.spark_normalize_data.py.093920/spark_normalize_data.py", line 102, in <module>
    run_spark(args)
  File "/home/hadoop/rantav.spark_normalize_data.py.093920/spark_normalize_data.py", line 62, in run_spark
    company_aliases_broadcast, experiences, args)
  File "/home/hadoop/rantav.spark_normalize_data.py.093920/companies.py", line 50, in get_companies
    out(sc, args, 'companies', sql_create_table, companies)
  File "/home/hadoop/rantav.spark_normalize_data.py.093920/output.py", line 35, in out
    if data.count() > 0:
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 268, in count
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 36, in deco
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o469.count.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count#216L])
TungstenExchange SinglePartition
TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCount#219L])
TungstenProject
Window [id#85,href#86,name#87,emp_count#94L,emp_count_percentile#175,ref_company_size_id#96,growth_rate#191,growth_rate_percentile#193,avg_tenure#196,avg_tenure_percentile#197,churn_rate#200,churn_rate_percentile#202,retention_rate_2y#210], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentRank(retention_rate_2y#210) WindowSpecDefinition ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS _we0#213], [ref_company_size_id#96], [retention_rate_2y#210 ASC]
TungstenSort [ref_company_size_id#96 ASC,retention_rate_2y#210 ASC], false, 0
TungstenExchange hashpartitioning(ref_company_size_id#96)
TungstenProject [id#85,href#86,name#87,emp_count#94L,emp_count_percentile#175,ref_company_size_id#96,growth_rate#191,growth_rate_percentile#193,avg_tenure#196,avg_tenure_percentile#197,churn_rate#200,churn_rate_percentile#202,retention_rate_2y#210]
SortMergeOuterJoin [id#85], [company_id#180], LeftOuter, None
TungstenSort [id#85 ASC], false, 0
TungstenExchange hashpartitioning(id#85)
TungstenProject [id#85,href#86,name#87,emp_count#94L,emp_count_percentile#175,ref_company_size_id#96,growth_rate#191,growth_rate_percentile#193,avg_tenure#196,avg_tenure_percentile#197,churn_rate#200,(_we0#203 * 100.0) AS churn_rate_percentile#202]
Window [id#85,href#86,name#87,emp_count#94L,emp_count_percentile#175,ref_company_size_id#96,growth_rate#191,growth_rate_percentile#193,avg_tenure#196,avg_tenure_percentile#197,churn_rate#200], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentRank(churn_rate#200) WindowSpecDefinition ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS _we0#203], [ref_company_size_id#96], [churn_rate#200 ASC]
TungstenSort [ref_company_size_id#96 ASC,churn_rate#200 ASC], false, 0
TungstenExchange hashpartitioning(ref_company_size_id#96)
TungstenProject [id#85,href#86,name#87,emp_count#94L,emp_count_percentile#175,ref_company_size_id#96,growth_rate#191,growth_rate_percentile#193,avg_tenure#196,avg_tenure_percentile#197,(100.0 * cast(pythonUDF#201 as double)) AS churn_rate#200]
!BatchPythonEvaluation PythonUDF#divide(count#199L,emp_count#94L), [id#85,href#86,name#87,emp_count#94L,emp_count_percentile#175,ref_company_size_id#96,growth_rate#191,growth_rate_percentile#193,avg_tenure#196,avg_tenure_percentile#197,count#199L,pythonUDF#201]
ConvertToSafe
TungstenProject [id#85,href#86,name#87,emp_count#94L,emp_count_percentile#175,ref_company_size_id#96,growth_rate#191,growth_rate_percentile#193,avg_tenure#196,avg_tenure_percentile#197,count#199L]
SortMergeOuterJoin [id#85], [company_id#180], LeftOuter, None
TungstenSort [id#85 ASC], false, 0
TungstenExchange hashpartitioning(id#85)