PySpark serialization EOFError

Asked: 2016-04-12 00:57:00

Tags: python, apache-spark, pyspark, apache-spark-1.6

I am reading in a CSV as a Spark DataFrame and performing machine learning operations upon it. I keep getting a Python serialization EOFError - any idea why? I thought it might be a memory issue - i.e. file exceeding available RAM - but drastically reducing the size of the DataFrame didn't prevent the EOF error.

Toy code and error below.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.feature import RFormula
from pyspark.ml.classification import RandomForestClassifier

# set spark context
conf = SparkConf().setMaster("local").setAppName("MyApp")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# read in 500 MB csv as DataFrame
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',
    inferschema='true').load('myfile.csv')

# get dataframe into machine learning format
r_formula = RFormula(formula="outcome ~ .")
mldf = r_formula.fit(df).transform(df)

# fit random forest model
rf = RandomForestClassifier(numTrees=3, maxDepth=2)
model = rf.fit(mldf)
result = model.transform(mldf).head()

Running the above code with spark-submit on a single node repeatedly throws the following error, even if the size of the DataFrame is reduced prior to fitting the model (e.g. tinydf = df.sample(False, 0.00001)):

Traceback (most recent call last):
  File "/home/hduser/spark1.6/python/lib/pyspark.zip/pyspark/daemon.py", line 157, 
     in manager
  File "/home/hduser/spark1.6/python/lib/pyspark.zip/pyspark/daemon.py", line 61, 
     in worker
  File "/home/hduser/spark1.6/python/lib/pyspark.zip/pyspark/worker.py", line 136, 
     in main if read_int(infile) == SpecialLengths.END_OF_STREAM:
  File "/home/hduser/spark1.6/python/lib/pyspark.zip/pyspark/serializers.py", line 545, 
     in read_int
    raise EOFError
  EOFError

3 Answers:

Answer 0 (score: 1)

The error seems to occur in the PySpark read_int function. Its code, as found on the Spark site, is as follows:

def read_int(stream):
    length = stream.read(4)
    if not length:
        raise EOFError
    return struct.unpack("!i", length)[0]

This means that if 0 bytes come back when it tries to read 4 bytes from the stream, the EOFError is raised. The Python documentation is here.
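As a minimal sketch (not Spark's actual worker code, just a standalone reproduction of the same pattern), reading from an exhausted in-memory stream returns an empty bytes object, which is exactly the condition that triggers the EOFError:

import io
import struct

def read_int(stream):
    # same pattern as pyspark/serializers.py: read 4 bytes, raise EOFError if none arrive
    length = stream.read(4)
    if not length:
        raise EOFError
    return struct.unpack("!i", length)[0]

stream = io.BytesIO(struct.pack("!i", 42))
print(read_int(stream))      # 42 - four bytes were available
try:
    read_int(stream)         # the stream is now empty
except EOFError:
    print("EOFError: stream ended before 4 bytes could be read")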

Answer 1 (score: 1)

Have you checked where in your code the EOFError arises?

My guess is that it is coming when you attempt to define df, since that is the only place in your code where the file is actually being read:

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',
     inferschema='true').load('myfile.csv')

Every point after this line works with the variable df rather than with the file itself, so it seems likely that this line is the one generating the error.

A simple way to test whether this is the case is to comment out the rest of the code and/or place a line like this right after the line above:

print(df.count())  # a Spark DataFrame has no len(); count() forces the file to be read

Another way would be to use a try block, for example:

try:
    df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',
        inferschema='true').load('myfile.csv')
except Exception:
    print("Didn't load file into df!")

If it turns out that this line is the one generating the EOFError, then you never get the DataFrame in the first place, so attempting to reduce it won't make any difference.

If that is the line producing the error, two possibilities come to mind:

1) Your code opened one or both of the .csv files earlier and did not close them before this line. If so, simply close them before this point.

2) Something is wrong with the .csv files themselves. Try loading them outside of this code first, using something like csv.reader, to see whether you can get them into memory properly and manipulate them in the way you expect (see the sketch below).
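A minimal sketch of that sanity check, assuming the file path 'myfile.csv' from the question; it only confirms that the file can be opened and parsed row by row with the standard library:

import csv

# sanity-check the file outside of Spark
with open('myfile.csv', 'r') as f:       # path taken from the question
    reader = csv.reader(f)
    header = next(reader)                # first row should be the column names
    row_count = sum(1 for _ in reader)   # iterate the remaining rows

print("columns:", header)
print("data rows:", row_count)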

Answer 2 (score: 0)

I ran into the same problem and never figured out how to debug it. It appeared to cause the executor thread to get stuck and never return anything.