Spark: error when reading a DataFrame

Time: 2016-09-08 13:18:57

Tags: python apache-spark pyspark spark-dataframe

I am trying to read the first rows of a Spark DataFrame that I created as follows:

# read the CSV file into a DataFrame (spark-csv package, Spark 1.x style)
from pyspark.ml.feature import VectorAssembler

datasetDF = sqlContext.read.format('com.databricks.spark.csv').options(delimiter=';', header='true', inferschema='true').load(dataset)

# define an assembler that combines every column except 'value' into one feature vector
ignore = ['value']
vecAssembler = VectorAssembler(inputCols=[x for x in datasetDF.columns if x not in ignore], outputCol="features")

# split into a 20% test set and an 80% training set
seed = 42  # assumed value; the original snippet left seed undefined
(split20DF, split80DF) = datasetDF.randomSplit([1.0, 4.0], seed)
testSetDF = split20DF.cache()
trainingSetDF = split80DF.cache()

print(trainingSetDF.take(5))

However, when I run this code I get the following error (raised by the last line, print(trainingSetDF.take(5))):

: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 3.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 3.0 (TID 7, 192.168.101.102): java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile:
org.codehaus.janino.JaninoRuntimeException:
Code of method "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" of class
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

I should add that this only happens when I have many features (more than 256 columns). What am I doing wrong?
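One detail that may matter here (an observation, not a confirmed fix): the VectorAssembler is defined above but never applied, so randomSplit() still operates on all 256+ raw columns, and the ordering code Spark generates grows with the column count. A sketch of splitting after assembling, so that only two columns remain ("features" plus the assumed label column "value"):

# transform first, then split, so the generated ordering only covers two columns
assembledDF = vecAssembler.transform(datasetDF).select("features", "value")

(split20DF, split80DF) = assembledDF.randomSplit([1.0, 4.0], seed)
testSetDF = split20DF.cache()
trainingSetDF = split80DF.cache()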

Thanks, Florent

1 Answer:

Answer 0 (score: 0)

I found a workaround for the randomSplit() error on wide data, provided you have an ID variable you can use to join the data back together (and if you don't, you can easily create one before splitting and still use the workaround). Make sure you assign the two split halves to fresh variable names (I used train1/valid1) rather than reusing the originals, otherwise you get the same error; I suspect the reused name just points at the same underlying RDD. This may be one of the oddest bugs I have seen, and our data isn't even that wide.

Y            = 'y'
ID_VAR       = 'ID'
DROPS        = [ID_VAR]

# sqlContext is assumed to already exist, as in a Spark shell or notebook
train = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('s3n://emr-related-files/train.csv')
test = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('s3n://emr-related-files/test.csv')
train.show()
print(train.count())

# split only the ID column, then join each half back to the full data;
# join valid1 before reassigning train so both joins see the original rows
(train1, valid1) = train.select(ID_VAR).randomSplit([0.7, 0.3], seed=123)
valid = valid1.join(train, ID_VAR, 'inner')
train = train1.join(train, ID_VAR, 'inner')

train.show()
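If the data has no natural ID column, one can be created before splitting. A minimal sketch, assuming a synthetic ID is acceptable (monotonically_increasing_id yields unique but non-consecutive values):

from pyspark.sql.functions import monotonically_increasing_id

# add a unique ID so the split-on-ID workaround above can be applied
train = train.withColumn(ID_VAR, monotonically_increasing_id())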