Handling null values when using a custom schema in Apache Spark

Date: 2016-11-04 13:15:50

Tags: apache-spark

I am importing data using a custom schema, which I have defined as follows:

import org.apache.spark.sql.types.{StructType, StructField, DoubleType, StringType}

val customSchema_train = StructType(Array(
  StructField("x53", DoubleType, true),
  StructField("x95", DoubleType, true),
  StructField("x88", DoubleType, true),
  StructField("x30", DoubleType, true),
  StructField("x42", DoubleType, true),
  StructField("x28", DoubleType, true)
))

val train_orig = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(customSchema_train)
  .option("nullValue", "null")
  .load("/....../train.csv")
  .cache

Now, I know my data contains null values represented as the string "null", and I tried to handle that accordingly with the nullValue option. The import completes without any errors, but when I try to describe the data I get this error:

train_df.describe().show
SparkException: Job aborted due to stage failure: Task 0 in stage 46.0 failed 1 times, most recent failure: Lost task 0.0 in stage 46.0 (TID 56, localhost): java.text.ParseException: Unparseable number: "null"

How can I handle this error?
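For context, one workaround sometimes used in this situation (a sketch based on my own assumption, not taken from the post): read the affected columns as StringType first, then cast them to DoubleType afterwards. In Spark SQL, casting a string that cannot be parsed as a number (such as the literal "null") yields a SQL NULL instead of throwing a ParseException.

```scala
// Sketch of a workaround (assumed, not from the original post):
// read everything as strings, then cast to Double. Unparseable
// values such as "null" become SQL NULL instead of failing.
import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType}

val stringSchema = StructType(Array(
  StructField("x53", StringType, true),
  StructField("x95", StringType, true),
  StructField("x88", StringType, true),
  StructField("x30", StringType, true),
  StructField("x42", StringType, true),
  StructField("x28", StringType, true)
))

val raw = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(stringSchema)
  .load("/....../train.csv")

// Cast each column to Double; any cell that does not parse becomes null.
val train_orig = raw.columns.foldLeft(raw) { (df, c) =>
  df.withColumn(c, df(c).cast(DoubleType))
}.cache
```

It may also be worth checking that the nullValue token matches the file's cells exactly; if the CSV quotes its fields, a cell might contain the quotes as part of the value.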

0 Answers:

There are no answers yet.