How to replace a string value with a number in a Spark Dataset?

Asked: 2018-04-20 19:51:16

Tags: apache-spark apache-spark-dataset

For example, the input data:

1.0
\N

Schema:

val schema = StructType(Seq(
  StructField("value", DoubleType, false)
))

Reading the Spark Dataset:

val df = spark.read.schema(schema)
    .csv("/path/to/csv/file")

When I use this Dataset, I get an exception because "\N" is not a valid double. How can I replace "\N" with 0.0 in this Dataset? Thanks.

1 Answer:

Answer 0 (score: 0):

If the data is malformed, don't use a schema with an inappropriate type. Define the input as StringType:

val schema = StructType(Seq(
  StructField("value", StringType, false)
))

and cast data later:

val df = spark.read.schema(schema).csv("/path/to/csv/file")
  .withColumn("value", $"value".cast("double"))
  .na.fill(0.0)
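An alternative worth noting: Spark's CSV reader has a `nullValue` option that tells it to parse a given string (here "\N") as null, so the original DoubleType schema can be kept and the nulls filled afterwards. The sketch below assumes a local SparkSession and writes a small temp file matching the question's input; the file path and app name are illustrative only:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("nullValueDemo") // hypothetical app name
  .getOrCreate()

// Sample input matching the question: one valid double and one "\N" marker.
val path = Files.createTempFile("input", ".csv")
Files.write(path, "1.0\n\\N\n".getBytes)

val schema = StructType(Seq(
  StructField("value", DoubleType, false)
))

// nullValue makes the CSV reader treat "\N" cells as null,
// so they no longer break the DoubleType schema; na.fill
// then replaces those nulls with 0.0.
val df = spark.read
  .option("nullValue", "\\N")
  .schema(schema)
  .csv(path.toString)
  .na.fill(0.0)
```

This avoids the intermediate string column, at the cost of only handling the one marker string you name in `nullValue` rather than arbitrary malformed values.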