Question

I am trying to read data from a CSV file using Scala and Spark, but the column values come back as null.

I am reading the data from a CSV file, and I supply a schema so that the data is easy to query.

private val myData= sparkSession.read.schema(createDataSchema).csv("data/myData.csv")

import org.apache.spark.sql.types._

def createDataSchema: StructType = {
    val schema = StructType(
      Array(
        StructField("data_index",StringType, nullable = false),
        StructField("property_a",IntegerType, nullable = false),
        StructField("property_b",IntegerType, nullable = false),
        //some other columns
     )
   )

   schema
}

Querying the data:

val myProperty= myData.select($"property_b")
myProperty.collect()

I expected the data to come back as a list of values, but it comes back as a list containing nulls (every value is null). Why?

When I print the schema, nullable is set to true instead of false.

I am using Scala 2.12.9 and Spark 2.4.3.

Answer 1

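Even if you supply a schema with nullable = false when loading data from a CSV file, Spark still overrides the schema to nullable = true, so that null pointer exceptions can be avoided while the data is loaded.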

Let's take an example. Suppose the CSV file has two rows, and one column value in the second row is empty or null.

CSV:
a,1,2
b,,2

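If nullable = false were kept, a null pointer exception would be thrown as soon as an action is called on the DataFrame: the data being loaded contains a null or empty value, and there is no default value to put in its place. To avoid this, Spark overrides the schema to nullable = true.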
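The behaviour described above can be reproduced with a small sketch. It assumes a local SparkSession and writes the two-row sample to a temporary file; the object name NullableOverrideSketch and the helper readSample are made up for illustration and are not part of the question's code.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object NullableOverrideSketch {
  // Schema from the question, deliberately declared as non-nullable.
  val schema: StructType = StructType(Array(
    StructField("data_index", StringType, nullable = false),
    StructField("property_a", IntegerType, nullable = false),
    StructField("property_b", IntegerType, nullable = false)
  ))

  // Hypothetical helper: writes the sample rows to a temp file and reads them back.
  def readSample(spark: SparkSession): DataFrame = {
    val path = Files.createTempFile("myData", ".csv")
    Files.write(path, "a,1,2\nb,,2\n".getBytes("UTF-8"))
    spark.read.schema(schema).csv(path.toString)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")          // assumption: local mode is enough for the demo
      .appName("nullable-override-sketch")
      .getOrCreate()

    val df = readSample(spark)
    df.printSchema()               // compare the nullable flags with the declared schema
    df.show()                      // the empty property_a field in row 2 is read as null

    spark.stop()
  }
}
```

Comparing the printed schema with the declared one makes the override visible without touching the real data file.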

However, you can work around this by replacing every null with a default value and then re-applying the schema.

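import org.apache.spark.sql.functions.{col, when}

val df = spark.read.schema(schema).csv("data/myData.csv")
// Replace nulls in property_a with a default of 0
val dfWithDefault = df.withColumn("property_a", when(col("property_a").isNull, 0).otherwise(col("property_a")))
// Rebuild the DataFrame from its RDD so the original (non-nullable) schema is applied
val dfNullableFalse = spark.sqlContext.createDataFrame(dfWithDefault.rdd, schema)
dfNullableFalse.show(10)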

df.printSchema()
root
|-- data_index: string (nullable = true)
|-- property_a: integer (nullable = true)
|-- property_b: integer (nullable = true)

dfNullableFalse.printSchema()
root
|-- data_index: string (nullable = false)
|-- property_a: integer (nullable = false)
|-- property_b: integer (nullable = false)
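When several integer columns may contain nulls, the same idea can be sketched more generally with DataFrame.na.fill instead of one when/otherwise per column. The helper fillAndReapply and its default parameter are hypothetical names chosen for this example. Note that it only fills IntegerType columns, so a null left in any other non-nullable column (for example data_index) would still fail at runtime once the schema is re-applied.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructType}

object ReapplySchemaSketch {
  // Hypothetical helper: fill nulls in every IntegerType column of `schema`
  // with `default`, then rebuild the DataFrame so the schema's nullable flags apply.
  def fillAndReapply(spark: SparkSession,
                     df: DataFrame,
                     schema: StructType,
                     default: Int = 0): DataFrame = {
    // Names of all integer columns declared in the target schema
    val intCols: Array[String] =
      schema.fields.filter(_.dataType == IntegerType).map(_.name)

    // na.fill replaces nulls only in the listed columns
    val filled = df.na.fill(default.toLong, intCols)

    // Going through the RDD lets Spark accept the schema exactly as declared
    spark.createDataFrame(filled.rdd, schema)
  }
}
```

With the names from the answer, a call would look like ReapplySchemaSketch.fillAndReapply(spark, df, schema), followed by printSchema() on the result to check the nullable flags.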
