Question

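I am having trouble doing a simple read of a CSV file with Spark. After reading the file, I want to be sure that:

the data types are correct (according to the schema I provide)
the headers match the schema I provide

This is the code I am using, and where the problem shows up: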

val schema = Encoders.product[T].schema
val df = spark.read
 .schema(schema)
 .option("header", "true")
 .csv(fileName)

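T is a Product, i.e. a case class. This works, but it does not check whether the column names are correct. I can pass in a different file and, as long as the data types match, no error occurs. So if a user supplies the wrong file whose columns happen to have the right types in the right order, I never find out.

I also tried the option that infers the schema and then called .as[T] on the Dataset. However, if any non-String column contains only nulls, Spark interprets it as a String column, while in my schema it is an Integer. That causes a cast exception, even though the column names themselves were checked correctly.

To summarize: I have one solution that ensures correct data types but does not check headers, and another that validates headers but has problems with data types. Is there a way to get both, i.e. full validation of headers and types?

I am using Spark 2.2.0.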

Answer 1

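It looks like you will have to do this yourself by reading the file header twice.

Looking at Spark's code, if the user supplies their own schema, the header in the file is ignored entirely (it is never actually read), so there is no way to make Spark fail on this kind of inconsistency.

To do the comparison yourself: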

val schema = Encoders.product[T].schema

// read the actual schema; This shouldn't be too expensive as Spark's
// laziness would avoid actually reading the entire file 
val fileSchema = spark.read
  .option("header", "true")
  .csv("test.csv").schema

// read the file using your own schema. You can later use this DF
val df = spark.read.schema(schema)
  .option("header", "true")
  .csv("test.csv")

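// compare actual and expected column names, position by position
// (note: zip stops at the shorter side, so a differing column count
// is not detected by this check alone):
val badColumnNames = fileSchema.fields.map(_.name)
  .zip(schema.fields.map(_.name))
  .filter { case (actual, expected) => actual != expected }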

// fail if any inconsistency found:
assert(badColumnNames.isEmpty, 
  s"file schema does not match expected; Bad column names: ${badColumnNames.mkString("; ")}")
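
One possible refinement, offered only as a sketch: a plain zip drops trailing columns when the file and the expected schema have different column counts, so a file with extra or missing columns could pass the name check. The helper below is hypothetical (HeaderCheck and mismatches are illustrative names, not part of Spark); it uses zipAll to pad the shorter side and reports the position of every differing name.

```scala
// Hypothetical helper, not part of Spark: compares header names position by position.
object HeaderCheck {
  // Placeholder reported when one side has fewer columns than the other.
  private val Missing = "<missing>"

  /** Returns (index, actual, expected) for every position where the names differ. */
  def mismatches(actual: Seq[String], expected: Seq[String]): Seq[(Int, String, String)] =
    actual
      .zipAll(expected, Missing, Missing) // pad the shorter side instead of truncating
      .zipWithIndex
      .collect { case ((a, e), i) if a != e => (i, a, e) }
}
```

Assuming the fileSchema and schema values from the snippet above, it could be called as HeaderCheck.mismatches(fileSchema.fieldNames.toSeq, schema.fieldNames.toSeq), and the assert would then check that the result is empty.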
