Question

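I have a set of CSVs generated by Sqoop from a MySQL database. I am trying to define them as the source for a DataFrame in Spark.

The schema in the source database contains several fields with a Long data type, and those fields actually store very large numbers.

When I try to access the DataFrame, Scala fails to interpret these values because the long integers have no L suffix.

For example, this throws an error: val test: Long = 20130102180600

While this succeeds: val test: Long = 20130102180600L

Is there a way to force Scala to interpret these as Long integers without that suffix? Given the scale of the data, I don't think post-processing the fields is feasible.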

Answer 1

Provide the schema explicitly, as in this example from the README:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val sqlContext = new SQLContext(sc)
val customSchema = StructType(Array(
    StructField("year", IntegerType, true),
    StructField("make", StringType, true),
    StructField("model", StringType, true),
    StructField("comment", StringType, true),
    StructField("blank", StringType, true)))

val df = sqlContext.load(
    "com.databricks.spark.csv",
    schema = customSchema,
    Map("path" -> "cars.csv", "header" -> "true"))

val selectedData = df.select("year", "model")
selectedData.save("newcars.csv", "com.databricks.spark.csv")

Except, of course, use LongType for the big integer fields.
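As a minimal sketch of that change, the schema below declares a big integer column as LongType. The column names event_ts and comment are hypothetical placeholders, not taken from the asker's data, and the snippet assumes the Spark SQL types package is on the classpath.

```scala
import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}

// Hypothetical columns: "event_ts" stands in for a field holding values
// such as 20130102180600, which do not fit in an Int.
val longSchema = StructType(Array(
    StructField("event_ts", LongType, true),
    StructField("comment", StringType, true)))
```

This schema would then be passed as the schema argument in place of customSchema in the load call shown above.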

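Looking at the code, this definitely looks like it should work: fields are converted from String to the desired type using TypeCast.castTo, and for LongType, TypeCast.castTo just calls datum.toLong, which works as expected (you can check "20130102180600".toLong in the Scala REPL). In fact, InferSchema handles this case as well. I strongly suspect the problem is something else: perhaps the numbers are out of range even for Long?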
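The REPL check mentioned above can be sketched in plain Scala, with no Spark involved. It illustrates that the L suffix only matters for literals in source code, while parsing digits from a String at run time needs no suffix. The oversized value in the last step is made up purely to show what an out-of-range number would do.

```scala
import scala.util.Try

// An unsuffixed literal this large is typed as Int and rejected by the compiler:
//   val test: Long = 20130102180600   // does not compile
// Parsing the same digits from a String needs no suffix.
val parsed: Long = "20130102180600".toLong

// The value is too large for Int, which is why the literal form needs the L.
val fitsInInt: Boolean = parsed <= Int.MaxValue

// A made-up value larger than Long.MaxValue fails to parse
// with a NumberFormatException, captured here as a Failure.
val tooBig: Try[Long] = Try("92233720368547758070".toLong)
```

If the CSV fields parse like the first case, a LongType column should load without trouble; if they behave like the last case, the values do not fit in a Long at all.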

(I haven't actually tried this, but I expect it to work; if it doesn't, you should report a bug. Read https://stackoverflow.com/help/mcve first.)

How do I get Spark SQL to import a Long without the "L" suffix?
