Question

I am using Spark 1.6.0 and Scala 2.10.5.

$ spark-shell --packages com.databricks:spark-csv_2.10:1.5.0

import org.apache.spark.sql.SQLContext   
import sqlContext.implicits._    
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val bankSchema = StructType(Array(
  StructField("age", IntegerType, true),
  StructField("job", StringType, true),
  StructField("marital", StringType, true),
  StructField("education", StringType, true),
  StructField("default", StringType, true),
  StructField("balance", IntegerType, true),
  StructField("housing", StringType, true),
  StructField("loan", StringType, true),
  StructField("contact", StringType, true),
  StructField("day", IntegerType, true),
  StructField("month", StringType, true),
  StructField("duration", IntegerType, true),
  StructField("campaign", IntegerType, true),
  StructField("pdays", IntegerType, true),
  StructField("previous", IntegerType, true),
  StructField("poutcome", StringType, true),
  StructField("y", StringType, true)))

val market_details = sqlContext.
  read.
  format("com.databricks.spark.csv").
  option("header", "true").
  schema(bankSchema).
  load("/user/sachnil.2007_gmail/Project1_dataset_bank-full.csv")    
market_details.registerTempTable("phone_table")    
val temp = sqlContext.sql("SELECT * FROM phone_table").show()

The error I get is:

17/05/14 06:11:42 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NumberFormatException: For input string: "58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no""
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:580)
    at java.lang.Integer.parseInt(Integer.java:615)
    at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
    at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
    at com.databricks.spark.csv.util.TypeCast$.castTo(TypeCast.scala:61)
    at com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$2.apply(CsvRelation.scala:121)
    at com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$2.apply(CsvRelation.scala:108)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)

The CSV content looks like this:

"age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"
44;"technician";"single";"secondary";"no";29;"yes";"no";"unknown";5;"may";151;1;-1;0;"unknown";"no"
33;"entrepreneur";"married";"secondary";"no";2;"yes";"yes";"unknown";5;"may";76;1;-1;0;"unknown";"no"
47;"blue-collar";"married";"unknown";"no";1506;"yes";"no";"unknown";5;"may";92;1;-1;0;"unknown";"no"

How can I solve this?

Answer 1

There seem to be two issues here:

CSV delimiter

Your CSV data uses ; as the delimiter, so you should add

.option("delimiter", ";")

to tell Spark to use the correct delimiter:

// Trailing dots let the chained call be pasted into spark-shell line by line.
val market_details = sqlContext.read.
  format("com.databricks.spark.csv").
  option("header", "true").
  option("delimiter", ";").
  schema(bankSchema).
  load("/user/sachnil.2007_gmail/Project1_dataset_bank-full.csv")

From the spark-csv documentation on the CSV format:

delimiter: by default columns are delimited using ,, but the delimiter can be set to any character
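To see why the wrong delimiter produces exactly this NumberFormatException, here is a minimal sketch in plain Scala (no Spark needed). It only imitates the split-then-cast step; the object and value names are made up for illustration and are not part of spark-csv. With the default , delimiter the row is never split, so the whole line lands in the first column (age) and the integer cast fails on it.

```scala
import scala.util.Try

object DelimiterDemo {
  // First data row of the CSV from the question.
  val line: String =
    "58;\"management\";\"married\";\"tertiary\";\"no\";2143;\"yes\";\"no\";\"unknown\";5;\"may\";261;1;-1;0;\"unknown\";\"no\""

  // The row contains no commas, so splitting on the default delimiter
  // leaves the entire row as a single field.
  val byComma: Array[String] = line.split(",")

  // Splitting on the delimiter the file really uses yields one field per column.
  val bySemicolon: Array[String] = line.split(";")

  // Casting the first field to Int, as the integer "age" column requires.
  val ageWithComma: Try[Int] = Try(byComma.head.toInt)
  val ageWithSemicolon: Try[Int] = Try(bySemicolon.head.toInt)
}
```

Here ageWithComma is a Failure wrapping a NumberFormatException whose input string is the full row, which matches the message in the question's log, while ageWithSemicolon succeeds.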

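The input data includes quotes (")

Your input data includes unneeded " characters.
Please remove the " characters from your CSV input file and run again (please see the sample input below):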

age;job;marital;education;default;balance;housing;loan;contact;day;month;duration;campaign;pdays;previous;poutcome;y
58;management;married;tertiary;no;2143;yes;no;unknown;5;may;261;1;-1;0;unknown;no
44;technician;single;secondary;no;29;yes;no;unknown;5;may;151;1;-1;0;unknown;no
33;entrepreneur;married;secondary;no;2;yes;yes;unknown;5;may;76;1;-1;0;unknown;no
47;blue-collar;married;unknown;no;1506;yes;no;unknown;5;may;92;1;-1;0;unknown;no
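If editing the file by hand is impractical, the quotes could be stripped programmatically. The following is a rough sketch in plain Scala; the helper names are hypothetical, and it assumes no field contains an embedded ; or ". Note that the quoted file may well load as is once the delimiter is set (Answer 2 below loads it unchanged on Spark 2.x), so treat this step as optional.

```scala
object QuoteStripping {
  // Removes one pair of surrounding double quotes from a field, if present.
  def unquote(field: String): String =
    if (field.length >= 2 && field.startsWith("\"") && field.endsWith("\""))
      field.substring(1, field.length - 1)
    else field

  // Rewrites one semicolon-delimited line without the surrounding quotes.
  // Assumption: fields never contain an embedded ';' or '"'.
  def stripQuotes(line: String): String =
    line.split(";", -1).map(unquote).mkString(";")
}
```

Applying stripQuotes to every line of the original file would produce input in the shape of the sample above.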

Here you can find spark-sql-csv-examples

The Baby Names example uses the following CSV input (a header followed by sample rows, without quotes):

Year,First Name,County,Sex,Count
2013,GAVIN,ST LAWRENCE,M,9
2013,LEVI,ST LAWRENCE,M,9
2013,LOGAN,NEW YORK,M,44

Answer 2

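Spark 1.6.0 is quite old, and hardly anyone supports it these days (unless as part of commercial support). I strongly recommend upgrading to the latest version at the time of writing, 2.1.1, which gives you many more options.

Let me start with this: your CSV file loads fine in my custom 2.3.0-SNAPSHOT build, so I think you may have run into a feature that the version of spark-csv you are using does not support.

Note that the spark-csv module has been integrated into Spark itself as of Spark 2+ (one of the many reasons to upgrade Spark).

If you insist on using a custom schema (you could let Spark work it out itself with the inferSchema option), at least use the DSL to cut down on keystrokes: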

import org.apache.spark.sql.types._

val bankSchema = StructType(
  $"age".int ::
  $"job".string ::
  $"marital".string ::
  $"education".string ::
  $"default".string ::
  $"balance".int ::
  $"housing".string ::
  $"loan".string ::
  $"contact".string ::
  $"day".int ::
  $"month".string ::
  $"duration".int ::
  $"campaign".int ::
  $"pdays".int ::
  $"previous".int ::
  $"poutcome".string ::
  $"y".string ::
  Nil)

scala> println(bankSchema.treeString)
root
 |-- age: integer (nullable = true)
 |-- job: string (nullable = true)
 |-- marital: string (nullable = true)
 |-- education: string (nullable = true)
 |-- default: string (nullable = true)
 |-- balance: integer (nullable = true)
 |-- housing: string (nullable = true)
 |-- loan: string (nullable = true)
 |-- contact: string (nullable = true)
 |-- day: integer (nullable = true)
 |-- month: string (nullable = true)
 |-- duration: integer (nullable = true)
 |-- campaign: integer (nullable = true)
 |-- pdays: integer (nullable = true)
 |-- previous: integer (nullable = true)
 |-- poutcome: string (nullable = true)
 |-- y: string (nullable = true)
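For intuition about what letting Spark work out the schema means, here is a toy illustration in plain Scala. It is not Spark's inference algorithm (Spark considers more types and has its own rules); the object and function names are invented for this sketch. It simply calls a column integer when every sampled value parses as an Int, and string otherwise.

```scala
import scala.util.Try

object ToyInference {
  // Guesses a type name for one column from its sample values.
  def inferType(values: Seq[String]): String =
    if (values.nonEmpty && values.forall(v => Try(v.trim.toInt).isSuccess)) "integer"
    else "string"

  // Splits a line on the delimiter and drops any double quotes.
  private def fields(line: String, delimiter: Char): List[String] =
    line.split(delimiter).toList.map(_.replace("\"", ""))

  // Returns (columnName, typeName) pairs for a header line and its data rows.
  // Assumption: every row has the same number of fields as the header.
  def inferSchema(header: String, rows: Seq[String], delimiter: Char = ';'): Seq[(String, String)] = {
    val names = fields(header, delimiter)
    val columns = rows.map(fields(_, delimiter)).transpose
    names.zip(columns.map(inferType))
  }
}
```

Run against the sample rows from the question, a check like this classifies age, balance, day, duration, campaign, pdays and previous as integer, in line with the hand-written schema.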

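If you develop Spark applications in Scala, I strongly recommend describing the schema with a case class and leveraging encoders (again, this is Spark 2+ only).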

case class Market(
  age: Int,
  job: String,
  marital: String,
  education: String,
  default: String,
  balance: Int,
  housing: String,
  loan: String,
  contact: String,
  day: Int,
  month: String,
  duration: Int,
  campaign: Int,
  pdays: Int,
  previous: Int,
  poutcome: String,
  y: String)
import org.apache.spark.sql.Encoders
scala> val bankSchema = Encoders.product[Market]
java.lang.UnsupportedOperationException: `default` is a reserved keyword and cannot be used as field name
- root class: "Market"
  at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:611)
  at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:609)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:344)
  at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:609)
  at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:440)
  at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
  at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
  ... 48 elided

(In this particular case that is not possible, because default is a reserved keyword, which you may want to avoid in hand-built schemas too.)
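Note that default is a reserved word in Java but not in Scala, which is presumably why the case class itself compiles and only the encoder complains. A quick way to spot such clashes before writing the case class is sketched below in plain Scala. The keyword set is deliberately abbreviated and is an assumption for illustration, not the list Spark actually checks against.

```scala
object KeywordCheck {
  // A handful of Java reserved words; incomplete, for illustration only.
  val someJavaKeywords: Set[String] = Set(
    "abstract", "boolean", "break", "case", "class", "default", "double",
    "final", "int", "long", "new", "package", "return", "static", "switch", "void")

  // Column names taken from the header of the CSV in the question.
  val columns: Seq[String] = Seq(
    "age", "job", "marital", "education", "default", "balance", "housing",
    "loan", "contact", "day", "month", "duration", "campaign", "pdays",
    "previous", "poutcome", "y")

  // Columns that would need a different name as case class fields.
  val clashes: Seq[String] = columns.filter(someJavaKeywords.contains)
}
```

For this header only the default column is flagged, so renaming that one field (and the matching column) would be the change to consider.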

Once you have the schema, reading the sample you included in the question works without errors:

val marketDetails = spark.
  read.
  schema(bankSchema).
  option("header", true).
  option("delimiter", ";").
  csv("market_details.csv")

scala> marketDetails.show
+---+------------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
|age|         job|marital|education|default|balance|housing|loan|contact|day|month|duration|campaign|pdays|previous|poutcome|  y|
+---+------------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
| 58|  management|married| tertiary|     no|   2143|    yes|  no|unknown|  5|  may|     261|       1|   -1|       0| unknown| no|
| 44|  technician| single|secondary|     no|     29|    yes|  no|unknown|  5|  may|     151|       1|   -1|       0| unknown| no|
| 33|entrepreneur|married|secondary|     no|      2|    yes| yes|unknown|  5|  may|      76|       1|   -1|       0| unknown| no|
| 47| blue-collar|married|  unknown|     no|   1506|    yes|  no|unknown|  5|  may|      92|       1|   -1|       0| unknown| no|
+---+------------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+

What I really like about Spark SQL is that you can stick to pure SQL if that is your preferred "language" in Spark.

val q = """
  CREATE OR REPLACE TEMPORARY VIEW phone_table
  USING csv
  OPTIONS (
    inferSchema true,
    header true,
    delimiter ';',
    path 'market_details.csv')"""

// execute the above query and discard the result
// we're only interested in the side effect of creating a temp view
sql(q).collect

scala> sql("select * from phone_table").show
+---+------------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
|age|         job|marital|education|default|balance|housing|loan|contact|day|month|duration|campaign|pdays|previous|poutcome|  y|
+---+------------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
| 58|  management|married| tertiary|     no|   2143|    yes|  no|unknown|  5|  may|     261|       1|   -1|       0| unknown| no|
| 44|  technician| single|secondary|     no|     29|    yes|  no|unknown|  5|  may|     151|       1|   -1|       0| unknown| no|
| 33|entrepreneur|married|secondary|     no|      2|    yes| yes|unknown|  5|  may|      76|       1|   -1|       0| unknown| no|
| 47| blue-collar|married|  unknown|     no|   1506|    yes|  no|unknown|  5|  may|      92|       1|   -1|       0| unknown| no|
+---+------------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+

PROTIP: With spark-sql you can drop Scala entirely.

Why does reading from CSV fail with NumberFormatException?
