Spark Dataframe - cannot resolve... given

Date: 2017-08-21 13:51:19

Tags: scala csv apache-spark apache-spark-sql

I am trying to create a DataFrame in Spark 1.6.0, using this command:

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header","true")
  .option("delimiter",",")
  .option("inferSchema","true")
  .load("/user/rohitchopra32_gmail/Project1_dataset_bank-full(2).csv")

It creates a DataFrame, but when I run the df.show() command it displays incomplete, unformatted data (screenshot: error). And when I try to select a column with val selectedData = df.select("age"), the command throws an error (screenshot: selected data error).

Link to my dataset: data set

I am new to Spark and I don't know what is causing this error. Am I missing something?

1 Answer:

Answer 0 (score: 2)

As I said in the comments, your CSV file is not formatted the way your reader options assume: the fields are actually separated by ";" rather than ",", and the string fields carry stray double quotes.
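That combination is exactly what produces the symptoms you saw. With "," as the delimiter, the semicolon-separated rows are never split, so the entire header line becomes a single column name and select("age") has nothing to resolve. A quick diagnostic sketch (the expected count is an assumption based on the file's format, not output captured from your session):

scala> val raw = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("delimiter",",").load("/user/rohitchopra32_gmail/Project1_dataset_bank-full(2).csv")
scala> raw.columns.length // expect 1 rather than 17: "age" is not among the parsed columns, hence "cannot resolve"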

So let's rewrite the file without the quotes and then parse it:

scala> val filePath = "/user/rohitchopra32_gmail/Project1_dataset_bank-full(2).csv" // the path from the question; the original answer left filePath undefined
scala> sc.textFile(filePath).map(x => x.replaceAll("\"", "")).saveAsTextFile("./Downloads/clean_data")
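Two notes on this step. First, saveAsTextFile writes a directory of part files rather than a single CSV; the spark-csv load below accepts that directory path directly. Second, stripping every double quote is only safe because no field in this dataset legitimately contains a quote or an embedded ";". If the quoting were well-formed, you could skip the rewrite entirely and rely on the parser's own quote handling (a sketch only; spark-csv's "quote" option already defaults to the double quote):

scala> sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("delimiter",";").option("quote","\"").option("inferSchema","true").load(filePath)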

Now that we've removed the trailing double quotes that were causing us trouble, we can load the CSV with the same line of code you had, except that the delimiter option must be ";" rather than ",":

scala> sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("delimiter",";").option("inferSchema","true").load("./Downloads/clean_data").show
+---+-------------+--------+---------+-------+-------+-------+----+--------+---+-----+--------+--------+-----+--------+--------+---+
|age|          job| marital|education|default|balance|housing|loan| contact|day|month|duration|campaign|pdays|previous|poutcome|  y|
+---+-------------+--------+---------+-------+-------+-------+----+--------+---+-----+--------+--------+-----+--------+--------+---+
| 53|      unknown| married|  unknown|     no|      0|     no|  no|cellular| 25|  aug|     209|       5|   -1|       0| unknown| no|
| 51|   technician| married| tertiary|     no|     -3|     no|  no|cellular| 25|  aug|      91|       9|   -1|       0| unknown| no|
| 33|   technician|  single|secondary|     no|    -32|     no|  no|cellular| 25|  aug|     196|      12|   -1|       0| unknown| no|
| 48|   management|divorced| tertiary|     no|      0|     no|  no|cellular| 25|  aug|     110|       3|   -1|       0| unknown| no|
| 60|      retired| married|  primary|     no|    155|     no|  no|cellular| 25|  aug|     115|       7|   -1|       0| unknown| no|
| 50|   management|divorced| tertiary|     no|      0|     no|  no|cellular| 25|  aug|      57|       3|   -1|       0| unknown| no|
| 59|  blue-collar| married|  primary|     no|   6271|    yes|  no|cellular| 25|  aug|     102|       5|   -1|       0| unknown| no|
| 33|   technician|  single| tertiary|     no|    137|     no|  no|cellular| 25|  aug|      88|       4|   -1|       0| unknown| no|
| 37|self-employed| married|secondary|     no|    119|     no|  no|cellular| 25|  aug|      68|       4|   -1|       0| unknown| no|
| 45|  blue-collar| married|  primary|     no|    185|     no|  no|cellular| 25|  aug|      78|       4|   -1|       0| unknown| no|
| 47|   management| married|secondary|     no|   1083|     no|  no|cellular| 25|  aug|     141|       4|   -1|       0| unknown| no|
| 41|   technician| married|secondary|     no|   2039|     no|  no|cellular| 25|  aug|     160|       4|   -1|       0| unknown| no|
| 52|   management| married| tertiary|     no|    967|     no|  no|cellular| 25|  aug|     472|      10|   -1|       0| unknown| no|
| 35|   technician|  single| tertiary|     no|    275|    yes|  no|cellular| 25|  aug|      63|       5|   -1|       0| unknown| no|
| 34|   technician| married|secondary|     no|     47|     no|  no|cellular| 25|  aug|     132|       6|   -1|       0| unknown| no|
| 36|   management| married| tertiary|     no|   1235|     no|  no|cellular| 25|  aug|      85|       6|   -1|       0| unknown| no|
| 32|   technician| married|secondary|    yes|      4|    yes| yes|cellular| 25|  aug|     132|       8|   -1|       0| unknown| no|
| 36|   management| married| tertiary|     no|   3874|     no|  no|cellular| 25|  aug|     425|       6|   -1|       0| unknown| no|
| 58|  blue-collar| married|  unknown|     no|      9|     no|  no|cellular| 25|  aug|      50|      23|   -1|       0| unknown| no|
| 43|   technician| married|secondary|     no|    136|     no|  no|cellular| 25|  aug|     363|       7|   -1|       0| unknown|yes|
+---+-------------+--------+---------+-------+-------+-------+----+--------+---+-----+--------+--------+-----+--------+--------+---+
only showing top 20 rows
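With the cleaned data loaded, the select that originally failed should now resolve (a short follow-up sketch; output omitted):

scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("delimiter",";").option("inferSchema","true").load("./Downloads/clean_data")
scala> val selectedData = df.select("age")
scala> selectedData.show(5) // prints the first five ages instead of throwing "cannot resolve 'age'"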