I am trying to create a DataFrame in Spark 1.6.0. I used this command to create it:
val df = sqlContext.read.format("com.databricks.spark.csv")
.option("header","true")
.option("delimiter",",")
.option("inferSchema","true")
.load("/user/rohitchopra32_gmail/Project1_dataset_bank-full(2).csv")
It creates a DataFrame, but when I run df.show() it displays incomplete and unformatted data.
When I try to select data with val selectedData = df.select("age"), it shows an error.
Link to my dataset: data set
I am new to Spark and I don't know what is causing this error. Am I missing something?
Answer 0 (score: 2)
As I said in the comments, your CSV file is malformed, so let's rewrite it and then parse it:
scala> sc.textFile(filePath).map(x => x.replaceAll("\"", "")).saveAsTextFile("./Downloads/clean_data")
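To see what that cleaning step does to a single line, here is a minimal sketch in plain Scala with no Spark dependency. The object name `CleanCsvLine`, its helper functions, and the sample line are hypothetical and only for illustration; the sample is not taken from the actual dataset.

```scala
object CleanCsvLine {
  // Remove every double-quote character from the line,
  // mirroring x.replaceAll("\"", "") in the command above.
  def stripQuotes(line: String): String = line.replaceAll("\"", "")

  // Strip the quotes, then split on the semicolon delimiter.
  // The limit of -1 keeps trailing empty fields instead of dropping them.
  def fields(line: String): Array[String] = stripQuotes(line).split(";", -1)

  def main(args: Array[String]): Unit = {
    // Hypothetical malformed line with stray double quotes.
    val raw = "\"58;\"\"management\"\";\"\"married\"\"\""
    println(fields(raw).mkString("|")) // prints 58|management|married
  }
}
```

Note that this removes all double quotes, not only trailing ones, so it is only appropriate when no field legitimately contains a quoted delimiter.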
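Now that we have removed the trailing double quotes that were causing trouble, we can load the CSV with the line of code you already have (note that the delimiter here is ";", not ","):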
scala> sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("delimiter",";").option("inferSchema","true").load("./Downloads/clean_data").show
+---+-------------+--------+---------+-------+-------+-------+----+--------+---+-----+--------+--------+-----+--------+--------+---+
|age|          job| marital|education|default|balance|housing|loan| contact|day|month|duration|campaign|pdays|previous|poutcome|  y|
+---+-------------+--------+---------+-------+-------+-------+----+--------+---+-----+--------+--------+-----+--------+--------+---+
| 53|      unknown| married|  unknown|     no|      0|     no|  no|cellular| 25|  aug|     209|       5|   -1|       0| unknown| no|
| 51|   technician| married| tertiary|     no|     -3|     no|  no|cellular| 25|  aug|      91|       9|   -1|       0| unknown| no|
| 33|   technician|  single|secondary|     no|    -32|     no|  no|cellular| 25|  aug|     196|      12|   -1|       0| unknown| no|
| 48|   management|divorced| tertiary|     no|      0|     no|  no|cellular| 25|  aug|     110|       3|   -1|       0| unknown| no|
| 60|      retired| married|  primary|     no|    155|     no|  no|cellular| 25|  aug|     115|       7|   -1|       0| unknown| no|
| 50|   management|divorced| tertiary|     no|      0|     no|  no|cellular| 25|  aug|      57|       3|   -1|       0| unknown| no|
| 59|  blue-collar| married|  primary|     no|   6271|    yes|  no|cellular| 25|  aug|     102|       5|   -1|       0| unknown| no|
| 33|   technician|  single| tertiary|     no|    137|     no|  no|cellular| 25|  aug|      88|       4|   -1|       0| unknown| no|
| 37|self-employed| married|secondary|     no|    119|     no|  no|cellular| 25|  aug|      68|       4|   -1|       0| unknown| no|
| 45|  blue-collar| married|  primary|     no|    185|     no|  no|cellular| 25|  aug|      78|       4|   -1|       0| unknown| no|
| 47|   management| married|secondary|     no|   1083|     no|  no|cellular| 25|  aug|     141|       4|   -1|       0| unknown| no|
| 41|   technician| married|secondary|     no|   2039|     no|  no|cellular| 25|  aug|     160|       4|   -1|       0| unknown| no|
| 52|   management| married| tertiary|     no|    967|     no|  no|cellular| 25|  aug|     472|      10|   -1|       0| unknown| no|
| 35|   technician|  single| tertiary|     no|    275|    yes|  no|cellular| 25|  aug|      63|       5|   -1|       0| unknown| no|
| 34|   technician| married|secondary|     no|     47|     no|  no|cellular| 25|  aug|     132|       6|   -1|       0| unknown| no|
| 36|   management| married| tertiary|     no|   1235|     no|  no|cellular| 25|  aug|      85|       6|   -1|       0| unknown| no|
| 32|   technician| married|secondary|    yes|      4|    yes| yes|cellular| 25|  aug|     132|       8|   -1|       0| unknown| no|
| 36|   management| married| tertiary|     no|   3874|     no|  no|cellular| 25|  aug|     425|       6|   -1|       0| unknown| no|
| 58|  blue-collar| married|  unknown|     no|      9|     no|  no|cellular| 25|  aug|      50|      23|   -1|       0| unknown| no|
| 43|   technician| married|secondary|     no|    136|     no|  no|cellular| 25|  aug|     363|       7|   -1|       0| unknown|yes|
+---+-------------+--------+---------+-------+-------+-------+----+--------+---+-----+--------+--------+-----+--------+--------+---+
only showing top 20 rows
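The load above uses ";" as the delimiter, whereas the command in the question used ",". If you are unsure which delimiter a file uses, one quick check is to count candidate characters in its header line. The following is a minimal sketch in plain Scala; the object name `GuessDelimiter`, the candidate list, and the header string (built by joining the column names shown above with ";") are assumptions for illustration, not part of spark-csv.

```scala
object GuessDelimiter {
  // Candidate delimiters to try; this list is an assumption.
  val candidates: Seq[Char] = Seq(',', ';', '\t', '|')

  // Return the candidate that occurs most often in the header line.
  // On a tie, the earliest candidate in the list wins.
  def guess(header: String): Char =
    candidates.maxBy(c => header.count(_ == c))

  def main(args: Array[String]): Unit = {
    // Assumed header line, reconstructed from the column names in the output above.
    val header = "age;job;marital;education;default;balance;housing;loan;" +
      "contact;day;month;duration;campaign;pdays;previous;poutcome;y"
    println(guess(header)) // prints ;
  }
}
```

The result can then be passed to .option("delimiter", ...) when loading the file. This heuristic looks only at the header, so it can be fooled by column names that themselves contain a candidate character.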