Convert CSV to RDD and read it with Spark / Scala

Date: 2017-03-29 10:15:30

Tags: scala csv apache-spark rdd

I am trying to read a file (CSV) and print its schema. My problem is that my file has no header, so I cannot query it like SQL. I tried this code:

val logFile = "../resouces/cells.csv"

val dfCells = spark.read
 .format("csv")
 .option("header", "false")
 .option("mode", "DROPMALFORMED")
 .option("delimiter", "|")
 .csv(logFile)

dfCell.printSchema;

The input file is:

ES|15032017|25100|54600||3G|FIBRE|OUTDOOR|COMPANY|MAST|MACRO||47001|DU|41.651834|-4.728534||||||||||||||||
ES|15032017|25101|54601||3G|FIBRE|OUTDOOR|COMPANY|ROOFTOP|MACRO||47001|DU|41.651994|-4.724693||||||||||||||||
ES|15032017|25102|54602||4G|FIBRE|OUTDOOR|COMPANY|ROOFTOP|MICRO||47001|U|41.650912|-4.720648||||||||||||||||
ES|15032017|25103|54603||3G|MICROWAVES|OUTDOOR|COMPANY|ROOFTOP|MACRO||47001|U|41.647312|-4.717118||||||||||||||||

The output is:

|
|
|
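
Since the title mentions converting the CSV to an RDD, a minimal sketch of reading the same pipe-delimited file as an RDD is shown below (the variable name cellsRdd is illustrative; the -1 split limit keeps trailing empty fields so every row has the same arity):

// Read the raw lines as an RDD and split each line on the pipe delimiter.
val cellsRdd = spark.sparkContext
  .textFile(logFile)
  .map(_.split("\\|", -1))

// Quick sanity check on the first rows.
cellsRdd.take(2).foreach(fields => println(fields.mkString(", ")))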

2 answers:

Answer 0: (score: 2)

It looks like you have a typo. Use dfCells.printSchema.
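
For completeness, the corrected call on the question's code would be (a minimal sketch; everything else stays the same):

// The DataFrame was defined as dfCells, so the schema call must use that name.
dfCells.printSchema()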

Answer 1: (score: 0)

I am using Spark 1.5.0, with the load function instead of csv:

val logFile = "../input.csv"

val dfCells = sqlContext.read
                        .format("csv")
                        .option("header", "false")
                        .option("mode", "DROPMALFORMED")
                        .option("delimiter", "|")
                        .load(logFile)

dfCells.show()
+---+--------+-----+-----+---+---+----------+-------+-------+-------+-----+---+-----+---+---------+---------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| C0|      C1|   C2|   C3| C4| C5|        C6|     C7|     C8|     C9|  C10|C11|  C12|C13|      C14|      C15|C16|C17|C18|C19|C20|C21|C22|C23|C24|C25|C26|C27|C28|C29|C30|C31|
+---+--------+-----+-----+---+---+----------+-------+-------+-------+-----+---+-----+---+---------+---------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| ES|15032017|25100|54600|   | 3G|     FIBRE|OUTDOOR|COMPANY|   MAST|MACRO|   |47001| DU|41.651834|-4.728534|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
| ES|15032017|25101|54601|   | 3G|     FIBRE|OUTDOOR|COMPANY|ROOFTOP|MACRO|   |47001| DU|41.651994|-4.724693|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
| ES|15032017|25102|54602|   | 4G|     FIBRE|OUTDOOR|COMPANY|ROOFTOP|MICRO|   |47001|  U|41.650912|-4.720648|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
| ES|15032017|25103|54603|   | 3G|MICROWAVES|OUTDOOR|COMPANY|ROOFTOP|MACRO|   |47001|  U|41.647312|-4.717118|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
+---+--------+-----+-----+---+---+----------+-------+-------+-------+-----+---+-----+---+---------+---------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

and the schema is:

dfCells.printSchema()
root
 |-- C0: string (nullable = true)
 |-- C1: string (nullable = true)
 |-- C2: string (nullable = true)
 |-- C3: string (nullable = true)
 |-- C4: string (nullable = true)
 |-- C5: string (nullable = true)
 |-- C6: string (nullable = true)
 |-- C7: string (nullable = true)
 |-- C8: string (nullable = true)
 |-- C9: string (nullable = true)
 |-- C10: string (nullable = true)
 |-- C11: string (nullable = true)
 |-- C12: string (nullable = true)
 |-- C13: string (nullable = true)
 |-- C14: string (nullable = true)
 |-- C15: string (nullable = true)
 |-- C16: string (nullable = true)
 |-- C17: string (nullable = true)
 |-- C18: string (nullable = true)
 |-- C19: string (nullable = true)
 |-- C20: string (nullable = true)
 |-- C21: string (nullable = true)
 |-- C22: string (nullable = true)
 |-- C23: string (nullable = true)
 |-- C24: string (nullable = true)
 |-- C25: string (nullable = true)
 |-- C26: string (nullable = true)
 |-- C27: string (nullable = true)
 |-- C28: string (nullable = true)
 |-- C29: string (nullable = true)
 |-- C30: string (nullable = true)
 |-- C31: string (nullable = true)
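
As a follow-up for the original "query it like SQL" concern, here is a small usage sketch under the same Spark 1.5 API (the table name cells and the selected column are illustrative; in Spark 2.x registerTempTable becomes createOrReplaceTempView):

// The auto-generated column names (C0, C1, ...) can be used directly in SQL.
dfCells.registerTempTable("cells")

// Example: count rows per technology, which is column C5 in the sample data.
sqlContext.sql("SELECT C5 AS technology, COUNT(*) AS cnt FROM cells GROUP BY C5").show()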