Custom reading in Spark

Date: 2016-07-11 18:42:39

Tags: scala csv apache-spark

I have the following data in a CSV file (in reality my data is larger, but this is a good simplification):

ColumnA,ColumnB
1,X
5,G
9,F

I read it as follows, where url is the location of the file:

val rawData = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(url)

To read it, I am using https://github.com/databricks/spark-csv

Then I apply a map:

val formattedData = rawData.map(me => me("ColumnA") match {
    //some other code
  })

However, when I reference a column like this: me("ColumnA"), I get a type mismatch:

Type mismatch, expected: Int, actual: String

Why does this happen? Shouldn't each row of rawData be a map?

1 Answer:

Answer 0 (score: 2)

When you reference a particular column in a DataFrame's Row, you have a couple of options. If you use the apply method (round brackets), you must pass the integer index of the column — that is why me("ColumnA") produces a type mismatch: apply expects an Int, not a String. If you want to look a column up by name, you need to use Row's getAs[T] method instead.

So you can use:

me(0)

or

me.getAs[Int]("ColumnA")

Hope this helps.
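As a concrete illustration, here is a minimal sketch of the two lookup styles, using a hand-built Row in place of an actual row of rawData (an assumption for the sake of a self-contained example; in the real DataFrame each Row also carries the CSV's inferred schema, which is what makes lookup by column name possible):

```scala
import org.apache.spark.sql.Row

// Stand-in for one row of rawData, e.g. the CSV line "1,X".
val me: Row = Row(1, "X")

// apply (round brackets) takes a positional Int index and returns Any:
val byIndex: Any = me(0) // the value 1, statically typed as Any

// getAs[T] takes an index (or, when the Row has a schema attached,
// a column name such as "ColumnA") and returns a value typed as T:
val byType: Int = me.getAs[Int](0) // the value 1, typed as Int
```

Note that with apply you still get an Any back and usually need a cast before pattern matching on concrete values, whereas getAs[Int] gives you the Int directly.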