Spark 2.1.1 DataFrame returns the wrong column when using the select() method

Date: 2017-07-03 06:56:45

Tags: apache-spark

I am creating a DataFrame using Spark's Data Source API, with the following schema:

StructType(Seq(StructField("name", StringType, true),
               StructField("age", IntegerType, true),
               StructField("livesIn", StringType, true),
               StructField("bornIn", StringType, true)))

I am hardcoding the data inside the buildScan() method of PrunedFilteredScan, as shown below:

val schemaFields = schema.fields
// Hardcoded for now. Need to read from Accumulo and plug it in here.
val rec = List("KBN 1000000 Universe Parangipettai", "Sreedhar 38 Mysore Adoni", "Siva 8 Hyderabad Hyderabad",
               "Rishi 23 Blr Hyd", "Ram 45 Chn Hyd", "Abey 12 Del Hyd")

// Reading from Accumulo done. Constructing the RDD for the DataFrame now.
val rdd = sqlContext.sparkContext.parallelize(rec)
rdd.count
val rows = rdd.map(rec => {
  val fields = rec.split(" ")

  // Cast each field to the type declared for its position in the schema.
  val typeCastedValues = fields.zipWithIndex.map {
    case (value, index) =>
      val dataType = schemaFields(index).dataType
      typeCast(value, dataType)
  }
  Row.fromSeq(typeCastedValues)
})
rows

private def typeCast(value: String, toType: DataType) = toType match {
  case _: StringType  => value
  case _: IntegerType => value.toInt
}
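For context, this buildScan() is presumably the PrunedFilteredScan variant, which Spark hands the pruned column list. A minimal sketch of the surrounding relation, with a hypothetical class name and the Accumulo plumbing stubbed out (only the trait and the method signature come from Spark's org.apache.spark.sql.sources API):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.StructType

// Hypothetical relation wrapping the snippet above.
class AccumuloRelation(override val sqlContext: SQLContext,
                       override val schema: StructType)
  extends BaseRelation with PrunedFilteredScan {

  // Spark passes the pruned column list here and expects the returned rows
  // to contain exactly these columns, in this order. Building rows over the
  // full schema regardless of requiredColumns, as the snippet above does,
  // can show up as data appearing under the wrong header after a select().
  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = ???
}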

When I create the DataFrame as follows:

val dfPruned = sqlContext.read.format(dsPackage).load().select("livesIn")
dfPruned.show
dfPruned.printSchema

it gives the data of the name column under the livesIn header. Please help me figure out whether I am missing something, or whether this is a bug in Spark 2.1.1. Output:

+--------+
| livesIn|
+--------+
|     KBN|
|Sreedhar|
|    Siva|
|   Rishi|
|     Ram|
|    Abey|
+--------+

root
 |-- livesIn: string (nullable = true)

2 Answers:

Answer 0 (score: 0)

If you have an RDD that has been converted to Rows, plus a schema, you should create the DataFrame as

sqlContext.createDataFrame(rows, schema)

Then, when you do

val dfPruned = sqlContext.createDataFrame(rows, schema).select("livesIn")
dfPruned.show
dfPruned.printSchema

you should get the following output:

+---------+
|  livesIn|
+---------+
| Universe|
|   Mysore|
|Hyderabad|
|      Blr|
|      Chn|
|      Del|
+---------+

root
 |-- livesIn: string (nullable = true)

Edited:

If you want to use the Data Source API, it is even simpler:

sqlContext.read
  .format("csv")
  .option("delimiter", " ")
  .schema(schema)
  .load("path to your file")
  .select("livesIn")

That should do it.

Note: I am using an input file like the one below.

KBN 1000000 Universe Parangipettai
Sreedhar 38 Mysore Adoni
Siva 8 Hyderabad Hyderabad
Rishi 23 Blr Hyd
Ram 45 Chn Hyd
Abey 12 Del Hyd
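For completeness, a minimal end-to-end sketch of this approach (the file name people.txt and the local master are assumptions; the schema is the one from the question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("csv-pruning-example")
  .master("local[*]") // assumption: running locally
  .getOrCreate()

val schema = StructType(Seq(StructField("name", StringType, true),
                            StructField("age", IntegerType, true),
                            StructField("livesIn", StringType, true),
                            StructField("bornIn", StringType, true)))

// "people.txt" is a hypothetical path holding the six sample lines above.
spark.read
  .format("csv")
  .option("delimiter", " ")
  .schema(schema)
  .load("people.txt")
  .select("livesIn")
  .show()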

Answer 1 (score: 0)

If you are trying to apply a schema to an RDD, you can use the createDataFrame function as shown below.

// Create a Row from each record by splitting on " ".
val rows = rdd.map(value => {
  val data = value.split(" ")
  // You could use Row.fromSeq(data), but the second field needs to be
  // an Int, so convert it explicitly.
  Row(data(0), data(1).toInt, data(2), data(3))
})

// Create a DataFrame from the rows and the schema.
// Note: createDataFrame lives on SQLContext (or SparkSession), not SparkContext.
val df = sqlContext.createDataFrame(rows, schema)

// Select only the livesIn column.
df.select("livesIn").show

Output:

+---------+
|  livesIn|
+---------+
| Universe|
|   Mysore|
|Hyderabad|
|      Blr|
|      Chn|
|      Del|
+---------+ 
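As the comment in the snippet above notes, Row.fromSeq(data) alone would leave every field a String. A generic variant that casts each value by its declared schema type, along the lines of the question's typeCast helper, might look like this (assuming, as in the question, that only StringType and IntegerType occur):

val rows = rdd.map { line =>
  val data = line.split(" ")
  // Pair each raw value with its schema field and cast by declared type.
  val converted = data.zip(schema.fields).map {
    case (value, field) => field.dataType match {
      case IntegerType => value.toInt
      case _           => value
    }
  }
  Row.fromSeq(converted)
}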

Hope this helps!