I am creating a DataFrame using Spark's Data Source API, with the following schema:
StructType(Seq(
  StructField("name", StringType, true),
  StructField("age", IntegerType, true),
  StructField("livesIn", StringType, true),
  StructField("bornIn", StringType, true)))
For now I am hardcoding the data in the buildScan() method of PrunedFilteredScan, as shown below:
val schemaFields = schema.fields
// hardcoded for now. Need to read from Accumulo and plug it in here
val rec = List("KBN 1000000 Universe Parangipettai", "Sreedhar 38 Mysore Adoni",
  "Siva 8 Hyderabad Hyderabad", "Rishi 23 Blr Hyd", "Ram 45 Chn Hyd", "Abey 12 Del Hyd")
// Reading from Accumulo done. Constructing the RDD now for the DataFrame.
val rdd = sqlContext.sparkContext.parallelize(rec)
val rows = rdd.map { rec =>
  val fields = rec.split(" ")
  val typeCastedValues = fields.zipWithIndex.map {
    case (value, index) =>
      val dataType = schemaFields(index).dataType
      typeCast(value, dataType)
  }
  Row.fromSeq(typeCastedValues)
}
rows
}
// Covers only the two column types present in the schema above.
private def typeCast(value: String, toType: DataType) = toType match {
  case _: StringType  => value
  case _: IntegerType => value.toInt
}
When I create the DataFrame as shown below:
val dfPruned = sqlContext.read.format(dsPackage).load().select("livesIn")
dfPruned.show
dfPruned.printSchema
it shows the data of the name column under the livesIn header. Please help me figure out whether I am missing something, or whether this is a bug in Spark 2.1.1. (A likely cause is sketched after the output below.)
Output
+--------+
| livesIn|
+--------+
| KBN|
|Sreedhar|
| Siva|
| Rishi|
| Ram|
| Abey|
+--------+
root
|-- livesIn: string (nullable = true)
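Why this likely happens: with PrunedFilteredScan, Spark passes the pruned column list into buildScan(requiredColumns, filters) and reads the returned Rows positionally. If buildScan always emits all four fields in schema order, a pruned select("livesIn") picks up the first field, which is name. Below is a minimal sketch of a buildScan that honors pruning; it is not the asker's actual code — the Accumulo read is still stubbed with the hardcoded records, and the enclosing relation class is assumed to expose schema, sqlContext, and the typeCast helper from the question:

override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
  val schemaFields = schema.fields
  // map each column name to its position in the full schema
  val indexOf = schemaFields.map(_.name).zipWithIndex.toMap
  val rec = List("KBN 1000000 Universe Parangipettai", "Sreedhar 38 Mysore Adoni",
    "Siva 8 Hyderabad Hyderabad", "Rishi 23 Blr Hyd", "Ram 45 Chn Hyd", "Abey 12 Del Hyd")
  sqlContext.sparkContext.parallelize(rec).map { line =>
    val fields = line.split(" ")
    // emit only the requested columns, in the order Spark asked for them
    Row.fromSeq(requiredColumns.map { col =>
      val i = indexOf(col)
      typeCast(fields(i), schemaFields(i).dataType)
    })
  }
}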
Answer 0 (Score: 0)
If you have the schema of the DataFrame and the RDD converted to Rows, you can simply use sqlContext.createDataFrame(rows, schema). So when you do
val dfPruned = sqlContext.createDataFrame(rows, schema).select("livesIn")
dfPruned.show
dfPruned.printSchema
you should get this output:
+---------+
| livesIn|
+---------+
| Universe|
| Mysore|
|Hyderabad|
| Blr|
| Chn|
| Del|
+---------+
root
|-- livesIn: string (nullable = true)
Edited
If you want to use the Data Source API, it's even simpler:
sqlContext.read.format("csv").option("delimiter", " ").schema(schema).load("path to your file ").select("livesIn")
should do it.
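For readability, the same call can be spelled out step by step — a minimal sketch, assuming the built-in csv source of Spark 2.x and a placeholder file path:

val df = sqlContext.read
  .format("csv")
  .option("delimiter", " ")
  .schema(schema)              // the StructType defined in the question
  .load("/path/to/input.txt")  // hypothetical path to the input file
df.select("livesIn").show()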
Note: I am using an input file like the one below
KBN 1000000 Universe Parangipettai
Sreedhar 38 Mysore Adoni
Siva 8 Hyderabad Hyderabad
Rishi 23 Blr Hyd
Ram 45 Chn Hyd
Abey 12 Del Hyd
Answer 1 (Score: 0)
If you are trying to apply a schema to an RDD, you can use the createDataFrame function as shown below.
// create a Row from your data by splitting with " "
val rows = rdd.map { value =>
  val data = value.split(" ")
  // you could use Row.fromSeq(data), but the second field needs conversion to Int
  Row(data(0), data(1).toInt, data(2), data(3))
}
// create a DataFrame from the rows and the schema
val df = sqlContext.createDataFrame(rows, schema)
// select only the livesIn column and print it
df.select("livesIn").show()
Output:
+---------+
| livesIn|
+---------+
| Universe|
| Mysore|
|Hyderabad|
| Blr|
| Chn|
| Del|
+---------+
Hope this is helpful!