我正在尝试将spark中的结构传递给udf。它正在更改字段名称并重命名为列位置。我如何解决它?
object TestCSV {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("localTest").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val inputData = sqlContext.read.format("com.databricks.spark.csv")
.option("delimiter","|")
.option("header", "true")
.load("test.csv")
inputData.printSchema()
inputData.show()
val groupedData = inputData.withColumn("name",struct(inputData("firstname"),inputData("lastname")))
val udfApply = groupedData.withColumn("newName",processName(groupedData("name")))
udfApply.show()
}
def processName = udf((input:Row) =>{
println(input)
println(input.schema)
Map("firstName" -> input.getAs[String]("firstname"), "lastName" -> input.getAs[String]("lastname"))
})
}
输出:
root
|-- id: string (nullable = true)
|-- firstname: string (nullable = true)
|-- lastname: string (nullable = true)
+---+---------+--------+
| id|firstname|lastname|
+---+---------+--------+
| 1| jack| reacher|
| 2| john| Doe|
+---+---------+--------+
错误:
[插孔,伸缩器] StructType(StructField(i [1],StringType,true),> StructField(i [2],StringType,true)) 17/03/08 09:45:35错误执行者:阶段2.0(TID 2)中任务0.0的异常 java.lang.IllegalArgumentException:Field" firstname"不存在。
答案 0 :(得分:2)
你遇到的事情真的很奇怪。在玩了一下后我终于发现它可能与优化器引擎的问题有关。似乎问题不是UDF而是struct
函数。
当cache
groupedData
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}
object Demo {
def main(args: Array[String]): Unit = {
val sc = new SparkContext(new SparkConf().setAppName("Demo").setMaster("local[1]"))
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.functions._
def processName = udf((input: Row) => {
Map("firstName" -> input.getAs[String]("firstname"), "lastName" -> input.getAs[String]("lastname"))
})
val inputData =
sc.parallelize(
Seq(("1", "Kevin", "Costner"))
).toDF("id", "firstname", "lastname")
val groupedData = inputData.withColumn("name", struct(inputData("firstname"), inputData("lastname")))
.cache() // does not work without cache
val udfApply = groupedData.withColumn("newName", processName(groupedData("name")))
udfApply.show()
}
}
时,我开始工作(Spark 1.6.3),没有缓存我得到报告的异常:
case class Name(firstname:String,lastname:String) // define outside main
val groupedData = inputData.rdd
.map{r =>
(r.getAs[String]("id"),
Name(
r.getAs[String]("firstname"),
r.getAs[String]("lastname")
)
)
}
.toDF("id","name")
或者你可以使用RDD API来制作你的结构,但这不是很好:
nbe.site.embed_write(test, {
id : 'fon5o6gtr3gh9gsl7kb2',
read : 'g4xhwwdi34hm19w9fcpm',
write : 'ppnx1ye1kl6unk10ov08'
});