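Spark struct field names change inside a UDF

Date: 2017-03-08 14:48:51

Tags: apache-spark struct udf

I am trying to pass a struct to a UDF in Spark. Inside the UDF the field names are changed and renamed to their column positions. How can I fix this?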

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.functions.{struct, udf}

    object TestCSV {

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("localTest").setMaster("local")
        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)

        // Read the pipe-delimited CSV; column names come from the header row
        val inputData = sqlContext.read.format("com.databricks.spark.csv")
          .option("delimiter", "|")
          .option("header", "true")
          .load("test.csv")

        inputData.printSchema()
        inputData.show()

        // Combine firstname and lastname into a single struct column
        val groupedData = inputData.withColumn("name", struct(inputData("firstname"), inputData("lastname")))

        val udfApply = groupedData.withColumn("newName", processName(groupedData("name")))

        udfApply.show()
      }

      // The lookup by field name is what throws the exception shown below
      def processName = udf((input: Row) => {
        println(input)
        println(input.schema)

        Map("firstName" -> input.getAs[String]("firstname"), "lastName" -> input.getAs[String]("lastname"))
      })
    }

Output:

 root
 |-- id: string (nullable = true)
 |-- firstname: string (nullable = true)
 |-- lastname: string (nullable = true)

 +---+---------+--------+
 | id|firstname|lastname|
 +---+---------+--------+
 |  1|     jack| reacher|
 |  2|     john|     Doe|
 +---+---------+--------+

Error:

  

 [jack,reacher]
 StructType(StructField(i[1],StringType,true), StructField(i[2],StringType,true))
 17/03/08 09:45:35 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
 java.lang.IllegalArgumentException: Field "firstname" does not exist.

1 answer:

Answer 0 (score: 2)

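What you are running into is really strange. After playing around a bit, I found that it may be related to a problem in the optimizer engine. The problem seems to lie not in the UDF but in the struct function.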
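The printed row in the question, [jack,reacher], suggests the values still arrive in order and only the field names are lost. One possible way around that, offered here as an untested sketch rather than a confirmed fix, is to read the struct by index instead of by name. The names `PositionalName`, `toNameMap` and `processNameByPosition` are hypothetical, and the code assumes index 0 is firstname and index 1 is lastname, i.e. the order in which the columns were passed to struct:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

object PositionalName {

  // Pure helper, kept separate so the mapping can be checked without Spark.
  def toNameMap(first: String, last: String): Map[String, String] =
    Map("firstName" -> first, "lastName" -> last)

  // Reads the struct fields by position rather than by field name.
  // Assumption: index 0 = firstname, index 1 = lastname
  // (the order used in the struct(...) call).
  val processNameByPosition = udf((input: Row) =>
    toNameMap(input.getString(0), input.getString(1)))
}
```

It would be applied the same way as the original UDF, for example `groupedData.withColumn("newName", PositionalName.processNameByPosition(groupedData("name")))`. The drawback is that the UDF silently depends on the column order, so it breaks if the struct is built differently.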

It works for me (Spark 1.6.3) when I cache groupedData; without the cache I get the exception you reported:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.{SparkConf, SparkContext}

    object Demo {

      def main(args: Array[String]): Unit = {

        val sc = new SparkContext(new SparkConf().setAppName("Demo").setMaster("local[1]"))
        val sqlContext = new HiveContext(sc)
        import sqlContext.implicits._
        import org.apache.spark.sql.functions._

        def processName = udf((input: Row) => {
          Map("firstName" -> input.getAs[String]("firstname"), "lastName" -> input.getAs[String]("lastname"))
        })

        val inputData = sc.parallelize(
          Seq(("1", "Kevin", "Costner"))
        ).toDF("id", "firstname", "lastname")

        val groupedData = inputData.withColumn("name", struct(inputData("firstname"), inputData("lastname")))
          .cache() // does not work without cache

        val udfApply = groupedData.withColumn("newName", processName(groupedData("name")))
        udfApply.show()
      }
    }

Alternatively, you can build the struct with the RDD API, although this is not very elegant:

    case class Name(firstname: String, lastname: String) // define outside main

    val groupedData = inputData.rdd
      .map { r =>
        (r.getAs[String]("id"),
          Name(
            r.getAs[String]("firstname"),
            r.getAs[String]("lastname")
          )
        )
      }
      .toDF("id", "name")
