Validate columns and write the error message in another column

Time: 2018-08-06 20:34:49

Tags: scala apache-spark apache-spark-sql

I am getting an error when I run this code:


val input = spark.read.option("header", "true").option("delimiter", "\t").schema(trFile).csv(fileNameWithPath)

val newSchema = trFile.add("ERROR_COMMENTS", StringType, true)

// Call your custom validation function
val validateDS = dataSetMap.map { row => validateColumns(row) }    // <== error here

// Reconstruct the DataFrame with the additional column
val checkedDf = spark.createDataFrame(validateDS, newSchema)

def validateColumns(row: Row): Row = {
  var err_val: String = null
  val effective_date = row.getAs[String]("date")
  .................

  Row.merge(row, Row(err_val))
}

Error message:

- Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
- not enough arguments for method map: (implicit evidence$6: org.apache.spark.sql.Encoder[org.apache.spark.sql.Row])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]. Unspecified value parameter evidence$6

I am new to Spark. Please tell me what the problem is here and what the best way to achieve this would be. I am using Spark version 2.3.

1 Answer:

Answer 0: (score: 0)

It would be easier to use a UDF in this case; then you do not have to worry about the schema changing, using row.getAs, and so on.

First, convert your method into a UDF:

import org.apache.spark.sql.functions.udf

val validateColumns = udf((date: String, count: String, name: String) => {
  // error logic using the 3 column values
  err_val
})
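For illustration, here is a minimal sketch of what the error logic might look like. The three rules below (date parses as yyyy-MM-dd, count is numeric, name is non-empty) are assumptions for the example, not requirements taken from the question:

import java.time.LocalDate
import java.time.format.DateTimeFormatter

import scala.util.Try

import org.apache.spark.sql.functions.udf

val validateColumns = udf((date: String, count: String, name: String) => {
  val errors = scala.collection.mutable.ListBuffer[String]()

  // Assumed rule: date must parse as yyyy-MM-dd
  if (date == null || Try(LocalDate.parse(date, DateTimeFormatter.ISO_LOCAL_DATE)).isFailure)
    errors += s"invalid date: $date"

  // Assumed rule: count must be numeric
  if (count == null || Try(count.toLong).isFailure)
    errors += s"invalid count: $count"

  // Assumed rule: name must be non-empty
  if (name == null || name.trim.isEmpty)
    errors += "missing name"

  // Return null when every check passed, mirroring err_val in the question
  if (errors.isEmpty) null else errors.mkString("; ")
})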

To add the new column to the DataFrame, use withColumn():

val checkedDf = input.withColumn("ERROR_COMMENTS", validateColumns($"date", $"count", $"name"))
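With the column in place you can, for example, pull out just the rows that failed validation (assuming, as in the question, that null means "no error"):

// Rows whose validation produced an error comment
val badRows = checkedDf.filter($"ERROR_COMMENTS".isNotNull)
badRows.show(truncate = false)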