I am getting an error while doing this:
val input = spark.read.option("header", "true").option("delimiter", "\t").schema(trFile).csv(fileNameWithPath)
val newSchema = trFile.add("ERROR_COMMENTS", StringType, true)
// Call your custom validation function
val validateDS = dataSetMap.map { row => validateColumns(row) } //<== error here
// Reconstruct the DataFrame with additional columns
val checkedDf = spark.createDataFrame(validateDS, newSchema)
def validateColumns(row: Row): Row = {
  var err_val: String = null
  val effective_date = row.getAs[String]("date")
  .................
  Row.merge(row, Row(err_val))
}
Error message:
◾Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
◾not enough arguments for method map: (implicit evidence$6: org.apache.spark.sql.Encoder[org.apache.spark.sql.Row])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]. Unspecified value parameter evidence$6
I am new to Spark, so please tell me what the problem is here and what the best way to achieve this would be. I am using Spark version 2.3.
Answer 0 (score: 0)
It is easier to use a UDF in this case; then you do not need to worry about the schema changing, using row.getAs, and so on.

First, convert the method into a UDF function:
import org.apache.spark.sql.functions.udf

val validateColumns = udf((date: String, count: String, name: String) => {
  var err_val: String = null
  // error logic using the 3 column strings
  err_val
})
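The body above is only a skeleton. As a purely hypothetical illustration (none of these rules appear in the question), the validation logic inside the UDF could look like this:

// Hypothetical validation rules; replace with your real checks
val validateColumns = udf((date: String, count: String, name: String) => {
  val errs = scala.collection.mutable.ListBuffer[String]()
  if (date == null || date.trim.isEmpty) errs += "date is missing"
  if (count == null || scala.util.Try(count.toInt).isFailure) errs += "count is not numeric"
  if (name == null || name.trim.isEmpty) errs += "name is missing"
  if (errs.isEmpty) null else errs.mkString("; ")  // null means the row passed
})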
To add the new column to the dataframe, use withColumn():

val checkedDf = input.withColumn("ERROR_COMMENTS", validateColumns($"date", $"count", $"name"))
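From there, rows that failed validation can be pulled out through the new column (assuming import spark.implicits._ is in scope for the $ syntax):

checkedDf.filter($"ERROR_COMMENTS".isNotNull).show(false)

As for why the original map call fails: Dataset.map needs an implicit Encoder for its result type, and Spark does not provide an implicit Encoder[Row], which is exactly what both error messages are saying. If you would rather keep the validateColumns(row: Row) approach from the question, you can supply a RowEncoder built from the widened schema explicitly. A minimal sketch, assuming Spark 2.3 and that trFile is the input StructType:

import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.catalyst.encoders.RowEncoder

val newSchema = trFile.add("ERROR_COMMENTS", StringType, true)

// Passing the encoder explicitly satisfies map()'s implicit Encoder[Row] parameter
val validateDS = input.map(row => validateColumns(row))(RowEncoder(newSchema))
val checkedDf = validateDS.toDF()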