Validation dataframe:
+---------+---------------------------+-------------------------+
|dataframe|Validation Checks |cols |
+---------+---------------------------+-------------------------+
|Attendee |isEmpty,IsNull |col1,col2,col3 |
+---------+---------------------------+-------------------------+
Attendee dataframe:
col1 col2 col3
a1   a2   a3
     b2   b3
c1   c2   c3
d1   d2   d3
Expected result dataframe:
col1 col2 col3 status
a1   a2   a3   clean
     b2   b3   dirty
c1   c2   c3   clean
d1   d2   d3   clean
Code used:
var columns = df.columns //struct(df.columns map col: _*)
val colDF = df.select(col("dataframe"))
var tablename = colDF.head().toSeq
val checkDF = df.select(col("Validation Checks"))
val opsColDF = df.select(col("cols"))
val opsColumn = opsColDF.columns
println("opsColumn :::" + opsColumn)
Answer 0 (score: 0)
If you have a dataframe such as
+---------+-----------------+--------------+
|dataframe|Validation Checks|cols |
+---------+-----------------+--------------+
|Attendee |isEmpty,isNull |col1,col2,col3|
+---------+-----------------+--------------+
you should build a SQL query out of those column values. I used a udf function to create another column holding a valid query:
import org.apache.spark.sql.functions._

// builds one SQL statement per metadata row from the table name,
// the comma-separated checks and the comma-separated columns
def createQueryUdf = udf((table: String, logic: String, cols: String) => {
  "select *, case when " +
    cols.split(",")
      .map(_.trim)
      .map(x => logic.split(",")
        .map(_.trim.toLowerCase)
        .map {
          case y if (y == "isempty") => s"$x like ''" // no isEmpty in SQL, test against the empty string
          case y => s"$y($x)"                         // e.g. isnull -> isnull(col1)
        }.mkString(" or "))
      .mkString(" or ") +
    s" then 'dirty' else 'clean' end as status from $table"
})
val dataframeWithQuery = df.withColumn("query", createQueryUdf(col("dataframe"), col("Validation Checks"), col("cols")))
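Note that everything except the special-cased isempty ends up as a plain function call in the generated SQL, so each remaining check name in Validation Checks must be a function Spark SQL actually knows: isNull works because Spark SQL provides isnull(expr), while isEmpty has no SQL counterpart and is rewritten as a like '' comparison against the empty string.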
So dataframeWithQuery would be
+---------+-----------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|dataframe|Validation Checks|cols |query |
+---------+-----------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Attendee |isEmpty,isNull |col1,col2,col3|select *, case when col1 like '' or isnull(col1) or col2 like '' or isnull(col2) or col3 like '' or isnull(col3) then 'dirty' else 'clean' end as status from Attendee|
+---------+-----------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Now you can pick up the valid queries and run them against the matching dataframes, but before that each dataframe has to be registered as a temp view under the same name that appears in the dataframe column (here Attendee), since that is the name the generated query selects from:
attendee.createOrReplaceTempView("Attendee")
Then you can collect the query column and loop over it to apply each query statement:
val queryArray = dataframeWithQuery.select("query").collect.map(_.getAs[String]("query"))
for (query <- queryArray) {
  spark.sql(query).show(false)
}
which should give you
+----+----+----+------+
|col1|col2|col3|status|
+----+----+----+------+
|a1 |a2 |a3 |clean |
| |b2 |b3 |dirty |
|c1 |c2 |c3 |clean |
|d1 |d2 |d3 |clean |
+----+----+----+------+
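The second row comes out dirty because its first field is empty: depending on how the data was read it is either an empty string, caught by col1 like '', or a null, caught by isnull(col1), and either branch of the generated case expression flags the row.

If you want to keep the validated frames around for further processing rather than only print them, a minimal variation is to map the queries to DataFrames; the resultDFs name below is my own, not part of the original answer:

// hypothetical sketch: collect each validated result as a DataFrame
val resultDFs = queryArray.map(query => spark.sql(query))
resultDFs.foreach(_.show(false))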
Now you should have an idea of how to take it further. I hope the answer is helpful.
Answer 1 (score: 0)
package com.incedo.pharma

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object objValidation {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("columncheck")
      .master("local[*]")
      .getOrCreate()

    // validation metadata: tablename, Validation Checks, cols
    val df = spark.read.format("com.databricks.spark.csv")
      .option("header", true)
      .option("delimiter", ",")
      .load("tablecolcheck.csv")
    df.show()

    var AttendeeDF = df
    val tableNameArray = df.select(col("tablename")).collect().toSeq

    val dataframeWithQuery = df.withColumn("query",
      createQueryUdf(df("tablename"), df("Validation Checks"), df("cols")))
    dataframeWithQuery.show(false)

    tableNameArray.foreach(tableArray => {
      // load the table named in the metadata row, e.g. Attendee -> Attendee.csv
      AttendeeDF = spark.read.format("com.databricks.spark.csv")
        .option("header", true)
        .option("delimiter", ",")
        .load(tableArray.getString(0) + ".csv")
      AttendeeDF.show(false)

      // register under the same name the generated query selects from
      AttendeeDF.createOrReplaceTempView(tableArray.getString(0))

      val queryArray = dataframeWithQuery.select("query").collect.map(_.getAs[String]("query"))
      println("queryArray ---- " + queryArray.toSeq)
      for (query <- queryArray) {
        spark.sql(query).show(false)
      }
    })
  }

  def createQueryUdf = udf((table: String, logic: String, cols: String) => {
    "select *, case when " +
      cols.split(",")
        .map(_.trim)
        .map(x => logic.split(",")
          .map(_.trim.toLowerCase)
          .map {
            case y if (y == "isempty") => s"$x like ''"     // empty-string check
            case y if (y == "gt>3")    => s"length($x) > 3" // custom length check
            case y => s"$y($x)"                             // e.g. isnull(col1)
          }.mkString(" or "))
        .mkString(" or ") +
      s" then 'dirty' else 'clean' end as status from $table"
  })
}
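For this to run, tablecolcheck.csv presumably looks something like the row below; the exact layout is an assumption inferred from the column names the code reads (tablename, Validation Checks, cols) and is not shown in the original answer:

tablename,Validation Checks,cols
Attendee,"isEmpty,isNull","col1,col2,col3"

The quotes are needed because those fields contain the comma delimiter. Attendee.csv would then hold the col1,col2,col3 data from the question, and any further metadata rows pointing at other tablename.csv files get validated by the same loop.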