I am new to Structured Streaming and I want to perform ETL. I have some validation/sanity checks that help identify BAD records.
My code:
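import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
// Every column is read as a string so that it can be checked against a regex
val schema = StructType(Array(
  StructField("id", StringType, true),
  StructField("lastName", StringType, true),
  StructField("age", StringType, true),
  StructField("dept", StringType, true),
  StructField("sal", StringType, true),
  StructField("gender", StringType, true),
  StructField("status", StringType, true)))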
val path : String = "F:/Hadoop/Data"
import spark.implicits._
//RegEx expression to check input data
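// Anchored with ^...$ because findFirstMatchIn accepts a match anywhere in the string
val col1p = "^\\d{3}$".r
val col2p = "^[a-zA-Z]+$".r
val col3p = "^\\d{3}$".r
// (10|20|30) is an alternation; [10|20|30] would be a character class matching one character
val col4p = "^(10|20|30)$".r
val col5p = "^[0-9]+$".r
val col6p = "^[MF]$".r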
var st : String = "1" //Good Record
//Read input file
val empRaw = spark
.readStream
.option("sep", "~")
.schema(schema)
.csv(path)
//Write output
empRaw.writeStream
.outputMode("Append")
.format("console")
.option("checkpointLocation", "F:/Hadoop/Data/log0")
.start()
spark.streams.awaitAnyTermination()
Below are the sanity/ETL checks. I don't know where to apply them, or whether there is another way to do the same thing.
empRaw.rdd.map(r => {
  // Check each column against its own pattern. The status is computed per record,
  // so one bad record does not mark every later record as bad.
  val status =
    if (col1p.findFirstMatchIn(r(0).toString()).isEmpty ||
        col2p.findFirstMatchIn(r(1).toString()).isEmpty ||
        col3p.findFirstMatchIn(r(2).toString()).isEmpty ||
        col4p.findFirstMatchIn(r(3).toString()).isEmpty ||
        col5p.findFirstMatchIn(r(4).toString()).isEmpty ||
        col6p.findFirstMatchIn(r(5).toString()).isEmpty) "0" else "1"
  Row(r(0).toString().toInt, r(1).toString(), r(2).toString().toInt, r(3).toString().toInt,
    r(4).toString().toInt, r(5).toString(), status)})
Answer 0 (score: 1)
A udf is useful for this use case:
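import org.apache.spark.sql.functions.udf
import spark.implicits._
// Returns 1 for a good record and 0 for a bad one
val validate = udf((id: String,
                    lname: String,
                    age: String,
                    dept: String,
                    sal: String,
                    gender: String) => {
  // Feel free to improve this condition check code.
  // String.matches tests the whole string; a null field marks the record as bad.
  if (Seq(id, lname, age, dept, sal, gender).forall(_ != null) &&
      id.matches("\\d{3}") &&
      lname.matches("[a-zA-Z]+") &&
      age.matches("\\d{3}") &&
      dept.matches("10|20|30") &&
      sal.matches("[0-9]+") &&
      gender.matches("[MF]")) 1
  else 0
})
// Apply the check to the streaming DataFrame and call writeStream on the result.
// withColumn replaces the status column that was read from the file.
val validated = empRaw.withColumn("status",
  validate($"id", $"lastName", $"age", $"dept", $"sal", $"gender"))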
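The condition check can also be tried outside Spark before it is wrapped in a udf. What follows is only a minimal sketch in plain Scala: the object name RecordCheck, the method name status and the sample values are hypothetical, and the patterns assume that dept must be exactly 10, 20 or 30 and that lastName and sal may be longer than one character.

```scala
object RecordCheck {
  // Hypothetical helper: returns 1 for a good record, 0 for a bad one.
  // A null field (for example a missing CSV column) counts as bad.
  def status(id: String, lname: String, age: String,
             dept: String, sal: String, gender: String): Int = {
    // Each field paired with the pattern it must match in full
    val checks = Seq(
      id     -> "\\d{3}",
      lname  -> "[a-zA-Z]+",
      age    -> "\\d{3}",
      dept   -> "10|20|30",
      sal    -> "[0-9]+",
      gender -> "[MF]"
    )
    // String.matches tests the whole string, so no ^...$ anchors are needed
    if (checks.forall { case (value, pattern) => value != null && value.matches(pattern) }) 1
    else 0
  }
}
```

Square brackets form a character class, so "[10|20|30]" accepts a single character out of 1, 0, 2, 3 and |, which is why a two-character value such as "10" fails it; the alternation "10|20|30" is probably what was meant.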