Spark结构化流中的数据验证 - ETL

时间:2017-03-09 13:31:05

标签: scala apache-spark spark-streaming spark-dataframe

我是结构流编程的新手,我想执行ETL。 我有一些验证/健全检查,帮助确定BAD记录。

Input and Output Data Sample

我的代码:

    val schema = StructType(Array(StructField("id",StringType,true),StructField("lastName",StringType,true),StructField("age",StringType,true),StructField("dept",StringType,true),StructField("sal",StringType,true),StructField("gender",StringType,true),StructField("status",StringType,true)))    
val path : String = "F:/Hadoop/Data"
import spark.implicits._
//RegEx expression to check input data
val col1p = "\\d{3}".r
val col2p = "[a-zA-Z]".r
val col3p = "\\d{3}".r
val col4p = "[10|20|30]".r
val col5p = "[0-9]".r
val col6p = "[M|F]".r
var st : String = "1" //Good Record

//Read input file
val empRaw = spark
.readStream
.option("sep", "~")
.schema(schema)
.csv(path)

//Write output
empRaw.writeStream
.outputMode("Append")
.format("console")
.option("checkpointLocation", "F:/Hadoop/Data/log0")
.start()

spark.streams.awaitAnyTermination()

以下是健全性检查/ ETL检查。不知道在哪里申请。 或者还有其他方法可以做同样的事情。

empRaw.rdd.map(r => {if (col1p.findFirstMatchIn(r(0).toString()).isEmpty || 
   (col1p.findFirstMatchIn(r(1).toString()).isEmpty) ||
   (col1p.findFirstMatchIn(r(2).toString()).isEmpty) ||
   (col1p.findFirstMatchIn(r(3).toString()).isEmpty) ||
   (col1p.findFirstMatchIn(r(4).toString()).isEmpty) ||
   (col1p.findFirstMatchIn(r(5).toString()).isEmpty)) 
  {st = "0"}
  Row(r(0).toString().toInt,r(1).toString(),r(2).toString().toInt,r(3).toString().toInt,
      r(4).toString().toInt,r(5).toString(),st)})

1 个答案:

答案 0 :(得分:1)

udf对此用例有用

进口

import org.apache.spark.sql.functions.udf
import spark.implicits._

创建UDF

val validate = udf((id: String,
                    lname: String,
                    age: String,
                    dept: String,
                    sal: String,
                    gender: String) => {
  //Feel free to improve this condition check code
  if (id.matches("\\d{3}") &&
    lname.matches("[a-zA-Z]") &&
    age.matches("\\d{3}") &&
    dept.matches("[10|20|30]") &&
    sal.matches("[0-9]") &&
    gender.matches("[M|F]")) 1
  else 0
})

使用UDF

empRaw.withColumn("status", 
  validate($"id", $"lastName", $"age", $"dept", $"sal", $"gender"))