Spark SQL not-equal comparison does not work on a column created with lag

Time: 2019-12-12 14:32:51

Tags: apache-spark apache-spark-sql

I am trying to find missing sequence numbers in a CSV file. This is the code I have:

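import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Field order follows the CSV columns: MessageId, site, msgType, msgSeqID
val customSchema = StructType(
                          Array(
                            StructField("MessageId", StringType, false),
                            StructField("site", StringType, false),
                            StructField("msgType", StringType, false),
                            StructField("msgSeqID", LongType, false)
                         )
                       )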
                       
 val logFileDF = sparkSession.sqlContext.read.format("csv")
        .option("delimiter",",")
        .option("header", false)
        .option("mode", "DROPMALFORMED")
        .schema(customSchema)
        .load(logFilePath)
        .toDF()

logFileDF.printSchema()


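// Note: repartition returns a new DataFrame; the result is discarded here, so this call has no effect
logFileDF.repartition(5000)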
logFileDF.createOrReplaceTempView("LogMessageData")
sparkSession.sqlContext.cacheTable("LogMessageData")

val selectQuery: String = "SELECT MessageId,site,msgSeqID,msgType,lag(msgSeqID,1) over (partition by site,msgType order by site,msgType,msgSeqID) as prev_val FROM LogMessageData order by site,msgType,msgSeqID"

val logFileLagRowNumDF = sparkSession.sqlContext.sql(selectQuery).toDF()

// Note: as above, the repartitioned DataFrame is not assigned, so this call has no effect
logFileLagRowNumDF.repartition(1000)

logFileLagRowNumDF.printSchema()

logFileLagRowNumDF.createOrReplaceTempView("LogMessageDataUpdated")
sparkSession.sqlContext.cacheTable("LogMessageDataUpdated")

val errorRecordQuery: String = "select * from LogMessageDataUpdated where prev_val!=null and msgSeqID != prev_val+1 order by site,msgType,msgSeqID";

val errorRecordQueryDF = sparkSession.sqlContext.sql(errorRecordQuery).toDF()

// Count once and reuse the value, so the query is not executed twice
val noOfMissingRecords = errorRecordQueryDF.count()

logger.info("Total. No.of Missing records =[" + noOfMissingRecords + "]")

This is my sample data:

Msg_1,S1,A,10000000
Msg_2,S1,A,10000002
Msg_3,S2,A,10000003
Msg_4,S3,B,10000000
Msg_5,S3,B,10000001
Msg_6,S3,A,10000003
Msg_7,S3,A,10000001
Msg_8,S3,A,10000002
Msg_9,S4,A,10000000
Msg_10,S4,A,10000001
Msg_11,S4,A,10000000
Msg_12,S4,A,10000005

This is the output I get:

root
 |-- MessageId: string (nullable = true)
 |-- site: string (nullable = true)
 |-- msgType: string (nullable = true)
 |-- msgSeqID: long (nullable = true)
INFO  EimComparisonProcessV2 - Total No.Of Records =[12]
+---------+----+-------+--------+
|MessageId|site|msgType|msgSeqID|
+---------+----+-------+--------+
|    Msg_1|  S1|      A|10000000|
|    Msg_2|  S1|      A|10000002|
|    Msg_3|  S2|      A|10000003|
|    Msg_4|  S3|      B|10000000|
|    Msg_5|  S3|      B|10000001|
|    Msg_6|  S3|      A|10000003|
|    Msg_7|  S3|      A|10000001|
|    Msg_8|  S3|      A|10000002|
|    Msg_9|  S4|      A|10000000|
|   Msg_10|  S4|      A|10000001|
|   Msg_11|  S4|      A|10000000|
|   Msg_12|  S4|      A|10000005|
+---------+----+-------+--------+
root
 |-- MessageId: string (nullable = true)
 |-- site: string (nullable = true)
 |-- msgSeqID: long (nullable = true)
 |-- msgType: string (nullable = true)
 |-- prev_val: long (nullable = true)
+---------+----+--------+-------+--------+
|MessageId|site|msgSeqID|msgType|prev_val|
+---------+----+--------+-------+--------+
|    Msg_1|  S1|10000000|      A|    null|
|    Msg_2|  S1|10000002|      A|10000000|
|    Msg_3|  S2|10000003|      A|    null|
|    Msg_7|  S3|10000001|      A|    null|
|    Msg_8|  S3|10000002|      A|10000001|
+---------+----+--------+-------+--------+
only showing top 5 rows
INFO  TestProcess - Total No.Of Records Updated DF=[12]
INFO  TestProcess - Total. No.of Missing records =[0]

1 answer:

Answer 0 (score: 0)

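In SQL, comparing a value to null with != (or =) evaluates to NULL rather than true, so the condition prev_val != null is never satisfied and the WHERE clause filters out every row. Test for null with IS NOT NULL instead (<> and != are equivalent not-equal operators in Spark SQL). Replace your query with this line and it will show the result: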

val errorRecordQuery: String = "select * from LogMessageDataUpdated where prev_val is not null and msgSeqID <> prev_val+1 order by site,msgType,msgSeqID";
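
The same check can also be written with the DataFrame API instead of a SQL string. The following is only a minimal sketch, not the asker's actual job: the object name SequenceGaps, the helper findGaps, and the local[*] session are assumptions made for illustration, and the sample rows are copied from the question. It uses Column.isNotNull for the null test and =!= for the not-equal comparison.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag}

// Hypothetical helper object, named here only for illustration.
object SequenceGaps {

  // Keeps the rows whose msgSeqID is not exactly prev_val + 1,
  // where prev_val is the previous msgSeqID in the same (site, msgType) group.
  def findGaps(df: DataFrame): DataFrame = {
    val w = Window.partitionBy("site", "msgType").orderBy("msgSeqID")
    df.withColumn("prev_val", lag(col("msgSeqID"), 1).over(w))
      // isNotNull is the DataFrame counterpart of SQL "IS NOT NULL";
      // comparing to null with === or =!= would yield null and drop every row.
      .filter(col("prev_val").isNotNull && col("msgSeqID") =!= col("prev_val") + 1)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for trying the sketch out on a laptop.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SequenceGaps")
      .getOrCreate()
    import spark.implicits._

    // Sample rows from the question.
    val logs = Seq(
      ("Msg_1", "S1", "A", 10000000L),
      ("Msg_2", "S1", "A", 10000002L),
      ("Msg_3", "S2", "A", 10000003L),
      ("Msg_4", "S3", "B", 10000000L),
      ("Msg_5", "S3", "B", 10000001L),
      ("Msg_6", "S3", "A", 10000003L),
      ("Msg_7", "S3", "A", 10000001L),
      ("Msg_8", "S3", "A", 10000002L),
      ("Msg_9", "S4", "A", 10000000L),
      ("Msg_10", "S4", "A", 10000001L),
      ("Msg_11", "S4", "A", 10000000L),
      ("Msg_12", "S4", "A", 10000005L)
    ).toDF("MessageId", "site", "msgType", "msgSeqID")

    findGaps(logs).orderBy("site", "msgType", "msgSeqID").show()
    spark.stop()
  }
}
```

With the twelve sample rows, this should flag three rows: Msg_2, Msg_12, and one of the two S4/A rows that share msgSeqID 10000000 (which of the pair is flagged depends on how Spark orders the tie). Note that a duplicated sequence number is reported as well as a gap, because the condition only checks that each value is exactly one more than the previous one.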