I am trying to find missing sequence numbers in a CSV file. This is the code I have:
val customSchema = StructType(
  Array(
    // Field order follows the CSV columns shown in the sample data and printed schema
    StructField("MessageId", StringType, false),
    StructField("site", StringType, false),
    StructField("msgType", StringType, false),
    StructField("msgSeqID", LongType, false)
  )
)
val logFileDF = sparkSession.sqlContext.read.format("csv")
  .option("delimiter",",")
  .option("header", false)
  .option("mode", "DROPMALFORMED")
  .schema(customSchema)
  .load(logFilePath)
  .toDF()
logFileDF.printSchema()
logFileDF.repartition(5000)
logFileDF.createOrReplaceTempView("LogMessageData")
sparkSession.sqlContext.cacheTable("LogMessageData")
val selectQuery: String = "SELECT MessageId,site,msgSeqID,msgType,lag(msgSeqID,1) over (partition by site,msgType order by site,msgType,msgSeqID) as prev_val FROM LogMessageData order by site,msgType,msgSeqID"
val logFileLagRowNumDF = sparkSession.sqlContext.sql(selectQuery).toDF()
logFileLagRowNumDF.repartition(1000)
logFileLagRowNumDF.printSchema()
logFileLagRowNumDF.createOrReplaceTempView("LogMessageDataUpdated")
sparkSession.sqlContext.cacheTable("LogMessageDataUpdated")
val errorRecordQuery: String = "select * from LogMessageDataUpdated where prev_val!=null and msgSeqID != prev_val+1 order by site,msgType,msgSeqID";
val errorRecordQueryDF = sparkSession.sqlContext.sql(errorRecordQuery).toDF()
logger.info("Total. No.of Missing records =[" + errorRecordQueryDF.count() + "]")
val noOfMissingRecords = errorRecordQueryDF.count()
This is my sample data:
Msg_1,S1,A,10000000
Msg_2,S1,A,10000002
Msg_3,S2,A,10000003
Msg_4,S3,B,10000000
Msg_5,S3,B,10000001
Msg_6,S3,A,10000003
Msg_7,S3,A,10000001
Msg_8,S3,A,10000002
Msg_9,S4,A,10000000
Msg_10,S4,A,10000001
Msg_11,S4,A,10000000
Msg_12,S4,A,10000005
This is the output I get:
root
|-- MessageId: string (nullable = true)
|-- site: string (nullable = true)
|-- msgType: string (nullable = true)
|-- msgSeqID: long (nullable = true)
INFO EimComparisonProcessV2 - Total No.Of Records =[12]
+---------+----+-------+--------+
|MessageId|site|msgType|msgSeqID|
+---------+----+-------+--------+
|    Msg_1|  S1|      A|10000000|
|    Msg_2|  S1|      A|10000002|
|    Msg_3|  S2|      A|10000003|
|    Msg_4|  S3|      B|10000000|
|    Msg_5|  S3|      B|10000001|
|    Msg_6|  S3|      A|10000003|
|    Msg_7|  S3|      A|10000001|
|    Msg_8|  S3|      A|10000002|
|    Msg_9|  S4|      A|10000000|
|   Msg_10|  S4|      A|10000001|
|   Msg_11|  S4|      A|10000000|
|   Msg_12|  S4|      A|10000005|
+---------+----+-------+--------+
root
|-- MessageId: string (nullable = true)
|-- site: string (nullable = true)
|-- msgSeqID: long (nullable = true)
|-- msgType: string (nullable = true)
|-- prev_val: long (nullable = true)
+---------+----+--------+-------+--------+
|MessageId|site|msgSeqID|msgType|prev_val|
+---------+----+--------+-------+--------+
|    Msg_1|  S1|10000000|      A|    null|
|    Msg_2|  S1|10000002|      A|10000000|
|    Msg_3|  S2|10000003|      A|    null|
|    Msg_7|  S3|10000001|      A|    null|
|    Msg_8|  S3|10000002|      A|10000001|
+---------+----+--------+-------+--------+
only showing top 5 rows
INFO TestProcess - Total No.Of Records Updated DF=[12]
INFO TestProcess - Total. No.of Missing records =[0]
Answer 0 (score: 0)
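Replace your query with the line below and it will return the expected rows. In SQL, any comparison with NULL, including prev_val != null, evaluates to NULL rather than true, so the original WHERE clause filters out every row. Use IS NOT NULL to test for NULL instead.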
val errorRecordQuery: String = "select * from LogMessageDataUpdated where prev_val is not null and msgSeqID <> prev_val+1 order by site,msgType,msgSeqID";
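For reference, the same check can probably be written with the DataFrame API instead of a SQL string. The following is only a sketch: it assumes Spark 2.0 or later, and the object name SequenceGapCheck, the helper findGaps, and the local SparkSession are made up for illustration and do not come from the question. The sample rows are copied from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag}

// Hypothetical helper object, not part of the original code.
object SequenceGapCheck {

  // Returns the rows whose msgSeqID is not exactly one more than the
  // previous msgSeqID within the same (site, msgType) group.
  def findGaps(df: DataFrame): DataFrame = {
    val bySiteAndType = Window.partitionBy("site", "msgType").orderBy("msgSeqID")
    df.withColumn("prev_val", lag(col("msgSeqID"), 1).over(bySiteAndType))
      // isNotNull corresponds to SQL's IS NOT NULL; =!= is Column inequality.
      .filter(col("prev_val").isNotNull && (col("msgSeqID") =!= (col("prev_val") + 1)))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (an assumption, not from the question).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SequenceGapCheck")
      .getOrCreate()
    import spark.implicits._

    // Sample rows from the question.
    val sample = Seq(
      ("Msg_1", "S1", "A", 10000000L),
      ("Msg_2", "S1", "A", 10000002L),
      ("Msg_3", "S2", "A", 10000003L),
      ("Msg_4", "S3", "B", 10000000L),
      ("Msg_5", "S3", "B", 10000001L),
      ("Msg_6", "S3", "A", 10000003L),
      ("Msg_7", "S3", "A", 10000001L),
      ("Msg_8", "S3", "A", 10000002L),
      ("Msg_9", "S4", "A", 10000000L),
      ("Msg_10", "S4", "A", 10000001L),
      ("Msg_11", "S4", "A", 10000000L),
      ("Msg_12", "S4", "A", 10000005L)
    ).toDF("MessageId", "site", "msgType", "msgSeqID")

    val gaps = findGaps(sample)
    gaps.orderBy("site", "msgType", "msgSeqID").show()
    println(s"Rows breaking the sequence: ${gaps.count()}")

    spark.stop()
  }
}
```

On the sample data this should flag three rows: Msg_2 (10000001 is missing before it), Msg_12 (10000002 through 10000004 are missing before it), and one of the two S4/A rows that share 10000000, because a repeated value also fails the prev_val + 1 test. If duplicates should not count as gaps, they would need to be removed before the window is applied.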