Change data capture with Apache Spark

Time: 2019-09-29 10:48:38

Tags: apache-spark window-functions

What is the best way to solve this problem using Apache Spark?

My dataset is as follows:

ID, DATE,       TIME, VALUE
001,2019-01-01, 0010, 150
001,2019-01-01, 0020, 150
001,2019-01-01, 0030, 160
001,2019-01-01, 0040, 160
001,2019-01-01, 0050, 150
002,2019-01-01, 0010, 151
002,2019-01-01, 0020, 151
002,2019-01-01, 0030, 161
002,2019-01-01, 0040, 162
002,2019-01-01, 0051, 152

I need to keep the rows where the "VALUE" changes for each ID.

My expected output:

ID, DATE,       TIME, VALUE
001,2019-01-01, 0010, 150
001,2019-01-01, 0030, 160
001,2019-01-01, 0050, 150
002,2019-01-01, 0010, 151
002,2019-01-01, 0030, 161
002,2019-01-01, 0040, 162
002,2019-01-01, 0051, 152

1 Answer:

Answer 0 (score: 2)

You can use the lag function over a Window:

With your input recreated as a DataFrame:

val df = Seq(
  ("001", "2019-01-01", "0010", "150"),
  ("001", "2019-01-01", "0020", "150"),
  ("001", "2019-01-01", "0030", "160"),
  ("001", "2019-01-01", "0040", "160"),
  ("001", "2019-01-01", "0050", "150"),
  ("002", "2019-01-01", "0010", "151"),
  ("002", "2019-01-01", "0020", "151"),
  ("002", "2019-01-01", "0030", "161"),
  ("002", "2019-01-01", "0040", "162"),
  ("002", "2019-01-01", "0051", "152")
).toDF("ID", "DATE", "TIME", "VALUE")


import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, lag, lit}

df
  // "change" is true when VALUE differs from the previous row's VALUE within the
  // same ID; lag() returns null on each partition's first row, so coalesce()
  // defaults that first comparison to true
  .withColumn("change", coalesce($"VALUE" =!= lag($"VALUE", 1).over(Window.partitionBy($"ID").orderBy($"TIME")), lit(true)))
  .where($"change")
  .drop("change") // remove the helper column before displaying
  .show()
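For checking the logic without a Spark cluster, here is a minimal plain-Scala sketch of the same rule (the `Reading` case class and `keepChanges` helper are hypothetical names introduced for illustration, not part of any Spark API): within each ID, with rows ordered by TIME, a row is kept when its VALUE differs from the previous row's VALUE, mirroring what lag over the partitioned window does.

```scala
// Hypothetical non-Spark sketch of the change-data-capture rule above,
// using plain Scala collections. Names are illustrative only.
case class Reading(id: String, date: String, time: String, value: String)

def keepChanges(rows: Seq[Reading]): Seq[Reading] =
  rows
    .groupBy(_.id)          // one group per ID, like partitionBy($"ID")
    .toSeq
    .sortBy(_._1)           // deterministic output order across IDs
    .flatMap { case (_, group) =>
      val sorted = group.sortBy(_.time)   // like orderBy($"TIME")
      // Pair each row with the previous row's VALUE (None for the first
      // row of a group), which plays the role of lag($"VALUE", 1).
      sorted.zip(None +: sorted.map(r => Option(r.value)))
        .collect { case (row, prev) if !prev.contains(row.value) => row }
    }
```

Applied to the dataset in the question, this keeps times 0010, 0030, 0050 for ID 001 and 0010, 0030, 0040, 0051 for ID 002, matching the windowed Spark version.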