What is the best way to solve this problem using Apache Spark?
My dataset is as follows. I need to keep a row only when the VALUE changes for each ID:
ID, DATE, TIME, VALUE
001, 2019-01-01, 0010, 150
001, 2019-01-01, 0020, 150
001, 2019-01-01, 0030, 160
001, 2019-01-01, 0040, 160
001, 2019-01-01, 0050, 150
002, 2019-01-01, 0010, 151
002, 2019-01-01, 0020, 151
002, 2019-01-01, 0030, 161
002, 2019-01-01, 0040, 162
002, 2019-01-01, 0051, 152
Answer (score: 2)
You can use the lag function over a Window:
Given:
// Assumes an active SparkSession named `spark` (e.g. in spark-shell)
import org.apache.spark.sql.functions.{coalesce, lag, lit}
import org.apache.spark.sql.expressions.Window
import spark.implicits._

val df = Seq(
("001", "2019-01-01", "0010", "150"),
("001", "2019-01-01", "0020", "150"),
("001", "2019-01-01", "0030", "160"),
("001", "2019-01-01", "0040", "160"),
("001", "2019-01-01", "0050", "150"),
("002", "2019-01-01", "0010", "151"),
("002", "2019-01-01", "0020", "151"),
("002", "2019-01-01", "0030", "161"),
("002", "2019-01-01", "0040", "162"),
("002", "2019-01-01", "0051", "152")
).toDF("ID", "DATE", "TIME", "VALUE")
df
  // "change" is true when VALUE differs from the previous row's VALUE within the
  // same ID (ordered by TIME); lag returns null on the first row of each partition,
  // so the comparison is null there and coalesce(..., lit(true)) keeps that row.
  .withColumn("change", coalesce($"VALUE" =!= lag($"VALUE", 1).over(Window.partitionBy($"ID").orderBy($"TIME")), lit(true)))
  .where($"change")
  //.drop($"change")
  .show()
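To sanity-check which rows the filter keeps without spinning up a Spark session, the same "keep the row when VALUE changes per ID" rule can be sketched in plain Scala collections. This is only an illustration of the logic, not the Spark answer itself; the `Row` case class and `keepChanges` helper are names introduced here for the example.

```scala
object KeepOnChange {
  case class Row(id: String, date: String, time: String, value: String)

  // Keep the first row of each ID, plus every row whose VALUE differs from the
  // previous row of the same ID when ordered by TIME. This mirrors
  // lag("VALUE", 1).over(Window.partitionBy("ID").orderBy("TIME")).
  def keepChanges(rows: Seq[Row]): Seq[Row] =
    rows
      .groupBy(_.id)
      .toSeq
      .sortBy(_._1)
      .flatMap { case (_, group) =>
        val sorted = group.sortBy(_.time)
        // head is always kept (lag is null there in the Spark version);
        // sliding(2) pairs each row with its predecessor to detect changes.
        sorted.headOption.toSeq ++
          sorted.sliding(2).collect {
            case Seq(prev, cur) if prev.value != cur.value => cur
          }
      }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Row("001", "2019-01-01", "0010", "150"),
      Row("001", "2019-01-01", "0020", "150"),
      Row("001", "2019-01-01", "0030", "160"),
      Row("001", "2019-01-01", "0040", "160"),
      Row("001", "2019-01-01", "0050", "150"),
      Row("002", "2019-01-01", "0010", "151"),
      Row("002", "2019-01-01", "0020", "151"),
      Row("002", "2019-01-01", "0030", "161"),
      Row("002", "2019-01-01", "0040", "162"),
      Row("002", "2019-01-01", "0051", "152")
    )
    keepChanges(rows).foreach(r => println(s"${r.id},${r.date},${r.time},${r.value}"))
  }
}
```

On the question's sample data this keeps seven rows: for ID 001 the rows at TIME 0010, 0030, and 0050, and for ID 002 the rows at TIME 0010, 0030, 0040, and 0051, matching what the Spark `.where($"change")` filter produces.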