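Step 1
I need to compare two CSV files: one is static (DB.csv) and the other is downloaded from the web (Downloaded.csv). The downloaded file is dynamic and may contain updated records.
Step 2
After comparing the two CSV files, the differences are written to MongoDB.
Step 3
Downloaded.csv then needs to replace DB.csv, and the same logic from Step 1 continues.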
Example
Step 1
DB.csv [temp table `db` ]
sno APPLE BANANA
1 13 11
2 2 22
3 2 22
Downloaded.csv [temp table `downloaded` ]
sno APPLE BANANA
1 n 11
2 2 22
3 2 22
Step 2
Difference dataset
sno APPLE BANANA
1 n 11
Step 3
DB.csv [temp table `db` - updated ]
sno APPLE BANANA
1 n 11
2 2 22
3 2 22
Repeat Step 1
DB.csv [temp table `db` - updated ]
sno APPLE BANANA
1 n 11
2 2 22
3 2 22
Downloaded.csv [temp table `downloaded` - new downloaded record ]
sno APPLE BANANA
1 n 11
2 2 n
3 2 22
Repeat Step 2
Difference dataset
sno APPLE BANANA
2 2 n
Repeat Step 3
DB.csv [temp table `db` ]
sno APPLE BANANA
1 n 11
2 2 n
3 2 22
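The two rounds above can be reproduced without Spark as a quick sanity check of the intended result. The following is a minimal sketch in plain Java, assuming each CSV row is represented as one comma-joined string; the class name `CsvDiffSketch` and the method `diff` are hypothetical names used only for illustration and are not part of Spark or of the code in this question.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class CsvDiffSketch {

    // Rows present in `downloaded` but absent from `db`, which is the same
    // idea as "select * from downloaded EXCEPT select * from db".
    static Set<String> diff(List<String> downloaded, List<String> db) {
        Set<String> result = new LinkedHashSet<>(downloaded);
        result.removeAll(db);
        return result;
    }

    public static void main(String[] args) {
        // Rows from the example tables (sno, APPLE, BANANA), header omitted.
        List<String> db = List.of("1,13,11", "2,2,22", "3,2,22");
        List<String> firstDownload = List.of("1,n,11", "2,2,22", "3,2,22");
        List<String> secondDownload = List.of("1,n,11", "2,2,n", "3,2,22");

        // Round 1: compare, then let the download replace db (Step 3).
        System.out.println(diff(firstDownload, db));  // prints [1,n,11]
        db = firstDownload;

        // Round 2: compare the new download against the updated db.
        System.out.println(diff(secondDownload, db)); // prints [2,2,n]
        db = secondDownload;
    }
}
```

The key point is that `db` holds the rows of the previous download when the next comparison runs, which is the behaviour the Spark code is expected to match.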
Here is my logic:
// Load both CSV files and register them as temp views
Dataset<Row> downloaded = spark.read().option("header", "true").csv("/home/exa4/Desktop/downloaded.csv");
Dataset<Row> db = spark.read().option("header", "true").csv("/home/exa4/Desktop/db.csv");
downloaded.createOrReplaceTempView("downloaded");
db.createOrReplaceTempView("db");
// Rows that are in downloaded but not in db
Dataset<Row> diffs = spark.sql("select * from downloaded EXCEPT select * from db");
// Write the updates to the collection
MongoSpark.save(diffs.write().option("collection", "UpdatedRecords").mode("overwrite"));
// Replace the old DB with the newly downloaded dataset
downloaded.createOrReplaceTempView("db");
// Every 10 seconds I may intentionally change downloaded.csv for testing, as it is a dynamic dataset
while (true) {
    long start = System.currentTimeMillis();
    Thread.sleep(10000);
    // This will be the newly downloaded file from the net
    // (reassigned here; redeclaring the variable would not compile in the same method)
    downloaded = spark.read().option("header", "true").csv("/home/exa4/Desktop/downloaded.csv");
    downloaded.createOrReplaceTempView("downloaded");
    // Now compare downloaded with the previously updated dataset
    Dataset<Row> diffs_ = spark.sql("select * from downloaded EXCEPT select * from db");
    diffs_.show();
    // HERE I AM GETTING NULL RECORDS
    downloaded.createOrReplaceTempView("db");
}
Answer 0 (score: 1)
spark.catalog.refreshTable(s"$dbName.$destinationTableName")
Replace dbName and destinationTableName with your database name and table name.
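One possible reading of the empty result, offered here as an assumption rather than a confirmed diagnosis, is that a temp view stores a recipe for reading the file instead of the rows themselves. If `db` and `downloaded` both resolve to the same downloaded.csv at query time, the EXCEPT has nothing left to return. The sketch below imitates this in plain Java (it is an analogy, not Spark code): a view is modelled as a `Supplier` that is evaluated only when queried. The class `LazyViewSketch` and the method `diffAfterUpdate` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

class LazyViewSketch {

    // Registers "db", changes the file, registers "downloaded", then returns
    // the rows of "downloaded" that are not in "db".
    static Set<String> diffAfterUpdate(boolean snapshotDb) {
        // Stand-in for downloaded.csv; its content changes over time.
        List<String> file = new ArrayList<>(List.of("1,n,11", "2,2,22", "3,2,22"));
        // Stand-in for temp views: name -> recipe evaluated at query time.
        Map<String, Supplier<List<String>>> views = new HashMap<>();

        if (snapshotDb) {
            List<String> copy = List.copyOf(file); // rows are materialized now
            views.put("db", () -> copy);
        } else {
            views.put("db", () -> file);           // rows are re-read at query time
        }

        file.set(1, "2,2,n");                      // the downloaded file is updated
        views.put("downloaded", () -> file);

        Set<String> rows = new LinkedHashSet<>(views.get("downloaded").get());
        rows.removeAll(views.get("db").get());
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(diffAfterUpdate(false)); // prints []
        System.out.println(diffAfterUpdate(true));  // prints [2,2,n]
    }
}
```

Under that assumption, the comparison only yields the changed row when the previous state is kept somewhere the next download cannot overwrite, for example a separate copy of the earlier file. How best to do that in Spark depends on the setup and is not covered by this sketch.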