Question

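Step 1
I need to compare two CSV files. One is static (DB.csv); the other, Downloaded.csv, is downloaded from the web (it is dynamic and may contain updated records).

Step 2
After comparing the two CSV files, the differences are written to MongoDB.

Step 3
Downloaded.csv then needs to replace DB.csv, and the same logic from Step 1 continues.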

Example

Step 1

DB.csv   [temp table `db` ]

sno APPLE   BANANA
1   13  11
2   2   22
3   2   22

Downloaded.csv [temp table `downloaded` ]
sno APPLE   BANANA
1   n   11
2   2   22
3   2   22

Step 2

Difference dataset
sno APPLE   BANANA
1     n       11

Step 3

DB.csv [temp table `db` - updated ]
sno APPLE   BANANA
 1  n   11
 2  2   22
 3  2   22

Repeat Step 1

DB.csv [temp table `db` - updated ]
sno APPLE   BANANA
 1  n   11
 2  2   22
 3  2   22

Downloaded.csv [temp table `downloaded` - new downloaded record ]
sno APPLE   BANANA
1   n   11
2   2   n
3   2   22

Repeat Step 2

Difference dataset
sno APPLE   BANANA
2     2       n

Repeat Step 3

DB.csv [temp table `db` ]
sno APPLE   BANANA
 1  n   11
 2  2   n
 3  2   22
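
For illustration only, the three-step cycle above can be sketched in plain Java without Spark. The class name `CsvDiffCycle`, the helper `except`, and the comma-joined row strings are assumptions made for this sketch; the helper only mimics what `EXCEPT` does on whole rows.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class CsvDiffCycle {

    // Rows that are in `downloaded` but not in `db`, compared as whole rows.
    // Like SQL EXCEPT, duplicate rows are collapsed in the result.
    static List<String> except(List<String> downloaded, List<String> db) {
        Set<String> result = new LinkedHashSet<>(downloaded);
        result.removeAll(db);
        return new ArrayList<>(result);
    }

    public static void main(String[] args) {
        // Step 1: the static DB.csv and the first download.
        List<String> db = List.of("1,13,11", "2,2,22", "3,2,22");
        List<String> downloaded = List.of("1,n,11", "2,2,22", "3,2,22");

        // Step 2: the difference dataset.
        System.out.println(except(downloaded, db)); // prints [1,n,11]

        // Step 3: the download becomes the new DB.
        db = downloaded;

        // Repeat with a newer download.
        downloaded = List.of("1,n,11", "2,2,n", "3,2,22");
        System.out.println(except(downloaded, db)); // prints [2,2,n]
        db = downloaded;
    }
}
```

The point of the sketch is that `db` must hold the previous download's rows at the moment the next comparison runs.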

This is my logic:

    Dataset<Row> downloaded = spark.read().option("header", "true").csv("/home/exa4/Desktop/downloaded.csv");
    Dataset<Row> db = spark.read().option("header", "true").csv("/home/exa4/Desktop/db.csv");
    downloaded.createOrReplaceTempView("downloaded");
    db.createOrReplaceTempView("db");

    Dataset<Row> diffs = spark.sql("select * from downloaded EXCEPT select * from db");

    // write updates to collection
    MongoSpark.save(diffs.write().option("collection", "UpdatedRecords").mode("overwrite"));

    // replace the old DB with the newly downloaded dataset
    downloaded.createOrReplaceTempView("db");

    // Every 10 seconds I may intentionally change downloaded.csv for testing, as it is a dynamic dataset
    while (true) {
        Thread.sleep(10000);

        // this will be the newly downloaded file from the net
        // (reassigned rather than redeclared, since `downloaded` already exists in this scope)
        downloaded = spark.read().option("header", "true").csv("/home/exa4/Desktop/downloaded.csv");
        downloaded.createOrReplaceTempView("downloaded");

        // now compare downloaded with the previously updated dataset
        Dataset<Row> diffs_ = spark.sql("select * from downloaded EXCEPT select * from db");
        diffs_.show();
        // HERE I AM GETTING NULL RECORDS

        downloaded.createOrReplaceTempView("db");
    }

Answer 1

spark.catalog.refreshTable(s"$dbName.$destinationTableName")

Replace `dbName` and `destinationTableName` with your database name and table name.
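
As a rough model of why a refresh or a fresh snapshot may matter here: a temp view is only a name for a lazily evaluated query, so if both `downloaded` and `db` end up naming a read of the same file, the two sides of `EXCEPT` can see identical data. The sketch below is plain Java, not Spark API; `LazyViewSketch`, `views`, `file`, and `except` are hypothetical names, and the suppliers are an assumed stand-in for lazy datasets.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

public class LazyViewSketch {

    // Stand-in for downloaded.csv on disk; its contents change over time.
    static List<String> file = List.of("1,n,11", "2,2,22", "3,2,22");

    // Stand-in for the temp-view registry: a name mapped to a lazy read.
    static final Map<String, Supplier<List<String>>> views = new HashMap<>();

    // Evaluates both views now and returns the rows of `left` missing from `right`.
    static List<String> except(String left, String right) {
        Set<String> rows = new LinkedHashSet<>(views.get(left).get());
        rows.removeAll(views.get(right).get());
        return new ArrayList<>(rows);
    }

    public static void main(String[] args) {
        List<String> newer = List.of("1,n,11", "2,2,n", "3,2,22");

        // Both names point at a lazy read of the same file.
        views.put("downloaded", () -> file);
        views.put("db", () -> file);
        file = newer; // the file changes before the query runs
        System.out.println(except("downloaded", "db")); // prints []

        // Same scenario, but "db" holds a materialised copy of the old rows.
        file = List.of("1,n,11", "2,2,22", "3,2,22");
        List<String> snapshot = List.copyOf(file);
        views.put("db", () -> snapshot);
        file = newer;
        System.out.println(except("downloaded", "db")); // prints [2,2,n]
    }
}
```

Whether this is exactly what happens in the question's job is an assumption; the model only shows that the comparison needs the old rows to be held independently of the file being re-downloaded.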

Spark createOrReplaceTempView does not update the existing table multiple times
