Question

I have two dataframes (more than 1 million records each). Only about 10% of the rows differ. I know how to find the delta:

df1.subtract(df2)

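But I would also like to know which records are new and which have changed. I know I can do this with a Hive Context once I have the delta, but maybe there is a simple way based on pyspark functions?

Thanks in advance.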

Answer 1

Just do joins with leftanti and leftsemi. Joining the delta against df2 on the key column(s), leftanti keeps the rows whose key is absent from df2 (new records), and leftsemi keeps the rows whose key is present in df2 (changed records).
