Question

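I am working with several datasets in R. Each dataset has up to 16 columns and 1000 records. I am trying to find a way to compare two datasets at a time so that I can find the records that were deleted, updated, or added. I will use the ID column and the color column to identify the differences. Below is a small sample (not all columns are included):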

df1 <- data.frame(ID = letters[1:5], color = c("blue", "white", "red", "green", "blue"))

df2 <- data.frame(ID = c("a","c","d","d"), color = c("blue", "yellow", "green", "blue"))

ID will be the common key between the datasets.

I need to compare the datasets to obtain three different sets of values:

New records: records that appear in df1 but not in df2. So I should get:

ID  Color
b   white
c   red
e   blue

Deleted records: records that do not appear in df1 but do appear in df2:

   ID    Color
    c     yellow
    d     blue

Updated records: this is the one I need most. Basically, anything with the same ID but a different color:

   ID  df1color  df2color
    c   red       yellow

I tried using the joins from the dplyr package, but without success. Is there a way to do this in R?

Answer 1

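One problem you may be running into is that data.frame() can mix up character and factor variables behind the scenes (before R 4.0.0 it converted strings to factors by default). Take a look at the str() of your data frames. It is better to use tibble() instead, which you can get from either the dplyr or the tibble package.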
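A quick way to see the difference, as a minimal sketch: `stringsAsFactors` is set explicitly here because its default depends on the R version, and the names `df_fac` and `df_chr` are only illustrative.

```r
# Same data, built once with factors and once with plain character columns
df_fac <- data.frame(ID = letters[1:5], stringsAsFactors = TRUE)
df_chr <- data.frame(ID = letters[1:5], stringsAsFactors = FALSE)

# str() reveals the column type that data.frame() actually created
str(df_fac)
str(df_chr)

is_factor_col <- is.factor(df_fac$ID)       # factor column
is_character_col <- is.character(df_chr$ID) # character column
```

Joining a factor column against a character column, or two factors with different levels, is a likely source of surprises, so it is worth confirming the types before comparing.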

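Then, following the link @Stedy posted, you can use dplyr's anti_join() to handle the first two problems. The last one can be done by applying inner_join() to the data frame of new records and then using filter() to find the changes. See the example below: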

library(dplyr)

df1 <- tibble(ID = c(letters[1:5]), color = c("blue", "white", "red", "green", "blue"))
df2 <- tibble(ID = c("a","c","d","d"), color = c("blue", "yellow", "green", "blue"))

# New Records
anti_join(df1, df2)
#> # A tibble: 3 x 2
#>      ID color
#>   <chr> <chr>
#> 1     e  blue
#> 2     c   red
#> 3     b white

# Deleted records (simply swap arguments around)
anti_join(df2, df1)
#> # A tibble: 2 x 2
#>      ID  color
#>   <chr>  <chr>
#> 1     d   blue
#> 2     c yellow

# Updated records
new_records <- anti_join(df1, df2)
inner_join(new_records, df2, by = "ID", suffix = c(".df1", ".df2")) %>%
  filter(color.df1 != color.df2)
#> # A tibble: 1 × 3
#>      ID color.df1 color.df2
#>   <chr>     <chr>     <chr>
#> 1     c       red    yellow
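A possible variation on the example above, offered as a sketch rather than a correction: passing `by` explicitly avoids the "Joining, by = ..." message and makes it obvious which columns decide a match. It also joins df1 directly to df2 on ID instead of going through `new_records`. Note that this version flags ID d as well, because df2 holds d twice with two different colors; whether that should count as an update is an assumption you would need to settle for your own data. The name `updated_all` is illustrative.

```r
library(dplyr)

df1 <- tibble(ID = letters[1:5], color = c("blue", "white", "red", "green", "blue"))
df2 <- tibble(ID = c("a", "c", "d", "d"), color = c("blue", "yellow", "green", "blue"))

# Rows of df1 with no exact (ID, color) match in df2
new_records <- anti_join(df1, df2, by = c("ID", "color"))

# Rows of df2 with no exact (ID, color) match in df1
deleted_records <- anti_join(df2, df1, by = c("ID", "color"))

# Every pairing that shares an ID but differs in color
updated_all <- inner_join(df1, df2, by = "ID", suffix = c(".df1", ".df2")) %>%
  filter(color.df1 != color.df2)
```

Going through `new_records` first, as the answer does, keeps only c, since the row (d, green) exists unchanged in df2 and is therefore not a new record.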

Answer 2

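I think there may be some problems with your question. For example, the IDs in df2 are a, c, d, and d, all of which are ID values in df1. So shouldn't the deleted-records matrix be empty?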
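That observation can be checked directly; a minimal sketch using the sample data from the question (the names `all_present` and `missing_ids` are illustrative):

```r
df1 <- data.frame(ID = letters[1:5],
                  color = c("blue", "white", "red", "green", "blue"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = c("a", "c", "d", "d"),
                  color = c("blue", "yellow", "green", "blue"),
                  stringsAsFactors = FALSE)

# Is every ID in df2 also present in df1?
all_present <- all(df2$ID %in% df1$ID)

# IDs that occur in df2 but not in df1
missing_ids <- setdiff(df2$ID, df1$ID)
```

When matching on ID alone, nothing in df2 is missing from df1; the question's "deleted" rows only appear when ID and color are compared together.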

Anyway, I put together a script that may be what you are after. If it isn't, let me know and I'll try again.

df1 <- data.frame(ID = letters[1:5], color = c("blue", "white", "red", "green", "blue"))

df2 <- data.frame(ID = c("a","c","d","d"), color = c("blue", "yellow", "green", "blue"))

df1=as.matrix(df1)
df2=as.matrix(df2)

##########################
## find the new records ## 
##########################

## define a new record matrix
n.r = matrix(NA,nrow=nrow(df1),ncol=nrow(df2))

## loop over the rows in the new matrix
i=1
while(i<=nrow(n.r)) {
    n <- df1[i,1]==df2[,1]
    n.r[i,] <- n

    i=i+1
}

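## these are your new records: rows of df1 whose ID matches nothing in df2
## (rowSums() avoids df1[-integer(0), ], which drops every row when nothing matches)
df1[rowSums(n.r) == 0, , drop = FALSE]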

##############################    
## find the deleted records ##
##############################

## define a deleted records matrix
d.r = matrix(NA,ncol=nrow(df1),nrow=nrow(df2))

## loop over the rows in the deleted matrix
i=1
while(i<=nrow(d.r)) {
    d <- df2[i,1]==df1[,1]
    d.r[i,] <- d

    i=i+1
}

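## these are your deleted records: rows of df2 whose ID matches nothing in df1
## (rowSums() avoids df2[-integer(0), ], which drops every row when nothing matches)
df2[rowSums(d.r) == 0, , drop = FALSE]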

##############################
## find the updated records ##
##############################

## define the same matrix
s.m <- which(n.r==TRUE,arr.ind=TRUE)
## consider the ith row of the same matrix (s.m[i,])
## s.m shows that df1[s.m[i,1],1] == df2[s.m[i,2],1] 

## now define a updated record matrix
u.r <- rep(NA,nrow(s.m))

i=1
while(i<=nrow(s.m)) {
    u.r[i] <- df1[s.m[i,1],2] == df2[s.m[i,2],2]

    i=i+1
}

## these are your updated records
## (drop = FALSE keeps a single changed row as a one-row matrix)
changed <- which(u.r == FALSE)
cbind(df1[s.m[changed, 1], , drop = FALSE], df2[s.m[changed, 2], 2])

Note that 'd' appears TWICE in df2, and only one of those rows is updated (or different) relative to df1. This may need to be modified to suit your needs/goals.
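The loops above can also be written with vectorized base R. The following is a minimal sketch that, like this answer, matches on ID only; the names `new_by_id`, `deleted_by_id`, `both`, and `updated` are illustrative.

```r
df1 <- data.frame(ID = letters[1:5],
                  color = c("blue", "white", "red", "green", "blue"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = c("a", "c", "d", "d"),
                  color = c("blue", "yellow", "green", "blue"),
                  stringsAsFactors = FALSE)

# IDs present in df1 but absent from df2
new_by_id <- df1[!(df1$ID %in% df2$ID), ]

# IDs present in df2 but absent from df1
deleted_by_id <- df2[!(df2$ID %in% df1$ID), ]

# Pair up rows that share an ID, then keep the pairs whose colors differ
both <- merge(df1, df2, by = "ID", suffixes = c(".df1", ".df2"))
updated <- both[both$color.df1 != both$color.df2, ]
```

Because d occurs twice in df2, `merge()` pairs the single d row of df1 with both of them, and only the pairing whose colors differ ends up in `updated`.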

Comparing tables to find updated/deleted/new records
