Finding the differences between two tables

Time: 2015-01-20 23:03:32

Tags: r dataframe

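I am coming to R from a SAS / SQL background, and I am trying to write code that takes two tables, compares them, and produces a list of the differences. This code will be reused for many different pairs of tables, so I need to avoid hard-coding.

I have been working from "Identifying specific differences between two data sets in R", but it doesn't get me all the way there.

Sample data, using the combination of LastName / FirstName (unique) as the key -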

Dataset One --

Last_Name  First_Name  Street_Address   ZIP     VisitCount
Doe        John        1234 Main St     12345   20
Doe        Jane        4321 Tower St    54321   10
Don        Bob         771  North Ave   23232   5
Smith      Mike        732 South Blvd.  77777   3        

Dataset Two --

Last_Name  First_Name  Street_Address   ZIP     VisitCount
Doe        John        1234 Main St     12345   20
Doe        Jane        4111 Tower St    32132   17
Donn       Bob         771  North Ave   11111   5

   Desired Output --

   LastName FirstName VarName         TableOne        TableTwo
   Doe      Jane      StreetAddress   4321 Tower St   4111 Tower St 
   Doe      Jane      Zip             23232           32132
   Doe      Jane      VisitCount      5               17

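Note that this output ignores records that do not have the same ID in both tables (for example, because Bob's last name is "Don" in one table and "Donn" in the other, we ignore that record entirely).

I have explored doing this by applying the melt function to both data sets and then comparing them, but the size of the data I am working with suggests that this isn't practical. In SAS I used Proc Compare for this kind of work, but I haven't found an exact equivalent in R.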
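For reference, the comparison described above can be sketched in base R without melting and without extra packages. This is only an illustration under stated assumptions: `compare_tables` is a hypothetical helper name, the key columns are passed in as a character vector, both tables are assumed to share the same non-key columns, and values are compared as character strings.

```r
# Hypothetical helper (not from any package): list cell-level differences
# between two data frames that share the same key and value columns.
compare_tables <- function(one, two, keys) {
  # Inner join on the keys; rows present in only one table are dropped.
  both <- merge(one, two, by = keys, suffixes = c(".one", ".two"))
  vars <- setdiff(names(one), keys)
  pieces <- lapply(vars, function(v) {
    a <- as.character(both[[paste0(v, ".one")]])
    b <- as.character(both[[paste0(v, ".two")]])
    # NA vs NA counts as equal; NA vs a value counts as a difference.
    differs <- !((is.na(a) & is.na(b)) | (!is.na(a) & !is.na(b) & a == b))
    data.frame(both[differs, keys, drop = FALSE],
               VarName = rep(v, sum(differs)),
               TableOne = a[differs], TableTwo = b[differs],
               stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# The sample tables from the question.
d1 <- data.frame(
  Last_Name = c("Doe", "Doe", "Don", "Smith"),
  First_Name = c("John", "Jane", "Bob", "Mike"),
  Street_Address = c("1234 Main St", "4321 Tower St", "771  North Ave", "732 South Blvd."),
  ZIP = c(12345L, 54321L, 23232L, 77777L),
  VisitCount = c(20L, 10L, 5L, 3L),
  stringsAsFactors = FALSE
)
d2 <- data.frame(
  Last_Name = c("Doe", "Doe", "Donn"),
  First_Name = c("John", "Jane", "Bob"),
  Street_Address = c("1234 Main St", "4111 Tower St", "771  North Ave"),
  ZIP = c(12345L, 32132L, 11111L),
  VisitCount = c(20L, 17L, 5L),
  stringsAsFactors = FALSE
)

res <- compare_tables(d1, d2, keys = c("Last_Name", "First_Name"))
print(res)
```

Because the key columns are an argument, the same function can be pointed at other table pairs. Whether it is fast enough for large data would need to be checked on the real tables.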

3 Answers:

Answer 0: (score: 8)

Here is a solution based on data.table:

library(data.table)

# d1 and d2 are the input data sets listed further down.
# Convert to data.table and melt: one row per key + variable
setDT(d1)
d1 <- d1[, list(VarName = names(.SD), TableOne = unlist(.SD, use.names = FALSE)), by = c('Last_Name', 'First_Name')]

setDT(d2)
d2 <- d2[, list(VarName = names(.SD), TableTwo = unlist(.SD, use.names = FALSE)), by = c('Last_Name', 'First_Name')]

# Set keys for merging
setkey(d1, Last_Name, First_Name, VarName)

# Inner join on the keys, then keep only the rows whose values differ
d1[d2, nomatch = 0][TableOne != TableTwo]

#     Last_Name First_Name        VarName      TableOne      TableTwo
#     1:       Doe       Jane Street_Address 4321 Tower St 4111 Tower St
#     2:       Doe       Jane            ZIP         54321         32132
#     3:       Doe       Jane     VisitCount            10            17
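
The by-group `unlist(.SD)` step above reshapes the data one key at a time. As an alternative sketch (not part of the original answer), the same long format can be produced with `data.table::melt()`, with the key columns passed as a character vector so nothing is hard-coded. The helper names `to_long` and `table_diffs` are made up for illustration, and all non-key columns are converted to character first so that mixed column types melt cleanly. Note that values that are `NA` on either side are not reported as differences here.

```r
library(data.table)

# Hypothetical helper: reshape one table to long format
# (one row per key combination and variable).
to_long <- function(df, keys, value_name) {
  dt <- copy(as.data.table(df))   # explicit copy so the caller's table is never modified
  vars <- setdiff(names(dt), keys)
  dt[, (vars) := lapply(.SD, as.character), .SDcols = vars]
  melt(dt, id.vars = keys, measure.vars = vars,
       variable.name = "VarName", value.name = value_name,
       variable.factor = FALSE)
}

# Hypothetical helper: inner join the two long tables and keep differing values.
table_diffs <- function(one, two, keys) {
  long1 <- to_long(one, keys, "TableOne")
  long2 <- to_long(two, keys, "TableTwo")
  merged <- merge(long1, long2, by = c(keys, "VarName"))
  merged[TableOne != TableTwo]
}

# The sample tables from the question.
t1 <- data.frame(
  Last_Name = c("Doe", "Doe", "Don", "Smith"),
  First_Name = c("John", "Jane", "Bob", "Mike"),
  Street_Address = c("1234 Main St", "4321 Tower St", "771  North Ave", "732 South Blvd."),
  ZIP = c(12345L, 54321L, 23232L, 77777L),
  VisitCount = c(20L, 10L, 5L, 3L),
  stringsAsFactors = FALSE
)
t2 <- data.frame(
  Last_Name = c("Doe", "Doe", "Donn"),
  First_Name = c("John", "Jane", "Bob"),
  Street_Address = c("1234 Main St", "4111 Tower St", "771  North Ave"),
  ZIP = c(12345L, 32132L, 11111L),
  VisitCount = c(20L, 17L, 5L),
  stringsAsFactors = FALSE
)

diffs <- table_diffs(t1, t2, keys = c("Last_Name", "First_Name"))
print(diffs)
```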

The input data sets are:

# Input Data Sets
d1 <- structure(list(Last_Name = c("Doe", "Doe", "Don", "Smith"), First_Name = c("John", 
"Jane", "Bob", "Mike"), Street_Address = c("1234 Main St", "4321 Tower St", 
"771  North Ave", "732 South Blvd."), ZIP = c(12345L, 54321L, 
23232L, 77777L), VisitCount = c(20L, 10L, 5L, 3L)), .Names = c("Last_Name", 
"First_Name", "Street_Address", "ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -4L))                                                                                                               

d2 <- structure(list(Last_Name = c("Doe", "Doe", "Donn"), First_Name = c("John", 
"Jane", "Bob"), Street_Address = c("1234 Main St", "4111 Tower St", 
"771  North Ave"), ZIP = c(12345L, 32132L, 11111L), VisitCount = c(20L, 
17L, 5L)), .Names = c("Last_Name", "First_Name", "Street_Address", 
"ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -3L))

Answer 1: (score: 6)

dplyr and tidyr work well here. First, a slightly reduced data set:

dat1 <- data.frame(Last_Name = c('Doe', 'Doe', 'Don', 'Smith'),
                   First_Name = c('John', 'Jane', 'Bob', 'Mike'),
                   ZIP = c(12345, 54321, 23232, 77777),
                   VisitCount = c(20, 10, 5, 3),
                   stringsAsFactors = FALSE)
dat2 <- data.frame(Last_Name = c('Doe', 'Doe', 'Donn'),
                   First_Name = c('John', 'Jane', 'Bob'),
                   ZIP = c(12345, 32132, 11111),
                   VisitCount = c(20, 17, 5),
                   stringsAsFactors = FALSE)

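(Sorry, I didn't want to type it all in. If it matters, please provide a reproducible example with well-defined data structures.)

Also, it looks like your "desired output" is a little off for Jane Doe's ZIP and VisitCount.

Your idea of melting them works well: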

library(dplyr)
library(tidyr)
dat1g <- gather(dat1, key, value, -Last_Name, -First_Name)
dat2g <- gather(dat2, key, value, -Last_Name, -First_Name)
head(dat1g)
##   Last_Name First_Name        key value
## 1       Doe       John        ZIP 12345
## 2       Doe       Jane        ZIP 54321
## 3       Don        Bob        ZIP 23232
## 4     Smith       Mike        ZIP 77777
## 5       Doe       John VisitCount    20
## 6       Doe       Jane VisitCount    10

From here, it is deceptively simple:

dat1g %>%
    inner_join(dat2g, by = c('Last_Name', 'First_Name', 'key')) %>%
    filter(value.x != value.y)
##   Last_Name First_Name        key value.x value.y
## 1       Doe       Jane        ZIP   54321   32132
## 2       Doe       Jane VisitCount      10      17
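
Since the code is meant to be reused across many table pairs, the steps above can be wrapped in a function that takes the key columns as an argument. The following is a sketch rather than part of the original answer: `diff_tables` is a hypothetical name, it uses `pivot_longer()` (the newer tidyr counterpart of `gather()`, so it assumes tidyr 1.0 or later and dplyr 1.0 or later), and it converts every non-key column to character so that text and numeric columns can share one value column. As with the pipeline above, values that are `NA` on either side are dropped by `filter()`.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: long-format differences between two data frames.
diff_tables <- function(one, two, keys) {
  # Reshape one table to one row per key combination and variable.
  to_long <- function(df, value_name) {
    df %>%
      mutate(across(-all_of(keys), as.character)) %>%
      pivot_longer(-all_of(keys), names_to = "VarName", values_to = value_name)
  }
  # Inner join drops keys that exist in only one table.
  inner_join(to_long(one, "TableOne"), to_long(two, "TableTwo"),
             by = c(keys, "VarName")) %>%
    filter(TableOne != TableTwo)
}

# The full sample tables from the question, including the text column.
full1 <- data.frame(
  Last_Name = c("Doe", "Doe", "Don", "Smith"),
  First_Name = c("John", "Jane", "Bob", "Mike"),
  Street_Address = c("1234 Main St", "4321 Tower St", "771  North Ave", "732 South Blvd."),
  ZIP = c(12345L, 54321L, 23232L, 77777L),
  VisitCount = c(20L, 10L, 5L, 3L),
  stringsAsFactors = FALSE
)
full2 <- data.frame(
  Last_Name = c("Doe", "Doe", "Donn"),
  First_Name = c("John", "Jane", "Bob"),
  Street_Address = c("1234 Main St", "4111 Tower St", "771  North Ave"),
  ZIP = c(12345L, 32132L, 11111L),
  VisitCount = c(20L, 17L, 5L),
  stringsAsFactors = FALSE
)

out <- diff_tables(full1, full2, keys = c("Last_Name", "First_Name"))
print(out)
```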

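Answer 2: (score: 2)

The dataCompareR package is designed to solve exactly this problem. The package vignette includes some simple examples, and I have used the package to solve the original problem below.

Disclaimer: I was involved in creating this package.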

library(dataCompareR)

d1 <- structure(list(Last_Name = c("Doe", "Doe", "Don", "Smith"), First_Name = c("John", "Jane", "Bob", "Mike"), Street_Address = c("1234 Main St", "4321 Tower St", "771  North Ave", "732 South Blvd."), ZIP = c(12345L, 54321L, 23232L, 77777L), VisitCount = c(20L, 10L, 5L, 3L)), .Names = c("Last_Name", "First_Name", "Street_Address", "ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -4L))                                                                                                               

d2 <- structure(list(Last_Name = c("Doe", "Doe", "Donn"), First_Name = c("John", "Jane", "Bob"), Street_Address = c("1234 Main St", "4111 Tower St", "771  North Ave"), ZIP = c(12345L, 32132L, 11111L), VisitCount = c(20L, 17L, 5L)), .Names = c("Last_Name", "First_Name", "Street_Address", "ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -3L))

compd1d2 <- rCompare(d1, d2, keys = c("First_Name", "Last_Name"))

print(compd1d2)

All columns were compared, 3 row(s) were dropped from comparison
There are  3 mismatched variables:
First and last 5 observations for the  3 mismatched variables
FIRST_NAME LAST_NAME        valueA        valueB       variable     typeA  typeB diffAB
1       Jane       Doe 4321 Tower St 4111 Tower St STREET_ADDRESS character character       
2       Jane       Doe            10            17     VISITCOUNT   integer   integer     -7
3       Jane       Doe         54321         32132            ZIP   integer   integer  22189

To get a more detailed and nicely formatted summary, the user can run

summary(compd1d2)

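The use of FIRST_NAME and LAST_NAME as the "join" between the two tables is controlled by the keys = argument of the rCompare function. In this case, any rows that do not match on these two variables are dropped from the comparison, but you can get more detailed output by using summary.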