Question

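I've been using dplyr::all_equal to try to identify differences between datasets. When the datasets are not equal, I don't always understand the output.

I generated a few small tibbles and ran simple comparisons between them to work out what the output means, but the varying results confuse me. The documentation did not give me a satisfactory explanation: beyond the row positions, it does not describe how the result should be read as a statement about the differences. The examples in the documentation do not really cover this case either.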

library(tidyverse)
set.seed(123)

df1 <- as_tibble(rpois(4, 2))
df2 <- as_tibble(rpois(4, 2))
df3 <- as_tibble(rpois(4, 2))
df4 <- as_tibble(rpois(4, 2))

df1
#> # A tibble: 4 x 1
#>   value
#>   <int>
#> 1     1
#> 2     3
#> 3     2
#> 4     4
df2
#> # A tibble: 4 x 1
#>   value
#>   <int>
#> 1     4
#> 2     0
#> 3     2
#> 4     4
df3
#> # A tibble: 4 x 1
#>   value
#>   <int>
#> 1     2
#> 2     2
#> 3     5
#> 4     2
df4
#> # A tibble: 4 x 1
#>   value
#>   <int>
#> 1     3
#> 2     2
#> 3     0
#> 4     4

all_equal(df1, df2)
#> [1] "Rows in x but not y: 1, 2. Rows in y but not x: 2. Rows with difference occurences in x and y: 4"
all_equal(df1, df4)
#> [1] "Rows in x but not y: 1. Rows in y but not x: 3. "
all_equal(df1, df3)
#> [1] "Rows in x but not y: 1, 2, 4. Rows in y but not x: 3. Rows with difference occurences in x and y: 3"
all_equal(df2, df3)
#> [1] "Rows in x but not y: 2, 1. Rows in y but not x: 3. Rows with difference occurences in x and y: 3"
all_equal(df2, df4)
#> [1] "Rows in y but not x: 1. Rows with difference occurences in x and y: 1"

Created on 2019-06-26 by the reprex package (v0.2.1)

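Based on the output above, if someone asked me "how many observations differ between the two sets", my answer would be the largest count of rows listed under "Rows in __ but not __: number". So, for example, I would say:

"There are 3 observations that differ between df1 and df3."

Is this the right idea? Also, I don't know how to interpret the "Rows with difference occurences in x and y: number" part. In all_equal(df1, df2) there are two observations that differ between the sets, yet the message points at row 4, where the entries are the same.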

Answer 1

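I recently had to do something similar for double data entry, and I used base R. It is not exactly what you asked for, but I hope it helps. For a single pair this can be simplified (e.g., mapply(`==`, df1, df2)), but since you mentioned 4 data frames, I wrote the answer so that it scales to a larger number of them. The code below compares each pair of data frames row by row for equality. Keep in mind that this solution depends on row order (unlike all_equal), and if your data frames do not have the same number of columns and rows, you will need to adjust them before it works. Good luck!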

library(tidyverse)
set.seed(123)

df1 <- as_tibble(rpois(4, 2))
df2 <- as_tibble(rpois(4, 2))
df3 <- as_tibble(rpois(4, 2))
df4 <- as_tibble(rpois(4, 2))

# Making a list of your dataframes 
df_list <- mget(ls(pattern = "df\\d"))

# Creating indices for the comparison (from df_list)
indices <- combn(seq_along(df_list), 2, simplify = FALSE)

# Comparing all elements of the df_list
comparisons <- lapply(indices, function(x) mapply(`==`, df_list[x[1]], df_list[x[2]]))

# Cleaning up names
names(comparisons) <- sapply(indices, paste, collapse = " vs ")

head(comparisons, 2)
$`1 vs 2`
       df1
[1,] FALSE
[2,] FALSE
[3,]  TRUE
[4,]  TRUE

$`1 vs 3`
       df1
[1,] FALSE
[2,] FALSE
[3,] FALSE
[4,] FALSE

# Now, summarise it however you like, e.g.: Pct agreement
sapply(comparisons, mean)
1 vs 2 1 vs 3 1 vs 4 2 vs 3 2 vs 4 3 vs 4 
  0.50   0.00   0.25   0.00   0.25   0.25
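
If a count of mismatched rows is easier to read than a percentage, the same comparisons can be summarised that way. The following is only a minimal sketch: it hard-codes the values from the printed tibbles in the question instead of relying on the random seed, uses plain data frames, and compares each pair directly with `[[` rather than through mapply. The name n_mismatch is made up for illustration.

```r
# Values copied from the tibbles printed in the question
df_list <- list(
  df1 = data.frame(value = c(1L, 3L, 2L, 4L)),
  df2 = data.frame(value = c(4L, 0L, 2L, 4L)),
  df3 = data.frame(value = c(2L, 2L, 5L, 2L)),
  df4 = data.frame(value = c(3L, 2L, 0L, 4L))
)

# All pairs of positions in df_list
indices <- combn(seq_along(df_list), 2, simplify = FALSE)

# Position-by-position comparison of each pair (order dependent)
comparisons <- lapply(indices, function(x) df_list[[x[1]]] == df_list[[x[2]]])
names(comparisons) <- sapply(indices, paste, collapse = " vs ")

# n_mismatch (hypothetical name): number of rows that differ in each pair
n_mismatch <- sapply(comparisons, function(m) sum(!m))
n_mismatch
```

Like the percentage above, this count is position based, so it will generally not agree with the row lists that all_equal reports.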

Edit: the solution above is similar to using all_equal(df, df, ignore_col_order = FALSE, ignore_row_order = FALSE)
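
For an order-independent view that is closer to what all_equal prints by default, one option is to tally how often each value occurs in each data frame and compare the tallies. This is a rough sketch under assumptions: compare_tallies is a hypothetical helper (not part of dplyr), it only handles a single column named value, and it reports values, whereas all_equal reports row positions. The interpretation it suggests is a reading that happens to be consistent with the five outputs in the question; I have not confirmed it against the dplyr source.

```r
# Values copied from the tibbles printed in the question
df1 <- data.frame(value = c(1L, 3L, 2L, 4L))
df2 <- data.frame(value = c(4L, 0L, 2L, 4L))

# compare_tallies() is a hypothetical helper, not a dplyr function
compare_tallies <- function(x, y) {
  tx <- table(x$value)  # how often each value occurs in x
  ty <- table(y$value)  # how often each value occurs in y
  shared <- intersect(names(tx), names(ty))
  list(
    only_in_x     = setdiff(names(tx), names(ty)),
    only_in_y     = setdiff(names(ty), names(tx)),
    # values present in both, but a different number of times
    count_differs = shared[as.vector(tx[shared]) != as.vector(ty[shared])]
  )
}

res <- compare_tallies(df1, df2)
res
```

For df1 and df2 this gives the values 1 and 3 only in x, the value 0 only in y, and the value 4 with differing counts (once in df1, twice in df2). In df1 the values 1 and 3 sit in rows 1 and 2, the value 0 sits in row 2 of df2, and the value 4 sits in row 4 of df1, which lines up with "Rows in x but not y: 1, 2. Rows in y but not x: 2. Rows with difference occurences in x and y: 4". That would also explain why row 4 is flagged even though the entries look the same: the value matches, but it does not occur the same number of times in both data frames.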
