Question

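R stores factors as integers. Therefore, when using the identical() function, two factors that have different levels are not found to be the same, even when their values are.

Here is an MWE: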
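To make the point about integer storage concrete, here is a minimal sketch with two single-element factors. The names f_wide and f_narrow are made up for illustration and do not come from the question.

```r
# Two factors holding the same value but with different level sets
f_wide   <- factor("b", levels = c("a", "b"))
f_narrow <- factor("b")

as.integer(f_wide)    # 2, because "b" is the second level
as.integer(f_narrow)  # 1, because "b" is the only level

identical(f_wide, f_narrow)                              # FALSE
identical(as.character(f_wide), as.character(f_narrow))  # TRUE
```

The printed values look the same, but both the integer codes and the levels attribute differ, which is what identical() picks up.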

y <- structure(list(portfolio_date = structure(c(1L, 1L, 1L, 2L, 2L, 
2L), .Label = c("2000-10-31", "2001-04-30"), class = "factor"), 
security = structure(c(2L, 2L, 1L, 3L, 2L, 4L), .Label = c("Currency Australia (Fwd)", 
"Currency Euro (Fwd)", "Currency Japan (Fwd)", "Currency United Kingdom (Fwd)"
), class = "factor")), .Names = c("portfolio_date", "security"
), row.names = c(10414L, 10417L, 10424L, 21770L, 21771L, 21774L
), class = "data.frame")

x <- structure(list(portfolio_date = structure(1L, .Label = "2000-10-31", class = "factor"), 
security = structure(1L, .Label = "Currency Euro (Fwd)", class = "factor")),
 .Names = c("portfolio_date", "security"), row.names = 10414L, class = "data.frame")

identical(y[1,], x)

This returns FALSE.

But if we look at the objects, they appear identical to the user:

y[1,]
      portfolio_date            security
10414     2000-10-31 Currency Euro (Fwd)

x
      portfolio_date            security
10414     2000-10-31 Currency Euro (Fwd)

Ultimately, I would like to be able to do the following:

apply(y, 1, identical, x)
10414 10417 10424 21770 21771 21774 
TRUE TRUE FALSE FALSE FALSE FALSE 
which(apply(y, 1, identical, x))
1 2

Any suggestions on how to achieve this? Thanks.

Answer 1

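One option is to check row by row using rowwise from dplyr. If you need to compare the row.names as well, you need to create an id column for both data frames; otherwise, it will return TRUE for the first two rows.

library(dplyr)
# keep the row names as an id column so that they take part in the comparison
x$id <- row.names(x)
y$id <- row.names(y)
# compare each row of y with x, ignoring attributes
rowwise(y) %>% do(check = isTRUE(all.equal(., x, check.attributes = FALSE))) %>% data.frame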

  check
1  TRUE
2 FALSE
3 FALSE
4 FALSE
5 FALSE
6 FALSE

Answer 2

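Use the package 'compare'.

library(compare)
result <- NULL
for (i in seq_len(nrow(y))) {
  # compare row i of y with x, dropping unused factor levels
  one <- compare(y[i, ], x, dropLevels = TRUE)
  # the row matches only if both columns compare equal
  two <- one$detailedResult[1] == TRUE & one$detailedResult[2] == TRUE
  result <- c(result, two)
}
as.character(result)  # TRUE  TRUE FALSE FALSE FALSE FALSE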

Answer 3

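Solution for the data posted in the OP

The example posted in the OP can be handled easily with droplevels().

Let's first look at why the comparison identical(y[1,], x) returns FALSE: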

str(y[1,]) 
#'data.frame':  1 obs. of  2 variables:  
#$ portfolio_date: Factor w/ 2 levels "2000-10-31","2001-04-30": 1  
#$ security      : Factor w/ 4 levels "Currency Australia (Fwd)",..: 2

whereas

str(x)
#'data.frame':  1 obs. of  2 variables:
#$ portfolio_date: Factor w/ 1 level "2000-10-31": 1
#$ security      : Factor w/ 1 level "Currency Euro (Fwd)": 1

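So the difference lies in the factor levels, even though both objects are displayed in the same way, as shown in the OP's question.

This is where the function droplevels() is useful: it removes unused factor levels. By applying droplevels() to y[1,] and its redundant levels, we obtain: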

identical(droplevels(y[1,]), x)
#[1] TRUE

If x also contains unused factor levels, it has to be wrapped in droplevels() as well. In any case, doing so does no harm:

identical(droplevels(y[1,]), droplevels(x))
#[1] TRUE
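The single-row check above can be extended to every row of y. The following is only a sketch under stated assumptions: it uses toy data modelled on the question (with default row names), and normalise_row is a hypothetical helper, not part of base R. Row names are reset because identical() also compares them, so rows that differ only in their row name would otherwise be reported as different.

```r
# Toy data modelled on the question (default row names, so only values differ)
y <- data.frame(
  portfolio_date = c("2000-10-31", "2000-10-31", "2000-10-31",
                     "2001-04-30", "2001-04-30", "2001-04-30"),
  security = c("Currency Euro (Fwd)", "Currency Euro (Fwd)",
               "Currency Australia (Fwd)", "Currency Japan (Fwd)",
               "Currency Euro (Fwd)", "Currency United Kingdom (Fwd)"),
  stringsAsFactors = TRUE
)
x <- data.frame(portfolio_date = "2000-10-31",
                security = "Currency Euro (Fwd)",
                stringsAsFactors = TRUE)

# Hypothetical helper: drop unused levels and reset the row names
normalise_row <- function(df) {
  df <- droplevels(df)
  row.names(df) <- NULL
  df
}

# Compare every row of y with the single row of x
same_as_x <- vapply(
  seq_len(nrow(y)),
  function(i) identical(normalise_row(y[i, ]), normalise_row(x)),
  logical(1)
)
which(same_as_x)
```

This still assumes that x has exactly one row and that matching values end up with matching levels after droplevels().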

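General solution

If the real data are more complex than the data posted in the OP's "MWE", using droplevels() may not work. Such cases include, for example, equivalent entries in x and y[1,] that are stored under different factor levels. An example where droplevels() fails is given in the data section at the end of this answer.

The following solution is an efficient way to handle such general cases. It works for the data posted in the OP as well as for the more complex data posted below.

First, create two auxiliary vectors that contain only the characters of each row. Using paste(), we can concatenate each row into a single string: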

temp_x <- apply(x, 1, paste, collapse=",")
temp_y <- apply(y, 1, paste, collapse=",")

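With these vectors, the rows of the original data.frames can be compared easily, even if the entries were originally stored as factors with different levels and numbering.

To determine which rows are identical, we can use the %in% operator, which is more appropriate here than the function identical() because it checks the equality of all possible row combinations, not just individual pairs.

With these simple modifications, the desired output can be obtained quickly without further loops: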

setNames(temp_y %in% temp_x, names(temp_y))
#10414 10417 10424 21770 21771 21774 
# TRUE  TRUE FALSE FALSE FALSE FALSE 
which(temp_y %in% temp_x)
#[1] 1 2
y[temp_y %in% temp_x,]
#      portfolio_date            security
#10414     2000-10-31 Currency Euro (Fwd)
#10417     2000-10-31 Currency Euro (Fwd)
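One caveat with pasting rows together is that a value containing the separator can make two different rows produce the same string. The sketch below illustrates this with made-up data and a hypothetical helper row_key that uses the ASCII unit separator instead of a comma; it assumes that this character never occurs in the data.

```r
# Hypothetical helper: build one comparison key per row from character values
row_key <- function(df, sep = "\x1f") {
  cols <- lapply(df, as.character)  # factor labels, not integer codes
  do.call(paste, c(unname(cols), sep = sep))
}

# Made-up data in which some values contain a comma
y <- data.frame(a = c("p,q", "p", "r"),
                b = c("s", "q,s", "t"),
                stringsAsFactors = TRUE)
x <- data.frame(a = "p,q", b = "s", stringsAsFactors = TRUE)

# With a comma separator, rows 1 and 2 of y collapse to the same string
apply(y, 1, paste, collapse = ",")

# With the unit separator, only row 1 matches x
row_key(y) %in% row_key(x)
```

For data like the OP's, where no value contains a comma, both approaches should give the same answer.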

Data

x <- structure(list(
  portfolio_date = structure(1:2, .Label = c("2000-05-15", "2000-10-31"), class = "factor"),
  security = structure(c(2L, 1L), .Label = c("Currency Euro (Fwd)", "Currency USD (Fwd)"), class = "factor")),
  .Names = c("portfolio_date", "security"), class = "data.frame", row.names = c("10234", "10414"))

y <- structure(list(
  portfolio_date = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("2000-10-31", "2001-04-30"), class = "factor"),
  security = structure(c(2L, 2L, 1L, 3L, 2L, 4L), .Label = c("Currency Australia (Fwd)", "Currency Euro (Fwd)", "Currency Japan (Fwd)", "Currency United Kingdom (Fwd)"), class = "factor")),
  .Names = c("portfolio_date", "security"),
  row.names = c(10414L, 10417L, 10424L, 21770L, 21771L, 21774L), class = "data.frame")

Answer 4

To compare them, the factors need to be converted to character objects. Using base R alone, here is a solution:

apply(apply(y, 2, as.character), 1, identical, apply(x, 2, as.character))

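The inner apply calls convert each column of the source and target data frames to character objects, and the outer apply call iterates over the rows. If the x data frame has more than one row, the actual behaviour may not be as expected.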
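For the case where x has several rows, one possible variation is to compare each row of y against every row of x. This is a sketch with made-up data, assuming all columns are factors or character vectors (as.matrix() may format numeric columns in a way that breaks the comparison) and that there are no missing values.

```r
# Made-up data: x has two rows, the second of which equals row 1 of y
y <- data.frame(
  portfolio_date = c("2000-10-31", "2000-10-31", "2001-04-30"),
  security = c("Currency Euro (Fwd)", "Currency Australia (Fwd)",
               "Currency Euro (Fwd)"),
  stringsAsFactors = TRUE
)
x <- data.frame(
  portfolio_date = c("2000-05-15", "2000-10-31"),
  security = c("Currency USD (Fwd)", "Currency Euro (Fwd)"),
  stringsAsFactors = TRUE
)

# as.matrix() turns factor columns into their character labels
ym <- as.matrix(y)
xm <- as.matrix(x)

# A row of y matches if it equals at least one row of x in every column
matches <- apply(ym, 1, function(yrow) {
  any(apply(xm, 1, function(xrow) all(yrow == xrow)))
})
which(matches)
```

Unlike the one-liner above, this keeps x as a matrix even when it has a single row, so the same code handles both cases.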

Behaviour of the function "identical" with factors
