Question

I have a data table in R that looks like this:

Claim failure Part code matchcode
23     F1     P1    A   B,A,C
23     F1     P1    D   B,A,C
23     F2     P2    D   B,A,C
23     F2     P2    E   B,A,C
45     F1     P4    X   Y,Z,X
45     F1     P4    Y   Y,Z,X
45     F1     P4    A   Y,Z,X
45     F1     P1    F   Y,Z,X
45     F1     P1    H   Y,Z,X

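The idea is that for each group (Claim, failure and Part), if any of the codes in "code" matches the list of codes in the "matchcode" column, the group should get TRUE in a "Match" column.

So the expected output is: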

Claim failure Part Match 
23     F1     P1    TRUE   
23     F2     P2    FALSE  
45     F1     P4    TRUE
45     F1     P1    FALSE

This data.table is large, so I need an efficient solution. Please help :)

Answer 1

...and a solution using data.table.

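library(data.table)
# grepl() is vectorised over x but not over pattern, so wrap it to pair
# each row's code with that row's matchcode.
grepl_v <- Vectorize(grepl)
# One row per group: TRUE if any code in the group is found in its matchcode.
DT[, .(Match = any(grepl_v(code, matchcode))), by = .(Claim, failure, Part)]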
#   Claim failure Part Match
#1:    23      F1   P1  TRUE
#2:    23      F2   P2 FALSE
#3:    45      F1   P4  TRUE
#4:    45      F1   P1 FALSE

Data

DT <- fread("Claim failure Part code matchcode
23     F1     P1    A   B,A,C
23     F1     P1    D   B,A,C
23     F2     P2    D   B,A,C
23     F2     P2    E   B,A,C
45     F1     P4    X   Y,Z,X
45     F1     P4    Y   Y,Z,X
45     F1     P4    A   Y,Z,X
45     F1     P1    F   Y,Z,X
45     F1     P1    H   Y,Z,X")
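
One caveat about the approach above: grepl() treats code as a regular expression and matches substrings, so a code "A" would also be found in a matchcode entry such as "AB,C". If whole-code matching is what is wanted, a minimal base R sketch could split matchcode on commas and test membership instead. The helper name code_in_list is hypothetical (not from the answer), and both inputs are assumed to be character vectors of equal length.

```r
# Hypothetical helper (not part of the original answer): TRUE where `code`
# equals one of the comma-separated tokens in the corresponding `matchcode`.
# Assumes both arguments are character vectors of the same length.
code_in_list <- function(code, matchcode) {
  tokens <- strsplit(matchcode, ",", fixed = TRUE)
  mapply(function(cd, tk) cd %in% trimws(tk), code, tokens, USE.NAMES = FALSE)
}

# "A" is a token of "B,A,C" but only a substring of "AB" in "AB,C".
code_in_list(c("A", "D", "A"), c("B,A,C", "B,A,C", "AB,C"))
```

Under these assumptions, any(code_in_list(code, matchcode)) could stand in for any(grepl_v(code, matchcode)) inside the grouped data.table call.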

Answer 2

Here is a dplyr solution.

library(dplyr)

dat %>%
  rowwise() %>%
  mutate(Match = grepl(code, matchcode)) %>%
  group_by(Claim, failure, Part) %>%
  mutate(Match = any(Match)) %>%
  select(-code, -matchcode) %>%
  unique
## A tibble: 4 x 4
## Groups:   Claim, failure, Part [4]
#  Claim failure Part  Match
#  <int> <fct>   <fct> <lgl>
#1    23 F1      P1    TRUE 
#2    23 F2      P2    FALSE
#3    45 F1      P4    TRUE 
#4    45 F1      P1    FALSE
#Warning message:
#Grouping rowwise data frame strips rowwise nature

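Don't worry about the warning. It only tells you that after grepl() was applied row by row, the pipeline grouped the data by some variables, so processing is no longer row-wise.

Data.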

dat <- read.table(text = "
Claim failure Part code matchcode
23     F1     P1    A   B,A,C
23     F1     P1    D   B,A,C
23     F2     P2    D   B,A,C
23     F2     P2    E   B,A,C
45     F1     P4    X   Y,Z,X
45     F1     P4    Y   Y,Z,X
45     F1     P4    A   Y,Z,X
45     F1     P1    F   Y,Z,X
45     F1     P1    H   Y,Z,X                  
", header = TRUE)

Answer 3

A base R solution:

# Row-level flag only: is this row's code found in this row's matchcode?
d$match <- apply(d, 1, function(x) { return(grepl(x[['code']], x['matchcode']))})

# Claim   failure Part code matchcode match
# 1    23      F1   P1    A     B,A,C     1
# 2    23      F1   P1    D     B,A,C     0
# 3    23      F2   P2    D     B,A,C     0
# 4    23      F2   P2    E     B,A,C     0
# 5    45      F1   P4    X     Y,Z,X     1
# 6    45      F1   P4    Y     Y,Z,X     1
# 7    45      F1   P4    A     Y,Z,X     0
# 8    45      F1   P1    F     Y,Z,X     0
# 9    45      F1   P1    H     Y,Z,X     0

The original version of this answer used grep(); thanks to markus for suggesting grepl().
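
The base R code above flags individual rows, while the question asks for one value per (Claim, failure, Part) group. A minimal sketch of the extra aggregation step, assuming the sample data from the question and using mapply() with fixed = TRUE (a swapped-in choice, so that code is matched as a literal string rather than a regular expression):

```r
# Sample data copied from the question.
d <- read.table(text = "
Claim failure Part code matchcode
23     F1     P1    A   B,A,C
23     F1     P1    D   B,A,C
23     F2     P2    D   B,A,C
23     F2     P2    E   B,A,C
45     F1     P4    X   Y,Z,X
45     F1     P4    Y   Y,Z,X
45     F1     P4    A   Y,Z,X
45     F1     P1    F   Y,Z,X
45     F1     P1    H   Y,Z,X
", header = TRUE, stringsAsFactors = FALSE)

# Row-level flag: pair each code with the matchcode on the same row.
d$match <- mapply(grepl, d$code, d$matchcode,
                  MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)

# Collapse to one row per group: TRUE if any row in the group matched.
res <- aggregate(match ~ Claim + failure + Part, data = d, FUN = any)
res
```

The row order of res may differ from the expected output in the question, since aggregate() orders rows by the grouping variables.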

How to match values in a column, by group, against another column (holding multiple values) in the same data frame in R
