Question

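I have a data frame with 4 columns and 3000 rows. I want to check, for each row, whether the four columns contain four different strings. For example:

Row 1: Greece - Russia - Spain - Netherlands
Row 2: England - Germany - Germany - Iran
Row 3: Netherlands - Netherlands - Britain - Greece

So R should give me rows 2 and 3, because they contain duplicates. Is this possible? Thanks in advance.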

Answer 1

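We can use apply with MARGIN = 1 to loop over the rows and check whether the length of the unique elements in each row differs from the number of columns. This gives a logical vector that can be used to subset the rows of the dataset that contain at least one duplicate.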

df1[apply(df1, 1, FUN = function(x) length(unique(x)))!=ncol(df1),]
#       col1        col2    col3   col4
#2     England     Germany Germany   Iran
#3 Netherlands Netherlands Britain Greece
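To make the one-liner easier to follow, here is a minimal sketch that splits it into steps, using the example data from the question. The names n_unique, has_dup and has_dup2 are only illustrative. The last step shows an anyDuplicated() variant, which returns the index of the first duplicate (0 if there is none) and therefore has to be compared with 0 before it can be used as a logical index.

```r
# Example data from the question (rows 2 and 3 contain a repeated value).
df1 <- data.frame(
  col1 = c("Greece", "England", "Netherlands"),
  col2 = c("Russia", "Germany", "Netherlands"),
  col3 = c("Spain", "Germany", "Britain"),
  col4 = c("Netherlands", "Iran", "Greece"),
  stringsAsFactors = FALSE
)

# Step 1: number of distinct values in each row (MARGIN = 1 means "by row").
n_unique <- apply(df1, 1, function(x) length(unique(x)))

# Step 2: a row contains a duplicate when it has fewer distinct values than columns.
has_dup <- n_unique != ncol(df1)

# Step 3: keep only the rows flagged in step 2.
df1[has_dup, ]

# Variant: anyDuplicated() gives an index, not TRUE/FALSE, so compare with 0.
has_dup2 <- apply(df1, 1, anyDuplicated) > 0
df1[has_dup2, ]
```

Both variants should select the same rows; the second one can stop at the first repeated value instead of building the full set of unique values.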

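Another option is a regex-based approach (which should be faster): we paste the elements of each row into a single string, use grep with a regular expression to get the indices of the rows that contain a repeated string, and subset the rows with those indices.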

df1[grep("(\\b\\S+\\b)(?=.*\\1+)", do.call(paste, df1), perl = TRUE),]
#          col1        col2    col3   col4
# 2     England     Germany Germany   Iran
# 3 Netherlands Netherlands Britain Greece
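As a rough illustration of what the regex approach does, the sketch below shows the intermediate pasted strings and the matching step separately. It also adds a caveat that is not part of the answer above: because the lookahead does not require a word boundary around the repeat, a value that is a prefix of a later value (for example "Niger" followed by "Nigeria") is reported as a duplicate. The strict_pattern variable and the tricky test string are assumptions made up for this example, and values containing spaces would still need a different approach.

```r
# Example data from the question (rows 2 and 3 contain a repeated value).
df1 <- data.frame(
  col1 = c("Greece", "England", "Netherlands"),
  col2 = c("Russia", "Germany", "Netherlands"),
  col3 = c("Spain", "Germany", "Britain"),
  col4 = c("Netherlands", "Iran", "Greece"),
  stringsAsFactors = FALSE
)

# Step 1: collapse each row into one space-separated string.
row_strings <- do.call(paste, df1)

# Step 2: match a whole word that appears again later in the same string.
pattern <- "(\\b\\S+\\b)(?=.*\\1+)"
hits <- grep(pattern, row_strings, perl = TRUE)
df1[hits, ]

# Hypothetical stricter variant: the repeat must itself be a whole word.
strict_pattern <- "(\\b\\S+\\b)(?=.*\\b\\1\\b)"
tricky <- "Niger Nigeria Spain Iran"
loose_match  <- grepl(pattern, tricky, perl = TRUE)         # prefix counts as a repeat
strict_match <- grepl(strict_pattern, tricky, perl = TRUE)  # prefix is ignored
```

On the question's data both patterns select the same rows; they only differ when one value is contained in another.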

Benchmarks

df2 <- df1[rep(1:nrow(df1), 1e6),]
# anyDuplicated() returns an index (0 if none), so compare with 0 to get a logical vector
system.time(df2[apply(df2, 1L, anyDuplicated) > 0L,])
# user  system elapsed 
#  34.34    0.22   34.90 

system.time(df2[grep("(\\b\\S+\\b)(?=.*\\1+)", do.call(paste, df2), perl = TRUE),])
#   user  system elapsed 
#   9.53    0.05    9.61 

system.time(df2[apply(df2, 1, FUN = function(x) length(unique(x)))!=ncol(df2),])
#   user  system elapsed 
#  41.48    0.17   41.71

Data

df1 <- structure(list(col1 = c("Greece", "England", "Netherlands"), 
col2 = c("Russia", "Germany", "Netherlands"), col3 = c("Spain", 
"Germany", "Britain"), col4 = c("Netherlands", "Iran", "Greece"
 )), .Names = c("col1", "col2", "col3", "col4"), row.names = c(NA, 
 -3L), class = "data.frame")

Answer 2

A solution with dplyr and tidyr:

library(dplyr)
library(tidyr)

df_new <- df %>% 
    mutate(row = row_number()) %>% 
    gather(key, value, -row) %>% 
    group_by(row, value) %>% 
    mutate(n = n()) %>% 
    mutate(duplicate = ifelse(n > 1, TRUE, FALSE)) %>%
    # STOP HERE IF YOU WANT TO SEE DUPLICATES 
    filter(duplicate == TRUE) %>% 
    ungroup() %>% 
    # RUN DISTINCT IF YOU JUST WANT TO SEE ROWS WITH DUPES
    distinct(row)
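The pipeline above ends with the row numbers only. A minimal sketch of how to get back to the original rows could look like the following. It assumes that df holds the example data from the question, and it swaps in pivot_longer() and count() for gather() and the group_by()/mutate() steps, so it is a variation on the answer rather than the same code. The name dup_rows is only illustrative.

```r
library(dplyr)
library(tidyr)

# Assumption: df is the example data from the question.
df <- data.frame(
  col1 = c("Greece", "England", "Netherlands"),
  col2 = c("Russia", "Germany", "Netherlands"),
  col3 = c("Spain", "Germany", "Britain"),
  col4 = c("Netherlands", "Iran", "Greece"),
  stringsAsFactors = FALSE
)

dup_rows <- df %>%
  mutate(row = row_number()) %>%                       # remember the original row
  pivot_longer(cols = -row, names_to = "key",
               values_to = "value") %>%                # one line per (row, column)
  count(row, value) %>%                                # occurrences of each value per row
  filter(n > 1) %>%                                    # keep values seen more than once
  distinct(row)                                        # one entry per affected row

# Use the row numbers to subset the original data frame.
df[dup_rows$row, ]
```

Carrying the row number through the reshaping step is what makes it possible to index the original data frame at the end.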

Benchmark on 3000 rows

dfL <- Reduce(rbind, list(df)[rep(1L, times=1000)])
system.time( ... )
#  user  system elapsed 
# 0.004   0.000   0.004
