I have the data frame shown below:
df <- read.table(text =
"code Num mail identifier U_id
YY-12 12345 jjf@gmail.com ar145j U-111
YY-13 12345 jjf@gmail.com Ra145J U-111
YY-14 48654 ert@gmail.com at188R U-112
YY-15 48654 Ert@gmail.com At819R U-113
YY-16 88994 fty@ymail.com fr789U U-114
YY-17 88994 fty@ymail.com Rf789X U-115
YY-18 14500 foi@ymail.com xr747Y U-116
YY-19 14500 foi@ymail.com xY747C U-117", header = TRUE)
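Using the data frame above, I want to get the subset of rows where, for the same Num and mail, we have different identifiers with a consecutive 2-character difference.
For example, in the output below the identifier ar145j changes to Ra145J.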
Required output:
code Num mail identifier U_id
YY-12 12345 jjf@gmail.com ar145j U-111
YY-13 12345 jjf@gmail.com Ra145J U-111
YY-14 48654 ert@gmail.com at188R U-112
YY-15 48654 Ert@gmail.com At819R U-113
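
To make the rule a bit more concrete, here is a small sketch of one possible reading: compare two identifiers character by character, ignoring case, and report the positions where they differ. The helper name diff_positions is made up for illustration, and ignoring case is an assumption based on the ar145j / Ra145J example.

```r
# Hypothetical helper (not from any package): positions at which two
# equal-length strings differ when letter case is ignored.
diff_positions <- function(a, b) {
  x <- strsplit(tolower(a), "")[[1]]
  y <- strsplit(tolower(b), "")[[1]]
  stopifnot(length(x) == length(y))
  which(x != y)
}

diff_positions("ar145j", "Ra145J")  # 1 2
diff_positions("at188R", "At819R")  # 3 4 5
diff_positions("fr789U", "Rf789X")  # 1 2 6
diff_positions("xr747Y", "xY747C")  # 2 6
```

Under this reading, the two pairs in the required output differ in one unbroken run of positions, while the other two pairs differ in positions that are not adjacent.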
Answer 0 (score: 0)
Maybe this will help:
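library(tidyverse)
library(stringi)

df %>%
  # one group per Num/mail combination (mail is matched case-sensitively)
  group_by(Num, mail) %>%
  # keep single-row groups, or groups whose first and last identifiers
  # start with the same two letters in swapped order, ignoring case
  filter(n() == 1 |
           toupper(first(substr(identifier, 1, 2))) ==
             stri_reverse(toupper(last(substr(identifier, 1, 2)))))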
# A tibble: 6 x 5
# Groups:   Num, mail [4]
#  code    Num mail          identifier U_id
#  <fct> <int> <fct>         <fct>      <fct>
#1 YY-12 12345 jjf@gmail.com ar145j     U-111
#2 YY-13 12345 jjf@gmail.com Ra145J     U-111
#3 YY-14 48654 ert@gmail.com at188R     U-112
#4 YY-15 48654 Ert@gmail.com At819R     U-113
#5 YY-16 88994 fty@ymail.com fr789U     U-114
#6 YY-17 88994 fty@ymail.com Rf789X     U-115
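
Note that this result keeps YY-16 and YY-17, which are not in the required output: the filter only compares the first two characters of the identifiers, and ert@gmail.com / Ert@gmail.com end up in separate groups that pass through n() == 1. Below is a rough base R sketch of an alternative, not a definitive solution. It assumes that mail should be grouped case-insensitively and that a pair qualifies when its identifiers, compared without regard to case, differ in at least two positions that form a single unbroken run. Both assumptions are guesses from the sample data, and differs_in_one_run is a made-up helper name.

```r
# Rebuild the sample data; stringsAsFactors = FALSE keeps text as character.
df <- read.table(text =
"code Num mail identifier U_id
YY-12 12345 jjf@gmail.com ar145j U-111
YY-13 12345 jjf@gmail.com Ra145J U-111
YY-14 48654 ert@gmail.com at188R U-112
YY-15 48654 Ert@gmail.com At819R U-113
YY-16 88994 fty@ymail.com fr789U U-114
YY-17 88994 fty@ymail.com Rf789X U-115
YY-18 14500 foi@ymail.com xr747Y U-116
YY-19 14500 foi@ymail.com xY747C U-117",
header = TRUE, stringsAsFactors = FALSE)

# Hypothetical helper: TRUE when two equal-length strings, compared
# case-insensitively, differ in two or more positions that are all adjacent.
differs_in_one_run <- function(a, b) {
  x <- strsplit(tolower(a), "")[[1]]
  y <- strsplit(tolower(b), "")[[1]]
  if (length(x) != length(y)) return(FALSE)
  pos <- which(x != y)
  length(pos) >= 2 && all(diff(pos) == 1)
}

# Assumption: group on Num plus lower-cased mail, so ert@ and Ert@ fall together.
key <- paste(df$Num, tolower(df$mail))

keep <- logical(nrow(df))
for (rows in split(seq_len(nrow(df)), key)) {
  # Only two-row groups are checked in this sketch.
  if (length(rows) == 2 &&
      differs_in_one_run(df$identifier[rows[1]], df$identifier[rows[2]])) {
    keep[rows] <- TRUE
  }
}

result <- df[keep, ]
print(result)
```

On the sample data this keeps YY-12 through YY-15. Groups with more than two rows are simply dropped here, since the question does not say how they should be handled.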