Keep a specific part of a string

Time: 2017-03-03 12:29:08

Tags: r regex

In the data frame

 df <- structure(list(Var1 = structure(1:19, .Label = c("S2107810801_BY20", 
"S2107810801_BY20_CT", "S2111660501_BY3", "S2111660501_BY3_CT", 
"S2111660501_SE26", "S2111660501_SE27", "S2111660501_SE27_CT", 
"S2111660501_SE8", "S2111803201_SE12", "S2111831801_SE24", "S2112650301_SE21", 
"S2112650301_SE21_CT", "S2112650301_SE25", "S2112650301_SE25_CT", 
"S2113810301_BY12", "S2113810301_BY12_CT", "UNKNOWN", "XTYSKPLSKOLA_BY23", 
"XTYSKPLSKOLA_BY23_CT"), class = "factor"), Freq = c(341L, 14L, 
273L, 14L, 66L, 42L, 7L, 48L, 14L, 183L, 21L, 7L, 238L, 7L, 1202L, 
188L, 10L, 35L, 7L), per = c(12.5506072874494, 0.515274199484726, 
10.0478468899522, 0.515274199484726, 2.42914979757085, 1.54582259845418, 
0.257637099742363, 1.76665439823335, 0.515274199484726, 6.73536989326463, 
0.772911299227089, 0.257637099742363, 8.75966139124034, 0.257637099742363, 
44.23997055576, 6.9193963930806, 0.368052999631947, 1.28818549871181, 
0.257637099742363)), .Names = c("Var1", "Freq", "per"), row.names = c(NA, 
-19L), class = "data.frame")

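I want to keep a specific part of the string Var1 in a new variable land. I think I could use gsub, but I don't know whether it can remove multiple values. I want to remove everything from Var1 except the values in le <- c("SE", "BY"). I used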

df %>% mutate(land = gsub("[1-9]","",Var1))

but, as I wrote, I don't know how to make gsub remove the other characters and digits as well.

2 answers:

Answer 0: (score: 2)

This regular expression should work. Note that sub returns the full string when there is no match.

sub("^.*_(SE|BY).*$", "\\1", df$Var1)
 [1] "BY"      "BY"      "BY"      "BY"      "SE"      "SE"      "SE"      "SE"      "SE"      "SE"      "SE"     
[12] "SE"      "SE"      "SE"      "BY"      "BY"      "UNKNOWN" "BY"      "BY"

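Here \\1 is a backreference to the value captured by the parentheses (). The anchors ^ and $ are used because .* on its own can be risky: it matches zero or more of any character.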
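The "full string when there is no match" behaviour is why row 17 comes back as "UNKNOWN". If a missing value is preferred there, a minimal sketch could look like the following. The three-element vector x is a sample taken from Var1, and turning non-matches into NA is an assumption about what is wanted, not something the question states.

```r
# Small sample taken from Var1 in the question
x <- c("S2107810801_BY20", "S2111660501_SE27_CT", "UNKNOWN")

pattern <- "^.*_(SE|BY).*$"

# sub() replaces the whole matched string with the captured group;
# strings that do not match are returned unchanged
land <- sub(pattern, "\\1", x)

# Assumption: entries without a match should be NA instead of the original string
land[!grepl(pattern, x)] <- NA

land
# [1] "BY" "SE" NA
```

The same two steps work on df$Var1 directly, because sub and grepl coerce a factor to character.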

Answer 1: (score: 2)

We can use str_extract

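library(dplyr)    # provides %>% and mutate()
library(stringr)  # provides str_extract()
le <- c("SE", "BY")  # codes to keep, as defined in the question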
df %>%
   mutate(land = str_extract(Var1, paste(le, collapse="|")))
#                   Var1 Freq        per land
#1      S2107810801_BY20  341 12.5506073   BY
#2   S2107810801_BY20_CT   14  0.5152742   BY
#3       S2111660501_BY3  273 10.0478469   BY
#4    S2111660501_BY3_CT   14  0.5152742   BY
#5      S2111660501_SE26   66  2.4291498   SE
#6      S2111660501_SE27   42  1.5458226   SE
#7   S2111660501_SE27_CT    7  0.2576371   SE
#8       S2111660501_SE8   48  1.7666544   SE
#9      S2111803201_SE12   14  0.5152742   SE
#10     S2111831801_SE24  183  6.7353699   SE
#11     S2112650301_SE21   21  0.7729113   SE
#12  S2112650301_SE21_CT    7  0.2576371   SE
#13     S2112650301_SE25  238  8.7596614   SE
#14  S2112650301_SE25_CT    7  0.2576371   SE
#15     S2113810301_BY12 1202 44.2399706   BY
#16  S2113810301_BY12_CT  188  6.9193964   BY
#17              UNKNOWN   10  0.3680530 <NA>
#18    XTYSKPLSKOLA_BY23   35  1.2881855   BY
#19 XTYSKPLSKOLA_BY23_CT    7  0.2576371   BY
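The pattern passed to str_extract is just "SE|BY", so it would also pick up those letters anywhere else in the string. A base R sketch that builds the pattern from le and only accepts a code directly after an underscore is shown below. This swaps str_extract for regexpr and regmatches, and the underscore requirement is an assumption based on the sample data; x is a small sample taken from Var1.

```r
le <- c("SE", "BY")
x <- c("S2111660501_BY3_CT", "S2112650301_SE25", "UNKNOWN")

# paste() collapses the codes into the alternation "SE|BY"
alt <- paste(le, collapse = "|")

# Assumption: a valid code always follows an underscore
pattern <- paste0("_(", alt, ")")

# regexpr() gives the position of the first match, or -1 when there is none
m <- regexpr(pattern, x)

# Start with NA everywhere, then fill in the matched codes without the underscore
land <- rep(NA_character_, length(x))
land[m > 0] <- sub("^_", "", regmatches(x, m))

land
# [1] "BY" "SE" NA
```

Like str_extract, this yields NA for strings without a match, so UNKNOWN stays distinguishable from real codes.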