I have a character vector that looks like this:
[70] "CSF 5896-6133"
[71] "CRT 16"
[72] "SEEF 54-55"
[73] "CIF 190-195"
[74] "DE & /ON CIF 196-222"
[75] " CRT 17 "
[76] " SEEF 56-57"
[77] "DE & /ON CSF 6134-6725 "
[78] " SEEF 58-60"
[79] "CRT 18"
[80] " CSF 6726-6837"
[81] "SEEF 61"
[82] " CSF 6840-6926"
[83] " CIF 223-226"
[84] "SEEF 62-63"
[85] " CSF 6927-7065"
[86] " CIF 226-228"
[87] "CSF 7066-7185"
[88] "CSF 7186-7311"
[89] " CIF 229"
[90] " SEEF 66"
[91] "CSF 7312-7561"
[92] " CRT 19"
[93] " SEEF 67-68"
[94] "Final data QAQC done on CSF 1-7561"
[95] " CIF 1-229"
[96] " SEEF 1-68 "
[97] " CRT 1-19"
[98] "082015-HOBA-G17-1 changed to offPlot based on GIS review of searched area"
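As you can see, this is only part of it.
I want to remove everything that is not a number or one of the words CSF, CIF, SEEF, CRT.
So, for example, elements 94-98 would look like this: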
[94] "CSF 1-7561"
[95] " CIF 1-229"
[96] " SEEF 1-68 "
[97] " CRT 1-19"
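As you can see, element 98 is removed entirely because it contains none of the keywords I want. Element 94 also has some words removed.
Answer 0: (score: 3)
Consider the following vector: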
v <- c("Final data QAQC done on CSF 1-7561",
"CIF 1-229",
"SEEF 1-68",
"CRT 1-19",
"082015-HOBA-G17-1 changed to offPlot based on GIS review of searched area")
You can do this:
## vector with words to match
cond <- c("CSF", "CIF", "SEEF", "CRT")
## regex that captures digits and tolerates dashes (-)
reg <- "(\\d+-?)+$"
## pattern to match either words or regex
pattern <- paste(c(cond, reg), collapse = "|")
Then use stri_extract_all_regex() from the stringi package:
library(stringi)
stri_extract_all_regex(v, pattern)
which gives:
#[[1]]
#[1] "CSF" "1-7561"
#
#[[2]]
#[1] "CIF" "1-229"
#
#[[3]]
#[1] "SEEF" "1-68"
#
#[[4]]
#[1] "CRT" "1-19"
#
#[[5]]
#[1] NA
As @akrun mentioned, you can also do this in base R:
regmatches(v, gregexpr(pattern, v))
which gives:
#[[1]]
#[1] "CSF" "1-7561"
#
#[[2]]
#[1] "CIF" "1-229"
#
#[[3]]
#[1] "SEEF" "1-68"
#
#[[4]]
#[1] "CRT" "1-19"
#
#[[5]]
#character(0)
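To get back one string per element, as in the desired output in the question, the extracted pieces can be pasted together and the empty results dropped. A minimal base R sketch, assuming `v` and `pattern` as defined above (the names `pieces`, `keep`, and `cleaned` are only illustrative). Note that the `$` anchor in `reg` requires the number to sit at the very end of the string, so inputs with trailing spaces would probably need `trimws()` first.

```r
v <- c("Final data QAQC done on CSF 1-7561",
       "CIF 1-229",
       "SEEF 1-68",
       "CRT 1-19",
       "082015-HOBA-G17-1 changed to offPlot based on GIS review of searched area")
cond <- c("CSF", "CIF", "SEEF", "CRT")
reg <- "(\\d+-?)+$"
pattern <- paste(c(cond, reg), collapse = "|")

## list of matched pieces per element (character(0) when nothing matches)
pieces <- regmatches(v, gregexpr(pattern, v))
## keep only the elements with at least one match
keep <- lengths(pieces) > 0
## glue the keyword and the number range back together
cleaned <- vapply(pieces[keep], paste, character(1), collapse = " ")
cleaned
```

The last element of `v` has no keyword and no trailing number, so it is dropped from `cleaned` rather than kept as an empty entry.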
Answer 1: (score: 1)
Using stringr:
library(stringr)
testString <- c("Final data QAQC done on CSF 1-7561" ,
" CIF 1-229" ,
" SEEF 1-68 ",
" CRT 1-19",
"082015-HOBA-G17-1 changed to offPlot based on GIS review of searched area" )
## "(-\\d+)?" makes the range optional, so single numbers such as "CRT 16" also match
str_extract(testString, "(CSF|CIF|SEEF|CRT)\\s+\\d+(-\\d+)?")
#[1] "CSF 1-7561" "CIF 1-229" "SEEF 1-68" "CRT 1-19" NA
Answer 2: (score: 0)
I would use the stringr library.
Here is a subset of your data.
x <- c("CSF 5896-6133",
"CRT 16",
"SEEF 54-55",
"CIF 190-195",
"Final data QAQC done on CSF 1-7561",
"082015-HOBA-G17-1 changed to offPlot based on GIS review of searched area"
)
You can use str_extract with a regular expression that matches your pattern.
library(stringr)
str_extract(x, '(CSF|CIF|SEEF|CRT)[:space:]+([0-9]|-)+')
#[1] "CSF 5896-6133" "CRT 16" "SEEF 54-55" "CIF 190-195" "CSF 1-7561"
#[6] NA
When an element does not match the pattern, str_extract returns a missing value (NA) for it.
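Since the question also asks for entries without a keyword to disappear, the non-matching elements can be dropped instead of being kept as NA. A base R sketch under that assumption (`pat`, `hits`, and `res` are illustrative names, and the pattern is a base R rewrite using `[[:space:]]`, not the stringr call itself):

```r
x <- c("CSF 5896-6133",
       " CRT 17 ",
       "DE & /ON CIF 196-222",
       "Final data QAQC done on CSF 1-7561",
       "082015-HOBA-G17-1 changed to offPlot based on GIS review of searched area")

## keyword, whitespace, then a number with an optional "-number" range
pat <- "(CSF|CIF|SEEF|CRT)[[:space:]]+[0-9]+(-[0-9]+)?"

## regexpr() locates the first match in each element (-1 when there is none)
hits <- regexpr(pat, x)
## regmatches() returns only the matched substrings, so elements
## without a match are left out of the result
res <- regmatches(x, hits)
res
```

Because only the matched substring is returned, surrounding text and leading or trailing spaces are removed as well.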