Remove all words from a character vector except certain keywords

Time: 2016-05-26 16:34:38

Tags: r character text-mining

I have a character vector that looks like this:

[70] "CSF  5896-6133"                                                           
[71] "CRT  16"                                                                  
[72] "SEEF  54-55"                                                              
[73] "CIF  190-195"                                                             
[74] "DE & /ON CIF  196-222"                                                    
[75] " CRT  17 "                                                                
[76] " SEEF  56-57"                                                             
[77] "DE & /ON CSF  6134-6725 "                                                 
[78] " SEEF  58-60"                                                             
[79] "CRT 18"                                                                   
[80] " CSF 6726-6837"                                                           
[81] "SEEF 61"                                                                  
[82] " CSF 6840-6926"                                                           
[83] " CIF 223-226"                                                             
[84] "SEEF 62-63"                                                               
[85] " CSF 6927-7065"                                                           
[86] " CIF 226-228"                                                             
[87] "CSF 7066-7185"                                                            
[88] "CSF 7186-7311"                                                            
[89] " CIF 229"                                                                 
[90] " SEEF 66"                                                                 
[91] "CSF 7312-7561"                                                            
[92] " CRT 19"                                                                  
[93] " SEEF 67-68"                                                              
[94] "Final data QAQC done on CSF  1-7561"                                      
[95] " CIF  1-229"                                                              
[96] " SEEF  1-68 "                                                             
[97] " CRT  1-19"                                                               
[98] "082015-HOBA-G17-1 changed to offPlot based on GIS review of searched     area"

As you can see, this is only part of it.

I want to remove every word that is not a number or one of the following keywords:

CSF, CIF, SEEF, CRT

So, for example, elements 94-98 would become:

[94] "CSF  1-7561"                                      
[95] " CIF  1-229"                                                              
[96] " SEEF  1-68 "                                                             
[97] " CRT  1-19"                                                               

As you can see, element 98 is removed entirely because it contains none of the keywords I want, and element 94 has its extra words stripped.

3 Answers:

Answer 0: (score: 3)

Consider the following vector:

v <- c("Final data QAQC done on CSF  1-7561", 
       "CIF  1-229", 
       "SEEF  1-68", 
       "CRT  1-19",
       "082015-HOBA-G17-1 changed to offPlot based on GIS review of searched     area")

You can do this:

## vector with words to match
cond <- c("CSF", "CIF", "SEEF", "CRT")
## regex that captures digits and tolerates dashes (-) 
reg  <- "(\\d+-?)+$"
## pattern to match either words or regex 
pattern <- paste(c(cond, reg), collapse = "|")

Then use stri_extract_all_regex() from the stringi package:

library(stringi)
stri_extract_all_regex(v, pattern)

Which gives:

#[[1]]
#[1] "CSF"    "1-7561"
#
#[[2]]
#[1] "CIF"   "1-229"
#
#[[3]]
#[1] "SEEF" "1-68"
#
#[[4]]
#[1] "CRT"  "1-19"
#
#[[5]]
#[1] NA

As @akrun mentioned, you can also do this in base R:

regmatches(v, gregexpr(pattern, v))

Which gives:

#[[1]]
#[1] "CSF"    "1-7561"
#
#[[2]]
#[1] "CIF"   "1-229"
#
#[[3]]
#[1] "SEEF" "1-68"
#
#[[4]]
#[1] "CRT"  "1-19"
#
#[[5]]
#character(0)
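
The question asks for a character vector with the non-matching element dropped, while both calls above return a list. One possible follow-up step, sketched here in base R, is to discard the empty elements and paste the kept tokens back together. It assumes that joining the tokens with a single space is acceptable; the names `matches` and `cleaned` are illustrative and do not come from the answer.

```r
# Same input and pattern as in the answer above
v <- c("Final data QAQC done on CSF  1-7561",
       "CIF  1-229",
       "SEEF  1-68",
       "CRT  1-19",
       "082015-HOBA-G17-1 changed to offPlot based on GIS review of searched     area")

cond    <- c("CSF", "CIF", "SEEF", "CRT")
reg     <- "(\\d+-?)+$"
pattern <- paste(c(cond, reg), collapse = "|")

# List of matches, one element per input string
matches <- regmatches(v, gregexpr(pattern, v))

# Drop strings with no match at all (character(0))
matches <- matches[lengths(matches) > 0]

# Collapse each set of tokens into one string (assumption: single space separator)
cleaned <- vapply(matches, paste, character(1), collapse = " ")
cleaned
```

Because `reg` is anchored with `$`, strings with trailing whitespace (such as " SEEF  1-68 " in the question) would probably need `trimws()` applied first.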

Answer 1: (score: 1)

Using stringr:

library(stringr)
testString <- c("Final data QAQC done on CSF  1-7561" ,
                " CIF  1-229" ,
                " SEEF  1-68 ",
                " CRT  1-19",
                "082015-HOBA-G17-1 changed to offPlot based on GIS review of searched     area" )

str_extract(testString, "(CSF|CIF|SEEF|CRT)\\s+\\d+-\\d+")
#[1] "CSF  1-7561" "CIF  1-229"  "SEEF  1-68"  "CRT  1-19"   NA 
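
Note that `\\d+-\\d+` requires a range, so an entry with a single number, such as "CRT  16" in the question's data, would not match. A minimal sketch of one way around this, assuming the range part can simply be made optional, is shown below in base R rather than stringr. The sample vector `x` and the name `hits` are illustrative.

```r
# A few entries in the style of the question's data (illustrative sample)
x <- c("CRT  16",
       " SEEF  56-57",
       "DE & /ON CSF  6134-6725 ",
       "082015-HOBA-G17-1 changed to offPlot")

# Keyword, whitespace, a number, and optionally "-number"
pattern <- "(CSF|CIF|SEEF|CRT)\\s+\\d+(-\\d+)?"

# regexpr() finds the first match per string; regmatches() then
# returns only the strings that matched, so non-matches are dropped
hits <- regmatches(x, regexpr(pattern, x, perl = TRUE))
hits
```

The same pattern should also work with `str_extract()`, which would return `NA` for the non-matching string instead of dropping it.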

Answer 2: (score: 0)

I would use the stringr library.

Here is a subset of your data.

x <- c("CSF  5896-6133",                                                           
"CRT  16",                                                                  
"SEEF  54-55",                                                              
"CIF  190-195",
"Final data QAQC done on CSF  1-7561",
"082015-HOBA-G17-1 changed to offPlot based on GIS review of searched     area"
)

You can use str_extract with a regular expression that matches your pattern.

library(stringr)

str_extract(x, '(CSF|CIF|SEEF|CRT)[:space:]+([0-9]|-)+')
#[1] "CSF  5896-6133" "CRT  16"        "SEEF  54-55"    "CIF  190-195"   "CSF  1-7561"   
#[6] NA 

When an element does not match the pattern, str_extract returns a missing value (NA) for it.
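
Since the question wants the non-matching element removed rather than kept as a missing value, a possible last step is to filter out the `NA` entries. The sketch below copies the result shown above by hand into a hypothetical variable `res`, so it runs without stringr. Collapsing the repeated spaces is an optional extra and an assumption on my part, as the question's expected output keeps them.

```r
# Result of the str_extract() call above, copied by hand (hypothetical name)
res <- c("CSF  5896-6133", "CRT  16", "SEEF  54-55",
         "CIF  190-195", "CSF  1-7561", NA)

# Keep only the elements where the pattern matched
kept <- res[!is.na(res)]

# Optional: collapse runs of whitespace into a single space
kept <- gsub("\\s+", " ", kept)
kept
```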