Extract (grep) a substring from a character vector in R

Time: 2018-01-03 06:47:32

Tags: r string dataframe vector

I only need to capture the entries whose string starts with "passport" (i.e. those matching ^passport).

Example:

entry = c("passport AR4133553 expires 11 mar 2019","passport 472420180","passport 563220533 (korea, north)",
          "passport iraq","passport m 788439","following data derived from an eritrean passport issued",
          "passport and national") 

Desired output: the data should capture only the passport number and the country name.

**passport**  **passport_country**  
"AR4133553"   NA   
"472420180"   NA   
"563220533"   "korea, north"  
NA            "iraq"  
"788439"      NA  
NA            NA  
NA            NA  

Thanks in advance.

1 answer:

Answer 0 (score: 0)

Hope this helps!

#sample data
entry = c("passport AR4133553 expires 11 mar 2019",
          "passport 472420180",
          "passport 563220533 (korea, north)",
          "passport iraq",
          "passport m 788439",
          "following data derived from an eritrean passport issued",
          "passport and national") 

#fetch the passport number: the token containing a digit that immediately follows 'passport'
#(an entry such as "passport m 788439" does not match, because "m" sits between 'passport' and the number)
pattern <- "^passport\\s((([a-zA-Z]*\\d)|(\\d[a-zA-Z]*))\\S*).*"
passport_no <- sub(pattern, "\\1", entry, perl=TRUE)
#set non-matching entries to NA; a logical mask also behaves correctly when no entry matches
passport_no[!grepl(pattern, entry, perl=TRUE)] <- NA

#fetch passport country from sample data by matching each entry against the country names in wrld_simpl
library(maptools)
data(wrld_simpl)
passport_country <- lapply(gsub("[()]","",entry), function(x) 
  as.character(wrld_simpl@data$NAME[sapply(wrld_simpl@data$NAME, grepl, x, ignore.case=TRUE)]))
#replace empty matches with NA
passport_country <- lapply(passport_country, function(x) 
  if(identical(x, character(0))) NA_character_ else x)
#note that 'Korea, North' is not selected in the comparison above, as its official country name is 'Korea, Democratic People's Republic of'
#note also that this matches entries that do not start with 'passport' (e.g. 'Eritrea' in entry 6)

#final data
df <- data.frame(cbind(passport_no, passport_country))
df

The output is:

  passport_no passport_country
1   AR4133553               NA
2   472420180               NA
3   563220533               NA
4          NA             Iraq
5          NA               NA
6          NA          Eritrea
7          NA               NA
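
The output above differs from the desired output in rows 3, 5 and 6. A possible base-R variant that reproduces the desired table on the sample data is sketched below. It rests on assumptions that are not in the question: a single stray letter token (such as "m") may sit between "passport" and the number, a country is either the text in parentheses or the whole remainder of the entry, and `known_countries` is a hypothetical lookup vector that you would replace with a real country list.

```r
entry <- c("passport AR4133553 expires 11 mar 2019",
           "passport 472420180",
           "passport 563220533 (korea, north)",
           "passport iraq",
           "passport m 788439",
           "following data derived from an eritrean passport issued",
           "passport and national")

# Only entries that start with "passport" are considered at all.
starts <- grepl("^passport\\b", entry, perl = TRUE)

# Passport number: the first token containing a digit, optionally preceded
# by one single-letter token (assumption based on "passport m 788439").
num_pattern <- "^passport\\s+(?:[a-zA-Z]\\s+)?([a-zA-Z]*\\d\\S*).*$"
has_num <- grepl(num_pattern, entry, perl = TRUE)
passport_no <- ifelse(has_num,
                      sub(num_pattern, "\\1", entry, perl = TRUE),
                      NA_character_)

# Country, rule 1: text inside parentheses, e.g. "(korea, north)".
paren_pattern <- "^passport\\b.*\\(([^()]+)\\).*$"
in_paren <- grepl(paren_pattern, entry, perl = TRUE)

# Country, rule 2: everything after "passport" is exactly a known country.
known_countries <- c("iraq", "eritrea")  # hypothetical lookup, replace with a real list
rest <- trimws(sub("^passport\\b", "", entry, perl = TRUE))
is_known <- starts & tolower(rest) %in% known_countries

passport_country <- ifelse(in_paren,
                           sub(paren_pattern, "\\1", entry, perl = TRUE),
                           ifelse(is_known, rest, NA_character_))

df <- data.frame(passport = passport_no,
                 passport_country = passport_country,
                 stringsAsFactors = FALSE)
df
```

Because both patterns are anchored with ^passport, entry 6 (which only mentions a passport mid-sentence) stays NA in both columns, and "passport and national" stays NA because its remainder is not in the lookup.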