Only entries whose string starts with "passport" (regex ^passport) should be captured.
Example:
entry = c("passport AR4133553 expires 11 mar 2019","passport 472420180","passport 563220533 (korea, north)",
"passport iraq","passport m 788439","following data derived from an eritrean passport issued",
"passport and national")
Desired output: only the passport number and the country name should be captured.
**passport** **passport_country**
"AR4133553" NA
"472420180" NA
"563220533" "korea, north"
NA "iraq"
"788439" NA
NA NA
NA NA
Thanks in advance.
Answer 0 (score: 0)
Hope this helps!
#sample data
entry = c("passport AR4133553 expires 11 mar 2019",
"passport 472420180",
"passport 563220533 (korea, north)",
"passport iraq",
"passport m 788439",
"following data derived from an eritrean passport issued",
"passport and national")
#fetch passport number from sample data (i.e. the token immediately after 'passport', if it contains a digit)
pattern <- "^passport\\s((([a-zA-Z]*\\d)|(\\d[a-zA-Z]*))\\S*).*"
passport_no <- gsub(pattern, "\\1", entry, perl = TRUE)
#entries that do not match the pattern are returned unchanged by gsub, so set them to NA
#(a logical mask is used because negative indexing with an empty index would select nothing)
passport_no[!grepl(pattern, entry, perl = TRUE)] <- NA
#fetch passport country from sample data
library(maptools)
data(wrld_simpl)
#drop parentheses, then keep every country name from wrld_simpl that occurs in the entry
passport_country <- lapply(gsub("[()]", "", entry), function(x)
  as.character(wrld_simpl@data$NAME[sapply(wrld_simpl@data$NAME, grepl, x, ignore.case = TRUE)]))
#replace empty matches with NA
passport_country <- lapply(passport_country, function(x)
  if (identical(x, character(0))) NA_character_ else x)
#note that 'Korea, North' is not selected in the above comparison, as its official country name is 'Korea, Democratic People's Republic of'
#final data
df <- data.frame(cbind(passport_no, passport_country))
df
The output is:
passport_no passport_country
1 AR4133553 NA
2 472420180 NA
3 563220533 NA
4 NA Iraq
5 NA NA
6 NA Eritrea
7 NA NA
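This output differs from the desired one in rows 3, 5 and 6: "korea, north" is missed, "m 788439" yields no number, and the Eritrea entry is kept although it does not start with "passport". As a rough sketch of one way to close those gaps in base R only, the snippet below allows an optional single-letter token before the number, reads the country from parentheses, and otherwise compares the remaining text against a lookup vector. The `known_countries` vector is a hypothetical stand-in (it is not in the original post), and both the single-letter rule and the exact-match rule are assumptions inferred from the sample data.

```r
entry <- c("passport AR4133553 expires 11 mar 2019",
           "passport 472420180",
           "passport 563220533 (korea, north)",
           "passport iraq",
           "passport m 788439",
           "following data derived from an eritrean passport issued",
           "passport and national")

# Hypothetical lookup of country names (an assumption, not from the original post)
known_countries <- c("iraq", "korea, north", "eritrea")

# Only entries that start with 'passport' are considered at all
starts_ok <- grepl("^passport\\s", entry)

# Passport number: the token right after 'passport' that contains a digit,
# optionally preceded by a single-letter token such as "m"
num_pattern <- "^passport\\s+(?:[a-z]\\s+)?([a-z]*\\d[a-z0-9]*).*$"
has_num <- grepl(num_pattern, entry, ignore.case = TRUE, perl = TRUE)
passport <- ifelse(has_num,
                   sub(num_pattern, "\\1", entry, ignore.case = TRUE, perl = TRUE),
                   NA_character_)

# Country, case 1: text inside parentheses
in_paren <- ifelse(grepl("\\(.*\\)", entry),
                   sub("^.*\\((.*)\\).*$", "\\1", entry),
                   NA_character_)

# Country, case 2: everything after 'passport' is exactly a known country name
rest <- sub("^passport\\s+", "", entry)
bare <- ifelse(tolower(rest) %in% known_countries, rest, NA_character_)

passport_country <- ifelse(!is.na(in_paren), in_paren, bare)
passport_country[!starts_ok] <- NA_character_

result <- data.frame(passport = passport,
                     passport_country = passport_country,
                     stringsAsFactors = FALSE)
print(result)
```

On the seven sample entries this reproduces the table requested in the question. For real data the lookup would need to be a full list of country names and spellings, and the number pattern would likely need adjusting to the formats that actually occur.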