I am trying to extract some words (country names) from strings. The strings are elements of a list, for example:
myList <- list(associations = c("Madeup speciesone: \r\n\t\t\t\t", "Foobarae foobar: Russia - 123,",
"Foobarus foobar France - 7007,Italy - 7007,Portugal - 6919,Ukraine - 42264,Russia - 7009,",
"Foobarus foobarbar",
"Foobaria foobariana f. sp. foobaricol Japan - 254, China - 256,"))
I want to extract the country names. For example, I can do this:
library(stringr)  # provides str_extract_all() and word()
Country_name <- lapply(myList, pattern = "China|France|Italy|Ukraine", str_extract_all)
country_list <- vector()
for(i in 1:length(Country_name[[1]])){
country_list[i] <- paste(Country_name[[1]][[i]], collapse = ",")
}
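But this only works if every possible country is listed in the pattern, which seems laborious.
Is there a way to use a regular expression to extract all the country names? Something like starting from the second capitalized word and then extracting every country up to the end of the string?
Something like lapply(myList, word, 3) does not work well, because the species names vary in length (e.g. Foobaria foobariana f. sp. foobaricol).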
# desired output
country_list <- c("","Russia","France,Italy,Portugal,Ukraine,Russia","","Japan,China")
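Answer 0 (score: 0)
You can use the countrycode package:
library(countrycode)
library(stringr)  # provides str_extract_all(), used below
# Note: newer countrycode releases appear to have renamed this data frame to
# codelist (with the column country.name.en); adjust if countrycode_data is not found.
countries <- as.data.frame(countrycode_data$country.name)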
If you want to stick with your own code, you can build a single string that contains all the country names separated by "|":
all <- paste(countrycode_data$country.name, collapse="|")
Then run:
Country_name <- lapply(myList, pattern = all, str_extract_all)
country_list <- vector()
for(i in 1:length(Country_name[[1]])){
country_list[i] <- paste(Country_name[[1]][[i]], collapse = ",")
}
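As a rough illustration of the same idea without the package, here is a minimal sketch in base R. The vector known_names is a hypothetical, hand-picked list that stands in for the package's country names, and strings is a shortened copy of the question's data. The sketch assumes the names contain no regex metacharacters. It also shows that a name is only found if its exact spelling is in the list, so a short form such as "Russia" has to be added by hand when the list only carries "Russian Federation".

```r
# Hypothetical name list (an assumption for illustration, not the package data).
known_names <- c("France", "Italy", "Portugal", "Ukraine",
                 "Russian Federation", "Russia", "Japan", "China")

# Try longer names first, so that "Russian Federation" wins over "Russia"
# in engines that take the first alternative that matches.
known_names <- known_names[order(nchar(known_names), decreasing = TRUE)]
name_pattern <- paste(known_names, collapse = "|")

strings <- c("Foobarae foobar: Russia - 123,",
             "Foobarus foobar France - 7007,Italy - 7007,",
             "Foobarus foobarbar")

# gregexpr() finds every match per string; regmatches() pulls the matched text.
matches <- regmatches(strings, gregexpr(name_pattern, strings, perl = TRUE))

# Collapse each string's matches into one comma-separated entry ("" if none).
found <- vapply(matches, paste, character(1), collapse = ",")
found
```

Using vapply() here replaces the explicit for loop and guarantees one character value per input string.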
This should give you the result:
myList <- list(associations = c("Madeup speciesone: \r\n\t\t\t\t", "Foobarae foobar: Russia - 123,",
"Foobarus foobar France - 7007,Italy - 7007,Portugal - 6919,Ukraine - 42264,Russia - 7009,",
"Foobarus foobarbar",
"Foobaria foobariana f. sp. foobaricol Japan - 254, China - 256,",
"Germany",
"555Senegal"))
Country_name <- lapply(myList, pattern = all, str_extract_all)
country_list <- vector()
for(i in 1:length(Country_name[[1]])){
country_list[i] <- paste(Country_name[[1]][[i]], collapse = ",")
}
country_list
[1] "" "" "France,Italy,Portugal,Ukraine"
[4] "" "Japan,China" "Germany"
[7] "Senegal"
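The output above misses "Russia", presumably because the name list spells it differently. The regex idea raised in the question can be sketched without any name list, under one assumption that is not guaranteed by the question: every country name is immediately followed by " - " and a number. The pattern below takes one or more capitalized words and keeps them only if that suffix follows. It is written in base R so that it runs without extra packages; the variable names are illustrative.

```r
myList <- list(associations = c(
  "Madeup speciesone: \r\n\t\t\t\t",
  "Foobarae foobar: Russia - 123,",
  "Foobarus foobar France - 7007,Italy - 7007,Portugal - 6919,Ukraine - 42264,Russia - 7009,",
  "Foobarus foobarbar",
  "Foobaria foobariana f. sp. foobaricol Japan - 254, China - 256,"))

# One or more capitalized words, kept only when followed by " - <digit>".
# The lookahead (?=...) checks for the suffix without including it in the match.
country_pattern <- "[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*(?=\\s*-\\s*\\d)"

strings <- myList$associations
matches <- regmatches(strings, gregexpr(country_pattern, strings, perl = TRUE))

# One comma-separated entry per input string; strings without a match give "".
country_list <- vapply(matches, paste, character(1), collapse = ",")
country_list
```

This sketch has known gaps: names containing lowercase words or punctuation (for example "Republic of Korea") would be cut short, and a capitalized word directly in front of a country would be swallowed into the match. Entries such as "Germany" or "555Senegal" with no trailing number would not be found at all, so the name-list approach remains the safer choice for that kind of data.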