Question

I am trying to extract some words (country names) from strings. The strings are elements of a list, for example:

myList <- list(associations =  c("Madeup speciesone: \r\n\t\t\t\t",  "Foobarae foobar: Russia - 123,",
                              "Foobarus foobar France -  7007,Italy -  7007,Portugal -  6919,Ukraine -  42264,Russia -  7009,", 
                              "Foobarus foobarbar", 
                              "Foobaria foobariana f. sp. foobaricol Japan - 254, China - 256,"))

I want to extract the country names. For example, I can do this:

Country_name <- lapply(myList, pattern = "China|France|Italy|Ukraine", str_extract_all)
country_list <- vector()
for(i in 1:length(Country_name[[1]])){
  country_list[i] <- paste(Country_name[[1]][[i]], collapse = ",")
}

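But this only works if I list every possible country, which seems laborious.

Is there a way to extract all the country names with a regular expression? Something like starting from the second capitalized word and then extracting every country up to the end of the string?

Something like lapply(myList, word, 3) does not work well, because the species names vary in length (e.g. Foobaria foobariana f. sp. foobaricol).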

# desired output
country_list <- c("","Russia","France,Italy,Portugal,Ukraine,Russia","","Japan,China")

Answer 1

You can use the countrycode package to get a list of country names:

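library(countrycode)
# countrycode_data ships with older releases of the package; newer releases
# expose the names as codelist$country.name.en instead
countries <- as.data.frame(countrycode_data$country.name)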

If you want to stick with your own code, you can build a single string that contains all the country names separated by "|":

all <- paste(countrycode_data$country.name, collapse="|")

Then run:

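library(stringr)  # provides str_extract_all()
Country_name <- lapply(myList, pattern = all, str_extract_all)

# Collapse the matches found in each string into one comma-separated value
country_list <- vector()
for(i in 1:length(Country_name[[1]])){
  country_list[i] <- paste(Country_name[[1]][[i]], collapse = ",")
}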

This should give you the following result:

myList <- list(associations =  c("Madeup speciesone: \r\n\t\t\t\t",  "Foobarae foobar: Russia - 123,",
                             "Foobarus foobar France -  7007,Italy -  7007,Portugal -  6919,Ukraine -  42264,Russia -  7009,", 
                             "Foobarus foobarbar", 
                             "Foobaria foobariana f. sp. foobaricol Japan - 254, China - 256,",
                             "Germany",
                             "555Senegal")) 

Country_name <- lapply(myList, pattern = all, str_extract_all)

country_list <- vector()

for(i in 1:length(Country_name[[1]])){
  country_list[i] <- paste(Country_name[[1]][[i]], collapse = ",")
}

country_list
[1] ""          ""                "France,Italy,Portugal,Ukraine"
[4] ""          "Japan,China"     "Germany"                      
[7] "Senegal"
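
The same idea can be sketched in base R without the loop. This is only an illustration under stated assumptions: the short `country_names` vector is a hypothetical stand-in for a full name list, and `x` holds the strings from the question. It builds one alternation pattern and collapses the matches for each string.

```r
# Hypothetical short name list, for illustration only; a real run would use a
# complete list of country names from a package or a lookup file.
country_names <- c("China", "France", "Italy", "Japan",
                   "Portugal", "Russia", "Ukraine")

# One regular expression of the form "China|France|Italy|..."
pattern <- paste(country_names, collapse = "|")

x <- c("Madeup speciesone: \r\n\t\t\t\t",
       "Foobarae foobar: Russia - 123,",
       "Foobarus foobar France -  7007,Italy -  7007,Portugal -  6919,Ukraine -  42264,Russia -  7009,",
       "Foobarus foobarbar",
       "Foobaria foobariana f. sp. foobaricol Japan - 254, China - 256,")

# gregexpr() locates every match; regmatches() returns one character vector
# of matches per input string (empty when nothing matched)
matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

# Collapse each vector of matches into a single comma-separated string;
# an empty vector collapses to ""
country_list <- vapply(matches, paste, character(1), collapse = ",")
country_list
```

With `perl = TRUE` the alternatives are tried in the order given, so if one name is a prefix of another (Niger and Nigeria, say), listing the longer name first is probably the safer choice.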
