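Extracting country names from strings in R by matching

Time: 2017-12-27 22:46:16

Tags: r web-scraping dplyr stringr data-processing

I have been scraping review data from a website, and in the process I end up with a character vector of strings that contain the user name, review count, review date, and country. They look roughly like this: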

raw <- c("Anna (1025) - North Carolina, USA - DEC 20, 2017", 
"James (10) - - MEXICO - NOV 22, 2017", 
"Susane (222) - Oulu, FINLAND - JUNE 1, 2016", 
"Alex (20000) - SOUTH KOREA- MAR 11, 2015")

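So far I can extract the name, the review count, and the date, because they sit in fixed positions or have a consistent format. The problem is that the country name is not formatted consistently, and the individual data points in each string are not consistently separated by commas or dashes. Extracting only the upper-case strings breaks down when the country is missing or when the name consists of two parts.

The maps package contains a list of countries. Is there a way to use str_extract_all from stringr to look for a match in that vector of country names and extract it?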

4 Answers:

Answer 0: (score: 2)

You can do this with the maps library, as follows:

library(maps)

## Loading country data from package maps
data(world.cities)

raw <- c("Anna (1025) - North Carolina, USA - DEC 20, 2017", 
     "James (10) - - MEXICO - NOV 22, 2017", 
     "Susane (222) - Oulu, FINLAND - JUNE 1, 2016", 
     "Alex (20000) - SOUTH KOREA- MAR 11, 2015")

###Removing punctuation
raw <- gsub("[[:punct:]\n]","",raw)

# Split each string into words on single spaces
raw2 <- strsplit(raw, " ")

# Match each word against the country names in world.cities$country.etc
CountryList_raw <- (lapply(raw2, function(x)x[which(toupper(x) %in% toupper(world.cities$country.etc))]))

do.call(rbind, lapply(CountryList_raw, as.data.frame))

#      X[[i]]
#1        USA
#2     MEXICO
#3    FINLAND
  

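This works. However, you will have to fix up countries whose names consist of more than one word afterwards, such as South Korea in this example. strsplit breaks the string into single words, so a two-word name such as SOUTH KOREA can never match as a whole, which is why the fourth string is missing from the output.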
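One way around the multi-word limitation is to skip the word split and search each whole string for any name from the list. The sketch below uses base R only and a small hypothetical `countries` vector standing in for a real list (for example one derived from the maps package); it also assumes that the names contain no regex metacharacters.

```r
raw <- c("Anna (1025) - North Carolina, USA - DEC 20, 2017",
         "James (10) - - MEXICO - NOV 22, 2017",
         "Susane (222) - Oulu, FINLAND - JUNE 1, 2016",
         "Alex (20000) - SOUTH KOREA- MAR 11, 2015")

# Hypothetical lookup vector, used for illustration only.
countries <- c("USA", "Mexico", "Finland", "Korea", "South Korea")

# Try longer names first, so that when two names could start at the
# same position the longer one wins.
countries <- countries[order(nchar(countries), decreasing = TRUE)]

# One alternation of all names, anchored at word boundaries.
# Assumption: the names contain no regex metacharacters.
pattern <- paste0("\\b(", paste(countries, collapse = "|"), ")\\b")

# regexpr() reports the first match per string, or -1 if there is none.
m <- regexpr(pattern, raw, ignore.case = TRUE, perl = TRUE)
country <- rep(NA_character_, length(raw))
country[m > 0] <- regmatches(raw, m)
country
```

With stringr, the last three steps could be written as `stringr::str_extract(raw, stringr::regex(pattern, ignore_case = TRUE))`, which likewise returns NA for strings without a match.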

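Answer 1: (score: 1)

TL;DR

I took the raw data and turned it into a data frame. Then, column by column, I used a combination of regular expressions and row-wise iteration to extract the desired information.

Import the necessary packages and raw data

To follow along, you need to install the following packages:

  • BBmisc: miscellaneous helper functions from B. Bischl and others, mostly for package development.

  • maps: draws geographical maps.

  • magrittr: a set of operators that make your code more readable.

  • purrr: a complete and consistent functional programming toolkit for R.

  • stringr: string manipulation helpers; the code below calls stringr::str_to_title().

If you already have all of them installed, you can skip the install.packages() call.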

install.packages( pkgs = c(  "BBmisc", "maps", "magrittr", "purrr", "stringr" ) )
library( BBmisc )
library( maps )
library( magrittr )
library( purrr )

Import the raw data

raw <- c("Anna (1025) - North Carolina, USA - DEC 20, 2017", 
         "James (10) - - MEXICO - NOV 22, 2017", 
         "Susane (222) - Oulu, FINLAND - JUNE 1, 2016", 
         "Alex (20000) - SOUTH KOREA- MAR 11, 2015")

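Declare four columns

Given the data stored in raw, four columns seem appropriate:

  • user_name: the name of the user

  • user_review_number: the identifying number associated with the user's review

  • user_country: the user's country

  • user_review_date: the date, in Month Day, Year format, on which the user's review was created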

    raw <- data.frame( user_name = raw
           , user_review_number = raw
           , user_country = raw
           , user_review_date = raw
           , stringsAsFactors = FALSE
           )
    

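Regular expressions

Regular expressions allow complex and flexible search and replace using a specific syntax. They are used here to extract the relevant data from the raw data set.

Identifying raw$user_name

This column contains the user name, which precedes the parentheses.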

raw$user_name <- strsplit( x = raw$user_name
                           , split = "\\(|\\)"
                           , fixed = FALSE 
                           )
# keep only the first element from each list, then unlist to obtain a character vector
raw$user_name <- 
  purrr::map( .x = raw$user_name, .f = 1 ) %>%
  unlist()

# remove trailing whitespace
raw$user_name <- trimws( x = raw$user_name
                         , which = "right"
                         )

Identifying raw$user_review_number

This column contains the user's review number, an integer of 1 to 10 digits between the two parentheses.

raw$user_review_number <- strsplit( x = raw$user_review_number
                                    , split = "\\(|\\)"
                                    , fixed = FALSE 
                                    )
# keep only the second element from each list, then unlist to obtain a character vector
# and cast as integer
raw$user_review_number <- 
  purrr::map( .x = raw$user_review_number, .f = 2 ) %>%
  unlist() %>%
  as.integer()
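As a more compact alternative for these two columns, the same values can be pulled out with a single sub() call each. This is only a sketch in base R, run on a plain character vector (named `reviews` here for illustration) rather than on the data frame, and it assumes every string contains exactly one "(digits)" group after the name.

```r
reviews <- c("Anna (1025) - North Carolina, USA - DEC 20, 2017",
             "James (10) - - MEXICO - NOV 22, 2017",
             "Susane (222) - Oulu, FINLAND - JUNE 1, 2016",
             "Alex (20000) - SOUTH KOREA- MAR 11, 2015")

# Drop everything from the first "(" onwards, then trim the trailing space.
user_name <- trimws(sub("\\(.*$", "", reviews))

# Keep only the digits captured inside the first pair of parentheses.
user_review_number <- as.integer(
  sub("^[^(]*\\(([0-9]+)\\).*$", "\\1", reviews)
)

data.frame(user_name, user_review_number, stringsAsFactors = FALSE)
```

If a string did not contain the parenthesised number, sub() would return it unchanged and as.integer() would yield NA with a warning, so that case is worth checking on real data.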

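Identifying raw$user_country

This column is a bit trickier. Some countries are set off by a comma, others have two-part names (e.g. SOUTH KOREA), some are abbreviations (e.g. USA), and some come with state information (e.g. North Carolina, USA).

There are a hundred ways to do this; the logic I used is laid out step by step in the comments of the code below.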

# first, split by the parentheses
raw$user_country <- strsplit( x = raw$user_country
                                    , split = "\\(|\\) "
                                    , fixed = FALSE 
)
# second, keep only the third elements from each list, then unlist to obtain character vector
raw$user_country <- 
  purrr::map( .x = raw$user_country, .f = 3 ) %>%
  unlist()
# third, split by the dash marks, either one or two
raw$user_country <- strsplit( raw$user_country
                          , split = "\\-|\\- \\-"
                          , fixed = FALSE
                          )
# fourth, keep only the second elements from each list, then unlist to obtain character vector
raw$user_country <-
  purrr::map( .x = raw$user_country, .f = 2 ) %>%
  unlist()
# fifth, clear leading and trailing white space
raw$user_country <- trimws( x = raw$user_country
                        , which = "both"
                        )
# sixth, separate states marked by the appearance of a comma
raw$user_country <- strsplit( x = raw$user_country
                         , split = ","
                         , fixed = TRUE
                         ) 
# seventh, make two vectors: 
# one for the first element (which may or not be the state within a country)
maybe.country <- 
  purrr::map( .x = raw$user_country, .f = 1 ) %>%
  unlist()
# one for the second element (which will always be the country)
# note: need to convert NULL elements into NA; recent purrr releases call
#       this argument .default (older releases used .null)
definitely.country <-
  purrr::map( .x = raw$user_country, .f = 2, .default = NA ) %>%
  unlist()

# eighth, replace the indices within maybe.country 
#         whose indices in definitely.country are non-NA values
#         with those non-NA values from definitely.country.
# note: this is possible due to the indices within both 
#       maybe.country and definitely.country to be exact equivalents. 
#       (i.e. the 8th element in maybe.country will always align
#        with the 8th element in definitely.country )
maybe.country[
  which( !is.na( definitely.country ) )
  ] <- definitely.country[
    which( !is.na( definitely.country )  )
  ]

# ninth, assign the character vector maybe.country to raw$user_country
raw$user_country <- maybe.country

# tenth, remove all leading and trailing white space
raw$user_country <- trimws( x = raw$user_country
                        , which = "both"
                        )
# eleventh, if the number of letters (length) of any element is more than 3, 
# change the spelling to Title Case. 
# note: This logic comes from the maps::iso3166 data frame, which holds the
#       two- and three-letter ISO 3166 country codes alongside country names.
raw$user_country <- ifelse( test = nchar( raw$user_country ) == 2 |
                          nchar( raw$user_country ) == 3
                        , yes = raw$user_country
                        , no = stringr::str_to_title( string = raw$user_country ) 
                        )
# twelfth, check to make sure that all characters are either
# 2 character, 3 character, ISO country codes/names,
# shorter name used in the `maps` package, or the sovereign country
# by ensuring the length of the elements who meet this criteria
# is equal to the length of raw$user_country
length(
  which( raw$user_country %in%  maps::iso3166$a2 |
         raw$user_country %in% maps::iso3166$a3 |
         raw$user_country %in% maps::iso3166$ISOname |
         raw$user_country %in% maps::iso3166$mapname |
         raw$user_country %in% maps::iso3166$sovereignty
       )
) == length( raw$user_country ) # [1] TRUE

Identifying raw$user_review_date

Assuming the review date is always the last piece of text in each string, extract the data for this column as follows.

raw$user_review_date <- strsplit( x = raw$user_review_date
                                  , split = "\\-\\s"
                                  , fixed = FALSE
                                  )

# keep only the last element from each list, 
# unlist to obtain a character vector,
# standardize the dates 
# note: assumes no NAs will appear for date
raw$user_review_date <- 
  purrr::map( .x = raw$user_review_date, .f = BBmisc::getLast ) %>%
  unlist() %>%
  as.Date( format = "%b %d, %Y" )
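Parsing month names with "%b" depends on the current locale, so the call above may return NA on a system that is not set to English. A locale-independent variant could look like the sketch below, which uses base R only and maps the month through the built-in month.abb constant. The vector name `reviews` and the assumption that every date reads "MONTH D, YYYY" with an English month name are mine.

```r
reviews <- c("Anna (1025) - North Carolina, USA - DEC 20, 2017",
             "James (10) - - MEXICO - NOV 22, 2017",
             "Susane (222) - Oulu, FINLAND - JUNE 1, 2016",
             "Alex (20000) - SOUTH KOREA- MAR 11, 2015")

# The greedy ".*" runs up to the last dash, leaving only the date text.
date_text <- trimws(sub("^.*-", "", reviews))

# Capture month name, day and year; each list element holds the full
# match followed by the three captured groups.
parts <- regmatches(
  date_text,
  regexec("^([A-Za-z]+) +([0-9]+), +([0-9]{4})$", date_text)
)

# Look the month up by its first three letters in the English month.abb.
month <- vapply(parts, function(p) {
  match(toupper(substr(p[2], 1, 3)), toupper(month.abb))
}, integer(1))
day  <- vapply(parts, function(p) as.integer(p[3]), integer(1))
year <- vapply(parts, function(p) as.integer(p[4]), integer(1))

review_date <- as.Date(sprintf("%04d-%02d-%02d", year, month, day))
review_date
```

Taking only the first three letters lets both abbreviated (DEC) and full (JUNE) month names resolve to the same month number.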

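Answer 2: (score: 0)

If

  • country names are always written in upper-case letters, and
  • the country is the first word to appear in all upper-case letters, i.e. the user name is never in all upper case and the month field comes after the country field,

then we can use the following regular expression to extract the country name:

"[[:upper:]]{2,}[.]?(\\s[[:upper:]]{2,}[.]?)*"

This also works for country names that consist of several parts or that use a dot to mark an abbreviation: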

raw <- c("Anna (1025) - North Carolina, USA - DEC 20, 2017", 
         "James (10) - - MEXICO - NOV 22, 2017", 
         "Susane (222) - Oulu, FINLAND - JUNE 1, 2016", 
         "Alex (20000) - SOUTH KOREA- MAR 11, 2015", 
         "Peter (4711) - KINGDOM OF SOUTH NEVERLAND - DEC 24, 2016", 
         "Paul (0815) - REP. OF NORTH NEVERLAND - DEC 31, 2016")
stringr::str_extract(raw, "[[:upper:]]{2,}[.]?(\\s[[:upper:]]{2,}[.]?)*")
[1] "USA"                        "MEXICO"                     "FINLAND"                   
[4] "SOUTH KOREA"                "KINGDOM OF SOUTH NEVERLAND" "REP. OF NORTH NEVERLAND"

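Explanation

"[[:upper:]]{2,}[.]?(\\s[[:upper:]]{2,}[.]?)*"

looks for a sequence of two or more upper-case letters, optionally followed by a dot. This captures country names that consist of a single word.

To also capture country names made up of several words, the expression in parentheses looks for any number of subsequences, each consisting of a whitespace character followed by another upper-case word with an optional dot.

Note that stringr::str_extract() is used deliberately: it extracts only the first match, which avoids capturing the month name.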
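To illustrate why the first match matters, the sketch below compares the first match with all matches on one of the sample strings. It uses base R's regexpr() and gregexpr() as stand-ins for stringr::str_extract() and stringr::str_extract_all(); the variable names are chosen for illustration.

```r
pattern <- "[[:upper:]]{2,}[.]?(\\s[[:upper:]]{2,}[.]?)*"
x <- "Alex (20000) - SOUTH KOREA- MAR 11, 2015"

# First match only: the country name.
first_match <- regmatches(x, regexpr(pattern, x))

# All matches: the country name and, after it, the month.
all_matches <- regmatches(x, gregexpr(pattern, x))[[1]]

first_match
all_matches
```

Here the first match is the country, while the full set of matches additionally picks up MAR, which is exactly what restricting the extraction to the first match avoids.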

Answer 3: (score: 0)

My solution builds on Santosh's solution above, but gets around the multi-word country problem by searching for each country separately.

  #remove punctuation
  raw2 <- gsub("[[:punct:]\n]","",raw)
  #get the list of countries we're searching for
  countries = sort(unique(tolower(world.cities$country.etc)))
  #this will be the discovery matrix
  raw3 <- matrix(0,nrow=length(raw),ncol=length(countries))
  colnames(raw3) = countries
  #search for each country by itself
  for(i in countries){
    ind = grep(i,tolower(raw2))
    raw3[ind,i] = 1
  }
  #result is an nxk matrix, where n is the number of obs in raw
  #and k is the number of countries (239 in my test)
  raw3
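The indicator matrix can be collapsed into one country per string. The sketch below does that with a small hypothetical `countries` vector in place of the list from world.cities, and it adds word boundaries to each search, because a plain substring search for "usa" also hits "susane" in the third string.

```r
raw <- c("Anna (1025) - North Carolina, USA - DEC 20, 2017",
         "James (10) - - MEXICO - NOV 22, 2017",
         "Susane (222) - Oulu, FINLAND - JUNE 1, 2016",
         "Alex (20000) - SOUTH KOREA- MAR 11, 2015")

# Hypothetical stand-in for the lower-cased country list.
countries <- c("finland", "mexico", "south korea", "usa")

# Same cleaning as above: strip punctuation, compare in lower case.
cleaned <- tolower(gsub("[[:punct:]\n]", "", raw))

# Logical matrix: one row per string, one column per country.
# \\b keeps "usa" from matching inside "susane".
hits <- sapply(countries, function(cn) {
  grepl(paste0("\\b", cn, "\\b"), cleaned, perl = TRUE)
})

# Collapse each row into the matched name(s); NA when nothing matched.
found <- apply(hits, 1, function(row) paste(countries[row], collapse = "; "))
found[found == ""] <- NA_character_
found
```

Joining with "; " keeps every hit visible when a string mentions more than one country, so ambiguous rows can be reviewed by hand.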