Extracting city names from a large text using R

Time: 2018-01-26 02:41:26

Tags: r extract

Hello, I have an interesting problem here. Suppose I have a long string that contains city names among other things.

test<-"Ucsd Medical Center, San Diego, California, USA|Yale Cancer Center, New Haven, Connecticut, USA|Massachusetts General Hospital., Boston, Massachusetts, USA|Dana Farber Cancer Institute, Boston, Massachusetts, USA|Washington University, Saint Louis, Missouri, USA|Mount SInai Medical Center, New York, New York, USA|Memorial Sloan Kettering Cancer Center, New York, New York, USA|Carolinas Healthcare System, Charlotte, North Carolina, USA|University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA|Vanderbilt University Medical Center, Nashville, Tennessee, USA|Seattle Cancer Care Alliance, Seattle, Washington, USA|National Cancer Center, Gyeonggi-do, Korea, Republic of|Seoul National University Hospital, Seoul, Korea, Republic of|Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of|Korea University Guro Hospital, Seoul, Korea, Republic of|Asan Medical Center., Seoul, Korea, Republic of|VU MEDISCH CENTRUM; Dept. of Medical Oncology"

My goal is to extract all of the city names from it. I did this with the following five steps.

   library(stringr)  # provides str_replace_all
   library(maps)     # provides the world.cities dataset

   #replace | with ,
   test2<-str_replace_all(test, "[|]", ", ")

   # Remove punctuation from data
   test3<-gsub("[[:punct:]\n]","",test2)

   # Split data at word boundaries
   test4 <- strsplit(test3, " ")

   # Load data from package maps
   data(world.cities)

   # Match on cities in world.cities
   citiestest<-lapply(test4, function(x)x[which(x %in% world.cities$name)])

The result looks roughly right:

citiestest
[[1]]
 [1] "San"        "Boston"     "Boston"     "Washington" "York"      
 [6] "York"       "Kettering"  "York"       "York"       "Charlotte" 
[11] "Carolina"   "Cleveland"  "Nashville"  "Seattle"    "Seattle"   
[16] "Washington" "Asan"      

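But as you can see, I cannot handle cities whose names have two words (New York, San Diego, etc.), because they get split apart. Fixing this by hand is of course not an option, because my real dataset is very large.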

6 answers:

Answer 0: (score: 2)

Here is a base R option using strsplit and gsub:

terms <- unlist(strsplit(test, "\\s*\\|\\s*"))
cities <- sapply(terms, function(x) gsub("[^,]+,\\s*([^,]+),.*", "\\1", x))
cities[1:3]

            Ucsd Medical Center, San Diego, California, USA 
                                                "San Diego" 
            Yale Cancer Center, New Haven, Connecticut, USA 
                                                "New Haven" 
Massachusetts General Hospital., Boston, Massachusetts, USA
                                                   "Boston"

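One thing to watch with the pattern above: gsub returns its input unchanged when nothing matches, so the last entry (which has no commas) would come back whole, and an entry with an extra comma-separated field before the city (the Severance Hospital row) would yield "Yonsei University Health System" rather than "Seoul". A minimal sketch of an alternative follows. It assumes, for this sample only, that the city is always the third comma-separated field from the end ("City, State, USA" or "City, Korea, Republic of"). The helper name city_from_end is hypothetical.

```r
# Shortened sample in the same "a|b|c" format as `test` in the question.
sample_text <- paste(
  "Ucsd Medical Center, San Diego, California, USA",
  "Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of",
  "VU MEDISCH CENTRUM; Dept. of Medical Oncology",
  sep = "|"
)

# Hypothetical helper: take the third field from the end, or NA when the
# entry has fewer than four comma-separated fields.
# ASSUMPTION: the city sits at that position, which holds for this sample
# but is not guaranteed for other address formats.
city_from_end <- function(entry) {
  fields <- trimws(strsplit(entry, ",", fixed = TRUE)[[1]])
  n <- length(fields)
  if (n < 4) NA_character_ else fields[n - 2]
}

entries <- strsplit(sample_text, "|", fixed = TRUE)[[1]]
cities <- vapply(entries, city_from_end, character(1), USE.NAMES = FALSE)
print(cities)
```

Counting from the end keeps the result aligned with the input, one value per entry, and makes entries without a recognizable address show up as NA instead of as the full string.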

Answer 1: (score: 2)

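A rather different approach, which may be more or less useful depending on the data at hand: pass each address to a geocoding API, then pull the city out of the response.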

library(tidyverse)

places <- data_frame(string = "Ucsd Medical Center, San Diego, California, USA|Yale Cancer Center, New Haven, Connecticut, USA|Massachusetts General Hospital., Boston, Massachusetts, USA|Dana Farber Cancer Institute, Boston, Massachusetts, USA|Washington University, Saint Louis, Missouri, USA|Mount SInai Medical Center, New York, New York, USA|Memorial Sloan Kettering Cancer Center, New York, New York, USA|Carolinas Healthcare System, Charlotte, North Carolina, USA|University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA|Vanderbilt University Medical Center, Nashville, Tennessee, USA|Seattle Cancer Care Alliance, Seattle, Washington, USA|National Cancer Center, Gyeonggi-do, Korea, Republic of|Seoul National University Hospital, Seoul, Korea, Republic of|Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of|Korea University Guro Hospital, Seoul, Korea, Republic of|Asan Medical Center., Seoul, Korea, Republic of|VU MEDISCH CENTRUM; Dept. of Medical Oncology") %>% 
    separate_rows(string, sep = '\\|')

places <- places %>% 
    mutate(geodata = map(string, ~{Sys.sleep(1); ggmap::geocode(.x, output = 'all')}))

places <- places %>% 
    mutate(address_components = map(geodata, list('results', 1, 'address_components')),
           address_components = map(address_components, 
                                    ~as_data_frame(transpose(.x)) %>% 
                                        unnest(long_name, short_name)),
           city = map(address_components, unnest),
           city = map_chr(city, ~{
               l <- set_names(.x$long_name, .x$types); 
               coalesce(l['locality'], l['administrative_area_level_1'])
           }))

Comparing the results with the original strings:

places %>% select(city, string)
#> # A tibble: 17 x 2
#>    city       string                                                                               
#>    <chr>      <chr>                                                                                
#>  1 San Diego  Ucsd Medical Center, San Diego, California, USA                                      
#>  2 New Haven  Yale Cancer Center, New Haven, Connecticut, USA                                      
#>  3 Boston     Massachusetts General Hospital., Boston, Massachusetts, USA                          
#>  4 Boston     Dana Farber Cancer Institute, Boston, Massachusetts, USA                             
#>  5 St. Louis  Washington University, Saint Louis, Missouri, USA                                    
#>  6 New York   Mount SInai Medical Center, New York, New York, USA                                  
#>  7 New York   Memorial Sloan Kettering Cancer Center, New York, New York, USA                      
#>  8 Charlotte  Carolinas Healthcare System, Charlotte, North Carolina, USA                          
#>  9 Cleveland  University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA
#> 10 Nashville  Vanderbilt University Medical Center, Nashville, Tennessee, USA                      
#> 11 Seattle    Seattle Cancer Care Alliance, Seattle, Washington, USA                               
#> 12 Goyang-si  National Cancer Center, Gyeonggi-do, Korea, Republic of                              
#> 13 서울특별시 Seoul National University Hospital, Seoul, Korea, Republic of                        
#> 14 Seoul      Severance Hospital, Yonsei University Health System, Seoul, Korea,  Republic of       
#> 15 Seoul      Korea University Guro Hospital, Seoul, Korea, Republic of                            
#> 16 Seoul      Asan Medical Center., Seoul, Korea, Republic of                                      
#> 17 Amsterdam  VU MEDISCH CENTRUM; Dept. of Medical Oncology   

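...well, it's not perfect. The biggest problem is that for US cities the city is classified as a locality, whereas for Korea the city is classified as administrative_area_level_1 (which in the US is the state). Unlike the other Korean rows, row 12 actually has a locality, but it is not the city listed (the listed one appears in the response as an administrative area). In addition, "Seoul" in row 13 has inexplicably been translated into Korean.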

The good news is that "Saint Louis" has been shortened to "St. Louis", which is a more standardized form, and the last row has been located in Amsterdam.

Scaling this approach up may require paying Google for use of its API.
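The fallback from locality to administrative_area_level_1 used in the pipeline above can be sketched in base R without calling the API. The components data frame below is a hand-written stand-in, not a real geocoding response, and pick_city is a hypothetical helper name.

```r
# Hand-written stand-in for the address components of one geocoding result.
# ASSUMPTION: one row per component, with its long name and a single type.
components <- data.frame(
  long_name = c("Seoul", "South Korea"),
  type = c("administrative_area_level_1", "country"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: prefer the locality, fall back to the first-level
# administrative area, and return NA when neither is present.
pick_city <- function(components) {
  by_type <- setNames(components$long_name, components$type)
  for (level in c("locality", "administrative_area_level_1")) {
    if (level %in% names(by_type)) {
      return(unname(by_type[[level]]))
    }
  }
  NA_character_
}

print(pick_city(components))
```

Making the order of preference explicit in one place makes it easier to adjust per country, which seems to be where the results above go wrong.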

Answer 2: (score: 1)

What I would do:

test2 <- str_replace_all(test, "[|]", ", ") #Same as you did

test3 <- unlist(strsplit(test2, split=", ")) #Turns string into a vector

check <- test3 %in% world.cities$name #Check if element vectors match list of city names

test3[check == TRUE] #Select vector elements that match list of city names

 [1] "San Diego"   "New Haven"   "Boston"      "Boston"      "Saint Louis" "New York"    "New York"    "New York"   
 [9] "New York"    "Charlotte"   "Cleveland"   "Nashville"   "Seattle"     "Washington" 
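Note that the output above contains "New York" four times for two entries and a stray "Washington", because state names that are also city names match the lookup too. A minimal sketch of one way around this is to split on "|" first and keep only the first matching field per entry. The known_cities vector is a small hypothetical stand-in for world.cities$name, and first_known_city is a hypothetical helper name.

```r
# Hypothetical stand-in for world.cities$name from the maps package.
known_cities <- c("San Diego", "New York", "Seattle", "Washington")

entries <- c(
  "Ucsd Medical Center, San Diego, California, USA",
  "Mount SInai Medical Center, New York, New York, USA",
  "Seattle Cancer Care Alliance, Seattle, Washington, USA"
)

# Hypothetical helper: return the first comma-separated field found in the
# lookup, or NA when no field matches.
# ASSUMPTION: the city appears before the state within each entry.
first_known_city <- function(entry, lookup) {
  fields <- strsplit(entry, ", ", fixed = TRUE)[[1]]
  hits <- fields[fields %in% lookup]
  if (length(hits) == 0) NA_character_ else hits[1]
}

cities <- vapply(entries, first_known_city, character(1),
                 lookup = known_cities, USE.NAMES = FALSE)
print(cities)
```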

Answer 3: (score: 1)

Another way, without a loop:

pat="(,.\\w+,)|(,.\\w+.\\w+,)"
gsub("(,\\s)|,","",regmatches(m<-strsplit(test,"\\|")[[1]],regexpr(pat,m)))

[1] "San Diego"   "New Haven"   "Boston"      "Boston"      "Saint Louis" "New York"    "New York"   
[8] "Charlotte"   "Cleveland"   "Nashville"   "Seattle"     "Gyeonggi-do" "Seoul"       "Seoul"      
[15] "Seoul"       "Seoul"    

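The other results given on this page do fail: for example, there is a town named Gyeonggi-do, which the other solutions do not return. Some of the code also returns an entire string as the town.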
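One caveat with the regmatches call above: it drops entries that have no match, which is why 17 input entries give 16 results and the output can no longer be lined up with the input. A minimal sketch that keeps one result per entry, reusing the same pattern and filling NA where nothing matches:

```r
entries <- c(
  "Yale Cancer Center, New Haven, Connecticut, USA",
  "VU MEDISCH CENTRUM; Dept. of Medical Oncology"
)

# Same pattern as in the answer: ", word," or ", word<any char>word,".
pat <- "(,.\\w+,)|(,.\\w+.\\w+,)"

# regexpr returns -1 for entries without a match.
m <- regexpr(pat, entries)

# Start from all-NA and fill in only the positions that matched.
cities <- rep(NA_character_, length(entries))
cities[m != -1] <- gsub("(,\\s)|,", "", regmatches(entries, m))
print(cities)
```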

Answer 4: (score: 1)

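To expand on @hrbrmstr's comment above, you can use the Stanford CoreNLP library to perform named entity recognition (NER) on each string. A big caveat for such an undertaking is that most NER annotators only tag a token as a "location" or equivalent, which is not very useful when cities are mixed in with states and countries. Beyond the usual NER annotator, though, CoreNLP does include an additional regex NER annotator that can raise the NER granularity to the city level.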

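In R, you can use the coreNLP package to run the annotators. It requires rJava, which can be hard to configure in some cases. You will also need to download the actual (very large) library, which can be done with coreNLP::downloadCoreNLP, and, if you like, set the CORENLP_HOME environment variable in ~/.Renviron to the installation path.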

Also note that this approach is fairly slow and resource-intensive, since it does a lot of work in Java.

library(tidyverse)
library(coreNLP)

# set which annotators to use
writeLines('annotators = tokenize, ssplit, pos, lemma, ner, regexner\n', 'corenlp.properties')
initCoreNLP(libLoc = Sys.getenv('CORENLP_HOME'), parameterFile = 'corenlp.properties')
unlink('corenlp.properties')    # clean up

places <- data_frame(string = "Ucsd Medical Center, San Diego, California, USA|Yale Cancer Center, New Haven, Connecticut, USA|Massachusetts General Hospital., Boston, Massachusetts, USA|Dana Farber Cancer Institute, Boston, Massachusetts, USA|Washington University, Saint Louis, Missouri, USA|Mount SInai Medical Center, New York, New York, USA|Memorial Sloan Kettering Cancer Center, New York, New York, USA|Carolinas Healthcare System, Charlotte, North Carolina, USA|University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA|Vanderbilt University Medical Center, Nashville, Tennessee, USA|Seattle Cancer Care Alliance, Seattle, Washington, USA|National Cancer Center, Gyeonggi-do, Korea, Republic of|Seoul National University Hospital, Seoul, Korea, Republic of|Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of|Korea University Guro Hospital, Seoul, Korea, Republic of|Asan Medical Center., Seoul, Korea, Republic of|VU MEDISCH CENTRUM; Dept. of Medical Oncology") %>% 
    separate_rows(string, sep = '\\|')    # separate strings

places_ner <- places %>% 
    mutate(annotations = map(string, annotateString),
           tokens = map(annotations, 'token'), 
           tokens = map(tokens, group_by, token_id = data.table::rleid(NER)), 
           city = map(tokens, filter, NER == 'CITY'), 
           city = map(city, summarise, city = paste(token, collapse = ' ')), 
           city = map_chr(city, ~if(nrow(.x) == 0) NA_character_ else .x$city))

which returns

places_ner %>% select(city, string)
#> # A tibble: 17 x 2
#>    city      string                                                                               
#>    <chr>     <chr>                                                                                
#>  1 San Diego Ucsd Medical Center, San Diego, California, USA                                      
#>  2 New Haven Yale Cancer Center, New Haven, Connecticut, USA                                      
#>  3 Boston    Massachusetts General Hospital., Boston, Massachusetts, USA                          
#>  4 Boston    Dana Farber Cancer Institute, Boston, Massachusetts, USA                             
#>  5 NA        Washington University, Saint Louis, Missouri, USA                                    
#>  6 NA        Mount SInai Medical Center, New York, New York, USA                                  
#>  7 NA        Memorial Sloan Kettering Cancer Center, New York, New York, USA                      
#>  8 Charlotte Carolinas Healthcare System, Charlotte, North Carolina, USA                          
#>  9 Cleveland University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA
#> 10 Nashville Vanderbilt University Medical Center, Nashville, Tennessee, USA                      
#> 11 Seattle   Seattle Cancer Care Alliance, Seattle, Washington, USA                               
#> 12 NA        National Cancer Center, Gyeonggi-do, Korea, Republic of                              
#> 13 Seoul     Seoul National University Hospital, Seoul, Korea, Republic of                        
#> 14 Seoul     Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of       
#> 15 Seoul     Korea University Guro Hospital, Seoul, Korea, Republic of                            
#> 16 Seoul     Asan Medical Center., Seoul, Korea, Republic of                                      
#> 17 NA        VU MEDISCH CENTRUM; Dept. of Medical Oncology   

Failures:

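  • "New York" is recognized as a state or province both times ("New York City" would be recognized).
  • "Saint Louis" is recognized as a person. "St. Louis" is recognized as a location in my installation, but an online version of the same library recognizes the original as a location, so this may be a version issue.
  • "Gyeonggi-do" is not recognized, but "Seoul" is. I'm not sure what the granularity of the regexner annotator is, but since (as its name suggests) it works by regex, there is a size/familiarity threshold below which it does not contain a regex. You can add your own regex to it, though, if it is worth it.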

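The cleanNLP package also supports Stanford CoreNLP (along with a few other backends) with an easier-to-use interface (setup is still hard), but as far as I can tell it does not currently allow the use of { {1}}, because of the way it initializes CoreNLP.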
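The grouping step in the pipeline above (data.table::rleid on the NER column, then pasting the CITY tokens together) can be sketched in base R with rle. The tokens data frame below is hand-written for illustration and is not actual CoreNLP output. The tags are assumptions modelled on the ones mentioned in this answer.

```r
# Hand-written stand-in for the token table of one annotated string.
# ASSUMPTION: one row per token, with its NER tag in the NER column.
tokens <- data.frame(
  token = c("Ucsd", "Medical", "Center", ",", "San", "Diego", ",", "California"),
  NER = c("ORGANIZATION", "ORGANIZATION", "ORGANIZATION", "O",
          "CITY", "CITY", "O", "STATE_OR_PROVINCE"),
  stringsAsFactors = FALSE
)

# Give every run of identical consecutive tags its own id (what rleid does).
runs <- rle(tokens$NER)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Paste together the tokens of each run tagged CITY.
city_runs <- which(runs$values == "CITY")
cities <- vapply(
  city_runs,
  function(id) paste(tokens$token[run_id == id], collapse = " "),
  character(1)
)
print(cities)
```

Grouping by run rather than by tag keeps two separate cities in the same string from being merged into one name.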

Answer 5: (score: 1)

You can use tidytext to extract bigrams -> words -> intersect to get the common part

if
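The code for this answer is cut off, so the following is only one possible reading of the idea, written in base R rather than tidytext: build bigrams and single words from the cleaned text, then intersect each with a list of city names, giving two-word names priority. The known_cities vector is a hypothetical stand-in for world.cities$name.

```r
text <- "Mount SInai Medical Center, New York, New York, USA"

# Hypothetical stand-in for world.cities$name from the maps package.
known_cities <- c("New York", "York", "Boston")

# Strip punctuation and split into words.
words <- strsplit(gsub("[[:punct:]]", "", text), "\\s+")[[1]]

# Adjacent word pairs. Because punctuation is gone, some pairs span two
# fields (for example "Center New"); they simply fail to match.
bigrams <- paste(head(words, -1), tail(words, -1))

# Two-word city names first.
two_word <- intersect(bigrams, known_cities)

# Then single words, skipping words already used by a two-word match so
# that "York" is not reported alongside "New York".
used <- unique(unlist(strsplit(two_word, " ")))
one_word <- intersect(setdiff(words, used), known_cities)

cities <- c(two_word, one_word)
print(cities)
```

Note that intersect returns unique values, so a city that appears several times in the text is reported once.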