Can't understand the logic of the regex functions in R used for word matching

Time: 2018-01-31 16:22:47

Tags: r regex

Suppose I have a long character string that contains city names and state names, among other things.

test<-"Ucsd Medical Center, San Diego, California, USA|Yale Cancer Center, New Haven, Connecticut, USA|Massachusetts General Hospital., Boston, Massachusetts, USA|Dana Farber Cancer Institute, Boston, Massachusetts, USA|Washington University, Saint Louis, Missouri, USA|Mount SInai Medical Center, New York, New York, USA|Memorial Sloan Kettering Cancer Center, New York, New York, USA|Carolinas Healthcare System, Charlotte, North Carolina, USA|University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA|Vanderbilt University Medical Center, Nashville, Tennessee, USA|Seattle Cancer Care Alliance, Seattle, Washington, USA|National Cancer Center, Gyeonggi-do, Korea, Republic of|Seoul National University Hospital, Seoul, Korea, Republic of|Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of|Korea University Guro Hospital, Seoul, Korea, Republic of|Asan Medical Center., Seoul, Korea, Republic of|VU MEDISCH CENTRUM; Dept. of Medical Oncology"

My goal is to extract all the city names from it. With some help, I managed to do that with:

pat="(,.\\w+,)|(,.\\w+.\\w+,)"
gsub("(,\\s)|,","",regmatches(m<-strsplit(test,"\\|")[[1]],regexpr(pat,m)))

The problem now is that I want to do the same for the states, but I can't fully understand the logic of the code above. Any help?
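For readers following along, the one-liner can be unpacked into separate steps. The sketch below is only an illustration: `sample_txt`, `locations`, `hits` and `cities` are names introduced here, and the input is a two-entry excerpt of `test` so that the block runs on its own.

```r
# Two-entry excerpt of `test`, so the sketch is self-contained.
sample_txt <- "Ucsd Medical Center, San Diego, California, USA|Yale Cancer Center, New Haven, Connecticut, USA"

# Step 1: split the long string into one element per location at each "|".
locations <- strsplit(sample_txt, "\\|")[[1]]

# Step 2: the pattern matches ", word," or ", word word,".
# Each "." stands for any single character (here, the space after the
# comma or the space between two words).
pat <- "(,.\\w+,)|(,.\\w+.\\w+,)"

# Step 3: regexpr() locates the first match in each element and
# regmatches() extracts it, e.g. ", San Diego,".
hits <- regmatches(locations, regexpr(pat, locations))

# Step 4: gsub() deletes every ", " and any remaining "," so that only
# the bare name is left.
cities <- gsub("(,\\s)|,", "", hits)
print(cities)
```

Because `regexpr()` returns only the first match per element, the code picks up the city (the first short field between two commas) and never reaches the state, which is why the same pattern cannot simply be reused for states.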

1 Answer:

Answer 0: (score: 1)

You can use str_extract_all from stringr:

library(stringr)
str_extract_all(test, "(?<=,\\s)[\\w\\s]+(?=,[\\w\\s]+(\\||$))")

Result:

[[1]]
 [1] "California"     "Connecticut"    "Massachusetts"  "Massachusetts"  "Missouri"       "New York"      
 [7] "New York"       "North Carolina" "Ohio"           "Tennessee"      "Washington"     "Korea"         
[13] "Korea"          "Korea"          "Korea"          "Korea"  
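If adding a package is not an option, the same lookaround pattern should also work in base R, provided `perl = TRUE` is set so that lookbehind and lookahead are supported. This is a minimal sketch rather than part of the original answer; `sample_txt` and `states` are names introduced here, and the input is a two-entry excerpt of `test`.

```r
# Two-entry excerpt of `test`, so the sketch runs on its own.
sample_txt <- "Washington University, Saint Louis, Missouri, USA|Asan Medical Center., Seoul, Korea, Republic of"

# Same lookaround pattern as in the stringr call.
pat <- "(?<=,\\s)[\\w\\s]+(?=,[\\w\\s]+(\\||$))"

# gregexpr() finds every match (perl = TRUE enables lookarounds);
# regmatches() returns a list with one element per input string.
states <- regmatches(sample_txt, gregexpr(pat, sample_txt, perl = TRUE))[[1]]
print(states)
```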

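Notes:

  1. [\\w\\s]+ matches one or more word characters or whitespace characters.

  2. (?<=,\\s) is a positive lookbehind that requires a comma followed by a whitespace character immediately before the match.

  3. (?=,[\\w\\s]+(\\||$)) is a positive lookahead that requires the match to be followed by a comma, then one or more word or whitespace characters, then either | or the end of the string.

  4. Taken together, the pattern matches a run of word or whitespace characters only when it is preceded by a comma and a space and followed by a comma, one more run of word or whitespace characters, and then | or the end of the string. In effect, this picks out the second-to-last comma-separated element of each location.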

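  5. Another approach is to nest two str_split calls: split on | first, then use sapply to apply str_split to each piece and split a second time on ", ". As written this uses stringr's str_split; base R's strsplit takes the same arguments here, so the approach also works without any package. It assumes that the state is always the third element of each location:

    unname(sapply(unlist(str_split(test, "\\|")), 
                  function(x) unlist(str_split(x, ", "))[3]))
    

    Result: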

     [1] "California"     "Connecticut"    "Massachusetts"  "Massachusetts"  "Missouri"       "New York"      
     [7] "New York"       "North Carolina" "Ohio"           "Tennessee"      "Washington"     "Korea"         
    [13] "Korea"          "Seoul"          "Korea"          "Korea"          NA 
    

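    Note that the last element is NA because that entry contains no commas and therefore has no third element. The 14th element is "Seoul" rather than "Korea" because the name "Severance Hospital, Yonsei University Health System" itself contains a comma, which pushes the city into third position.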
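One way to make the split-based approach less sensitive to extra commas in the institution name is to count from the end of each location instead of from the start, which mirrors what the lookaround pattern does. The sketch below is an assumption-laden illustration, not the answerer's code: `second_to_last` is a hypothetical helper, and `sample_txt` is a three-entry excerpt of `test`.

```r
# Three-entry excerpt of `test`: a regular entry, one whose name contains
# an extra comma, and one with no commas at all.
sample_txt <- "Yale Cancer Center, New Haven, Connecticut, USA|Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of|VU MEDISCH CENTRUM; Dept. of Medical Oncology"

# Hypothetical helper: return the second-to-last ", "-separated field,
# or NA when the location has fewer than three fields.
second_to_last <- function(x) {
  parts <- strsplit(x, ", ", fixed = TRUE)[[1]]
  n <- length(parts)
  if (n < 3) NA_character_ else parts[n - 1]
}

# fixed = TRUE treats "|" literally, so no escaping is needed.
locations <- strsplit(sample_txt, "|", fixed = TRUE)[[1]]
states <- unname(vapply(locations, second_to_last, character(1)))
print(states)
```

With this variant the Severance Hospital entry yields "Korea", and the entry without commas still yields NA, so the output keeps one element per location.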