In R

Time: 2019-11-14 21:50:28

Tags: r string stringr

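I am scraping information from Wikipedia about Canadian Forward Sortation Areas (FSAs, the first three characters of a Canadian postal code) and the cities/areas they belong to. A sample of this information: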

library(rvest)
library(tidyverse)

URL <- paste0("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_", "K")

FSAs <- URL %>% 
  read_html() %>% 
  html_nodes(xpath = "//td") %>% 
  html_text()

head(FSAs)
[1] "K1AGovernment of CanadaOttawa and Gatineau offices (partly in QC)\n"            "K2AOttawa(Highland Park / McKellar Park /Westboro /Glabar Park /Carlingwood)\n"
[3] "K4AOttawa(Fallingbrook)\n"                                                      "K6AHawkesbury\n"                                                               
[5] "K7ASmiths Falls\n"                                                              "K8APembrokeCentral and northern subdivisions\n"     

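The problem I face is that I want a data frame in which the first three characters of each string go in one column and the rest of the information goes in another. I thought there would be a solution involving something like stringr's str_split(), but that consumes the pattern matching the first three characters, which I certainly don't want. In effect, I want to split each string between its third and fourth characters.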
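A minimal sketch of that behavior, using one sample string taken from the output above (the names `x` and `pieces` are only for illustration): splitting on a pattern that matches the first three characters treats them as a delimiter, so they are discarded and an empty string is left in their place.

```r
library(stringr)

# One sample string from the scraped output above
x <- "K6AHawkesbury\n"

# Split on a regex matching the first three characters.
# The matched text acts as the delimiter and is dropped.
pieces <- str_split(x, "^...")[[1]]

pieces[1]  # "" (the FSA "K6A" is lost)
pieces[2]  # "Hawkesbury\n"
```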

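I have come up with the solution below, the last part of which is borrowed from this answer, but it is really convoluted. My question is: is there a better way?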

FSAs %>% 
  enframe(name = NULL) %>%
  separate(value, c(NA, "Location"), sep = "^...", remove = FALSE) %>% 
  separate(value, c("FSA", NA), sep = "(?<=\\G...)")

# A tibble: 195 x 2
   FSA   Location                                                                     
   <chr> <chr>                                                                        
 1 K1A   "Government of CanadaOttawa and Gatineau offices (partly in QC)\n"           
 2 K2A   "Ottawa(Highland Park / McKellar Park /Westboro /Glabar Park /Carlingwood)\n"
 3 K4A   "Ottawa(Fallingbrook)\n"                                                     
 4 K6A   "Hawkesbury\n"                                                               
 5 K7A   "Smiths Falls\n"                                                             
 6 K8A   "PembrokeCentral and northern subdivisions\n"                                
 7 K9A   "Cobourg\n"                                                                  
 8 K1B   "Ottawa(Blackburn Hamlet / Pine View / Sheffield Glen)\n"                    
 9 K2B   "Ottawa(Britannia /Whitehaven / Bayshore / Pinecrest)\n"                     
10 K4B   "Ottawa(Navan)\n" 
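
One possibly simpler approach, offered as a sketch rather than a definitive answer: `separate()` from tidyr also accepts a numeric `sep`, which it interprets as a character position to split at, so no regular expression is needed. The vector `FSAs_sample` below is a hypothetical stand-in for the scraped `FSAs`, built from three of the strings shown above; `by_position` and `by_substring` are illustrative names.

```r
library(tidyverse)

# Hypothetical stand-in for the scraped FSAs vector
FSAs_sample <- c("K6AHawkesbury\n", "K7ASmiths Falls\n", "K4AOttawa(Fallingbrook)\n")

# Split each string after its third character (position-based, no regex)
by_position <- FSAs_sample %>%
  enframe(name = NULL) %>%
  separate(value, into = c("FSA", "Location"), sep = 3)

# Alternative: take substrings directly with stringr
by_substring <- tibble(
  FSA      = str_sub(FSAs_sample, 1, 3),  # characters 1 to 3
  Location = str_sub(FSAs_sample, 4)      # character 4 to the end
)
```

Both versions keep the trailing newline in `Location`; if it is unwanted, applying `str_trim()` to that column should remove it.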

0 answers:

No answers yet