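I am scraping information from Wikipedia about Canadian Forward Sortation Areas (FSAs, the first three characters of a Canadian postal code) and the cities/areas they belong to. A sample of this information looks like this: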
library(rvest)
library(tidyverse)
URL <- paste0("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_", "K")
FSAs <- URL %>%
  read_html() %>%
  html_nodes(xpath = "//td") %>%
  html_text()
head(FSAs)
[1] "K1AGovernment of CanadaOttawa and Gatineau offices (partly in QC)\n" "K2AOttawa(Highland Park / McKellar Park /Westboro /Glabar Park /Carlingwood)\n"
[3] "K4AOttawa(Fallingbrook)\n" "K6AHawkesbury\n"
[5] "K7ASmiths Falls\n" "K8APembrokeCentral and northern subdivisions\n"
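The problem I'm facing is that I'd like a data frame with the first three characters of each string in one column and the rest of the information in another. I thought there would be a solution involving a stringr function such as str_split(), but that removes the pattern matching the first three characters, which is of course not what I want. Essentially, I want to split each string between its 3rd and 4th characters.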
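To make the problem concrete, here is a small sketch. The vector `x` is made up for illustration and only mimics the shape of the scraped strings. It shows how a regex split discards the matched characters, whereas taking substrings by position keeps both pieces.

```r
library(stringr)

# Hypothetical sample values, shaped like the scraped strings
x <- c("K6AHawkesbury", "K7ASmiths Falls")

# Splitting on a regex drops whatever the regex matched,
# so the first piece is empty and the FSA is lost
dropped <- str_split(x, "^...")

# Taking substrings by position keeps both pieces
fsa <- substr(x, 1, 3)       # characters 1 to 3
location <- substring(x, 4)  # character 4 to the end
```

Here `dropped[[1]]` holds an empty string followed by the location, while `fsa` and `location` hold the two halves I actually want.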
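I have come up with the solution below, the last bit of which is borrowed from this answer, but it is really convoluted. My question is: is there a better way to do this?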
FSAs %>%
  enframe(name = NULL) %>%
  separate(value, c(NA, "Location"), sep = "^...", remove = FALSE) %>%
  separate(value, c("FSA", NA), sep = "(?<=\\G...)")
# A tibble: 195 x 2
FSA Location
<chr> <chr>
1 K1A "Government of CanadaOttawa and Gatineau offices (partly in QC)\n"
2 K2A "Ottawa(Highland Park / McKellar Park /Westboro /Glabar Park /Carlingwood)\n"
3 K4A "Ottawa(Fallingbrook)\n"
4 K6A "Hawkesbury\n"
5 K7A "Smiths Falls\n"
6 K8A "PembrokeCentral and northern subdivisions\n"
7 K9A "Cobourg\n"
8 K1B "Ottawa(Blackburn Hamlet / Pine View / Sheffield Glen)\n"
9 K2B "Ottawa(Britannia /Whitehaven / Bayshore / Pinecrest)\n"
10 K4B "Ottawa(Navan)\n"
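One possibly simpler route, sketched here as an assumption rather than a confirmed answer: `tidyr::separate()` also accepts a numeric `sep`, which is interpreted as a character position to split at. The vector `FSAs_sample` is a hypothetical stand-in for the scraped `FSAs`, using three values copied from the output above.

```r
library(tidyverse)

# Hypothetical stand-in for the scraped vector (values taken from the output above)
FSAs_sample <- c("K4AOttawa(Fallingbrook)\n", "K6AHawkesbury\n", "K7ASmiths Falls\n")

result <- FSAs_sample %>%
  enframe(name = NULL) %>%
  # A numeric sep splits by position: characters 1-3 go to FSA, the rest to Location
  separate(value, c("FSA", "Location"), sep = 3)
```

This does the split in a single `separate()` call and needs no regular expression. The trailing "\n" is left in `Location`, as in the original output.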