Extracting city and state from a Google street address

Time: 2017-08-16 22:31:45

Tags: r regex ggmap stringr tidyverse

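I have a dataset with latitude/longitude information for different point locations, and I want to know which city and state each point belongs to.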

Following this example, I used the revgeocode function from ggmap to get the street address of each location, which produced the following data frame:

df <- structure(list(PointID = c(1787L, 2805L, 3025L, 3027L, 3028L, 
3029L, 3030L, 3031L, 3033L), Latitude = c(38.36648102, 36.19548585, 
43.419774, 43.437222, 43.454722, 43.452643, 43.411949, 43.255479, 
43.261464), Longitude = c(-76.4802046, -94.21554661, -87.960399, 
-88.018333, -87.974722, -87.978542, -87.94149, -87.986433, -87.968612
), Address = structure(c(2L, 8L, 5L, 3L, 9L, 7L, 4L, 1L, 6L), .Label = c("13004 N Thomas Dr, Mequon, WI 53097, USA", 
"2160 Turner Rd, Lusby, MD 20657, USA", "2805 County Rd Y, Saukville, WI 53080, USA", 
"3701-3739 County Hwy W, Saukville, WI 53080, USA", "3907 Echo Ln, Saukville, WI 53080, USA", 
"4823 W Bonniwell Rd, Mequon, WI 53097, USA", "5100-5260 County Rd I, Saukville, WI 53080, USA", 
"7948 W Gibbs Rd, Springdale, AR 72762, USA", "River Park Rd, Saukville, WI 53080, USA"
), class = "factor")), row.names = c(NA, -9L), class = "data.frame", .Names = c("PointID", 
"Latitude", "Longitude", "Address"))

I would like to use R to extract the city/state information from the full street address and create two columns to store it ("City" and "State").

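I assume the stringr package is the way to go, but I am not sure how to use it. The example above uses the following code to extract the zip code (the address column is named "result" in that example). Their dataset: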

#       ID Longitude  Latitude                                         result
# 1 311175  41.29844 -72.92918 16 Church Street South, New Haven, CT 06519, USA
# 2 292058  41.93694 -87.66984  1632 West Nelson Street, Chicago, IL 60657, USA
# 3  12979  37.58096 -77.47144    2077-2199 Seddon Way, Richmond, VA 23230, USA

Code to extract the zip code:

library(stringr)
data$zipcode <- substr(str_extract(data$result," [0-9]{5}, .+"),2,6)
data[,-4]

Can the code above be easily modified to get the city and state data?

3 answers:

Answer 0 (score: 4)

You can get the city and state from revgeocode() itself:

df <- cbind(df,do.call(rbind,
               lapply(1:nrow(df),
               function(i) 
               revgeocode(as.numeric(
               df[i,3:2]), output = "more")[c("administrative_area_level_1","locality")])))

df

#   PointID Latitude Longitude                                          Address 
# 1    1787 38.36648 -76.48020             2160 Turner Rd, Lusby, MD 20657, USA 
# 2    2805 36.19549 -94.21555       7948 W Gibbs Rd, Springdale, AR 72762, USA 
# 3    3025 43.41977 -87.96040           3907 Echo Ln, Saukville, WI 53080, USA 
# 4    3027 43.43722 -88.01833       2805 County Rd Y, Saukville, WI 53080, USA 
# 5    3028 43.45472 -87.97472          River Park Rd, Saukville, WI 53080, USA 
# 6    3029 43.45264 -87.97854  5100-5260 County Rd I, Saukville, WI 53080, USA 
# 7    3030 43.41195 -87.94149 3701-3739 County Hwy W, Saukville, WI 53080, USA 
# 8    3031 43.25548 -87.98643         13004 N Thomas Dr, Mequon, WI 53097, USA 
# 9    3033 43.26146 -87.96861       4823 W Bonniwell Rd, Mequon, WI 53097, USA 
#   administrative_area_level_1   locality 
# 1                    Maryland      Lusby 
# 2                    Arkansas Springdale 
# 3                   Wisconsin  Saukville 
# 4                   Wisconsin  Saukville 
# 5                   Wisconsin  Saukville 
# 6                   Wisconsin  Saukville 
# 7                   Wisconsin  Saukville 
# 8                   Wisconsin     Mequon 
# 9                   Wisconsin     Mequon

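P.S. You can do everything in one step (including getting the address and/or the zip code). Just add "address" and/or "postal_code" to c("administrative_area_level_1","locality"), which is the list of variables you want to extract.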

Answer 1 (score: 2)

1) sub  Use sub like this. No packages are needed.

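The regular expression matches the start of the string (^), followed by the shortest string up to a comma and a space, followed by the shortest string (the city) up to another comma and space, followed by two characters (the state), a space, five characters (the zip code), a comma, a space, USA and the end of the string. The parenthesized parts of the match can be referenced as \1, \2 and \3, but the backslashes must be doubled inside double quotes.

If your zip codes are not all 5 characters long, try pat <- "^.*?, (.*?), (..) (.*), USA$" instead.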

pat <- "^.*?, (.*?), (..) (.....), USA$"
transform(df, City = sub(pat, "\\1", Address), 
              State = sub(pat, "\\2", Address), 
              Zip = sub(pat, "\\3", Address))

giving:

  PointID Latitude Longitude                                          Address       City State   Zip
1    1787 38.36648 -76.48020             2160 Turner Rd, Lusby, MD 20657, USA      Lusby    MD 20657
2    2805 36.19549 -94.21555       7948 W Gibbs Rd, Springdale, AR 72762, USA Springdale    AR 72762
3    3025 43.41977 -87.96040           3907 Echo Ln, Saukville, WI 53080, USA  Saukville    WI 53080
4    3027 43.43722 -88.01833       2805 County Rd Y, Saukville, WI 53080, USA  Saukville    WI 53080
5    3028 43.45472 -87.97472          River Park Rd, Saukville, WI 53080, USA  Saukville    WI 53080
6    3029 43.45264 -87.97854  5100-5260 County Rd I, Saukville, WI 53080, USA  Saukville    WI 53080
7    3030 43.41195 -87.94149 3701-3739 County Hwy W, Saukville, WI 53080, USA  Saukville    WI 53080
8    3031 43.25548 -87.98643         13004 N Thomas Dr, Mequon, WI 53097, USA     Mequon    WI 53097
9    3033 43.26146 -87.96861       4823 W Bonniwell Rd, Mequon, WI 53097, USA     Mequon    WI 53097
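
One caveat worth illustrating: sub() returns its input unchanged when the pattern does not match, so an address in an unexpected format would end up verbatim in the City, State and Zip columns. The following is a minimal sketch of a guard using grepl(), with the relaxed pattern mentioned above; the two sample addresses are made up for illustration and are not part of the question's data.

```r
# Relaxed pattern from the answer: the zip code part may have any length
pat <- "^.*?, (.*?), (..) (.*), USA$"

# Hypothetical inputs: one ZIP+4 address and one that does not fit the pattern
addr <- c("2160 Turner Rd, Lusby, MD 20657-1234, USA",
          "Somewhere without commas")

# Without a guard, the non-matching address is returned as-is
unguarded <- sub(pat, "\\1", addr)

# With a guard, non-matching addresses become NA instead
matched <- grepl(pat, addr)
city  <- ifelse(matched, sub(pat, "\\1", addr), NA_character_)
state <- ifelse(matched, sub(pat, "\\2", addr), NA_character_)
zip   <- ifelse(matched, sub(pat, "\\3", addr), NA_character_)

data.frame(city, state, zip, stringsAsFactors = FALSE)
```

Whether NA is the right fallback depends on the data; for addresses outside the USA the pattern itself would need to change.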

2) read.pattern  Another possibility is read.pattern, using the same pat as above:

library(gsubfn)

cn <- c("City", "State", "Zip")
Address <- as.character(df$Address)
cbind(df, read.pattern(text = Address, pattern = pat, as.is = TRUE, col.names = cn))
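
If installing gsubfn is not an option, a roughly similar result can probably be had with strcapture() from the utils package (available in R 3.4.0 and later, as far as I know). This is only a sketch under that assumption; the proto data frame and the two sample addresses are chosen here for illustration.

```r
# Same pattern as in the answer: city, two-character state, five-character zip
pat <- "^.*?, (.*?), (..) (.....), USA$"

# Two addresses taken from the question's data
Address <- c("2160 Turner Rd, Lusby, MD 20657, USA",
             "7948 W Gibbs Rd, Springdale, AR 72762, USA")

# proto gives the names and types of the columns built from the capture groups
proto <- data.frame(City = character(), State = character(), Zip = character(),
                    stringsAsFactors = FALSE)

# One row per input string; strings that do not match yield a row of NA
parsed <- strcapture(pat, Address, proto)
parsed
```

The result can then be attached to the original data with cbind(df, parsed), in the same way as the read.pattern version.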

Answer 2 (score: 2)

If you want to use stringr, you can do it like this:

library(stringr)
library(data.table)

parse_address <- function(address){

  address <- address %>% 
    str_split(",") %>% 
    .[[1]]
  state <- address %>% 
    .[3] %>% 
    str_replace_all("[^A-Z]","")

  zip <- address %>% 
    .[3] %>% 
    str_replace_all("[^0-9]","")

  city <- address %>% 
    .[2] %>% 
    str_trim()

  street <- address %>% 
    .[1] %>% 
    str_trim()

  data.table(street, city, state, zip)
}

lapply(as.character(df$Address), parse_address) %>% 
  rbindlist
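
The function above picks fields by position from the start, so an address with an extra comma-separated part (a suite number, say) would shift the city and state. As a hedged alternative, here is a base-R sketch of the same split-on-comma idea that works on the whole column at once and counts fields from the end; parse_addresses and from_end are hypothetical names introduced here, not part of stringr or of the answer.

```r
# Hypothetical helper: split on commas, then index fields from the end,
# assuming the last three fields are always "City", "ST zip", "USA"
parse_addresses <- function(address) {
  fields <- strsplit(as.character(address), ", *")

  # k-th field counted from the end, or NA if the address has too few fields
  from_end <- function(x, k) {
    if (length(x) >= k) x[length(x) - k + 1] else NA_character_
  }

  state_zip <- vapply(fields, from_end, character(1), k = 2)  # e.g. "MD 20657"

  data.frame(
    City  = vapply(fields, from_end, character(1), k = 3),
    State = sub(" .*$", "", state_zip),      # text before the first space
    Zip   = sub("^[^ ]+ *", "", state_zip),  # text after the first space
    stringsAsFactors = FALSE
  )
}

res <- parse_addresses(c("2160 Turner Rd, Lusby, MD 20657, USA",
                         "Building 4, 3907 Echo Ln, Saukville, WI 53080, USA"))
res
```

With df from the question, cbind(df, parse_addresses(df$Address)) should add the three columns in one call.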