R: Mission impossible? How to assign "New York" to a county

Asked: 2017-11-28 23:21:42

Tags: r dplyr match geocoding acs

I'm having trouble assigning counties to certain cities. When querying via the acs package:
> geo.lookup(state = "NY", place = "New York")
  state state.name                                                                 county.name place             place.name
1    36   New York                                                                        <NA>    NA                   <NA>
2    36   New York Bronx County, Kings County, New York County, Queens County, Richmond County 51000          New York city
3    36   New York                                                               Oneida County 51011 New York Mills village

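As you can see, "New York", for example, comes back with a whole list of counties. The same goes for Los Angeles, Portland, Oklahoma, Columbus, and so on. How can such a place be assigned to a single "county"?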

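The following code is what I currently use to match "county.name" with the corresponding county FIPS code. Unfortunately, it only works when the query returns exactly one county name.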

Script:

dat <- c("New York, NY","Boston, MA","Los Angeles, CA","Dallas, TX","Palo Alto, CA")
dat <- strsplit(dat, ",")
dat

library(tigris)
library(acs)
library(dplyr) # provides bind_rows(), left_join() and %>%
data(fips_codes) # FIPS codes with state, code, county information

GeoLookup <- lapply(dat,function(x) {
  geo.lookup(state = trimws(x[2]), place = trimws(x[1]))[2,]
})

df <- bind_rows(GeoLookup)

#Rename cols to match
colnames(fips_codes) = c("state.abb", "statefips", "state.name", "countyfips", "county.name")

# Here is a problem, because it works with one item in "county.name" but not more than one (see output below).

df <- df %>% left_join(fips_codes, by = c("state.name", "county.name"))
df

Returns:

  state    state.name                                                                  county.name place           place.name state.abb statefips countyfips
1    36      New York  Bronx County, Kings County, New York County, Queens County, Richmond County 51000        New York city      <NA>      <NA>       <NA>
2    25 Massachusetts                                                               Suffolk County  7000          Boston city        MA        25        025
3     6    California                                                           Los Angeles County 20802 East Los Angeles CDP        CA        06        037
4    48         Texas Collin County, Dallas County, Denton County, Kaufman County, Rockwall County 19000          Dallas city      <NA>      <NA>       <NA>
5     6    California                                                             San Mateo County 20956  East Palo Alto city        CA        06        081

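To preserve the data, it would be good to have left_join match on "the county.name that contains {{1}}" (without the appended "xy city" in the name), or to pick the first item by default. It would be great to see how this can be done.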

In general: I assume there is no better approach than this one?

Thanks for your help!

2 Answers:

Answer 0: (score: 2)

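How about the code below, which creates a "long" data frame for joining? We use the tidyverse pipe operator to chain operations. strsplit returns a list, which we unnest to stack the list values (the county names that go with each combination of state.name and place.name) into a long data frame in which each county.name has its own row.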

library(tigris)
library(acs)  
library(tidyverse)

dat = geo.lookup(state = "NY", place = "New York")  
  state state.name                                                                 county.name place             place.name
1    36   New York                                                                        <NA>    NA                   <NA>
2    36   New York Bronx County, Kings County, New York County, Queens County, Richmond County 51000          New York city
3    36   New York                                                               Oneida County 51011 New York Mills village
dat = dat %>% 
  group_by(state.name, place.name) %>% 
  mutate(county.name = strsplit(county.name, ", ")) %>% 
  unnest
  state state.name place             place.name     county.name
  <chr>      <chr> <int>                  <chr>           <chr>
1    36   New York    NA                   <NA>            <NA>
2    36   New York 51000          New York city    Bronx County
3    36   New York 51000          New York city    Kings County
4    36   New York 51000          New York city New York County
5    36   New York 51000          New York city   Queens County
6    36   New York 51000          New York city Richmond County
7    36   New York 51011 New York Mills village   Oneida County
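
For readers who want to see the reshaping step without the tidyverse, here is a minimal base R sketch of the same idea. It does not call geo.lookup; the `lookup` data frame is a hand-built stand-in that copies two rows from the output above, and the variable names `lookup`, `counties`, and `long` are only illustrative.

```r
# Toy input mirroring two rows of the geo.lookup() output shown above.
lookup <- data.frame(
  state.name  = c("New York", "New York"),
  place.name  = c("New York city", "New York Mills village"),
  county.name = c(
    "Bronx County, Kings County, New York County, Queens County, Richmond County",
    "Oneida County"
  ),
  stringsAsFactors = FALSE
)

# Split each comma-separated county string into a character vector.
counties <- strsplit(lookup$county.name, ", ", fixed = TRUE)

# Repeat each original row once per county, then overwrite county.name
# with the flattened list so every county gets its own row.
long <- lookup[rep(seq_len(nrow(lookup)), lengths(counties)), ]
long$county.name <- unlist(counties)
rownames(long) <- NULL

long
```

The result has one row per place/county pair, which is the shape a join on county.name needs.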

Update: Regarding the second question in the comments, assuming you already have a vector of metro areas, then:

dat <- c("New York, NY","Boston, MA","Los Angeles, CA","Dallas, TX","Palo Alto, CA")

df <- map_df(strsplit(dat, ", "), function(x) {
  geo.lookup(state = x[2], place = x[1])[-1, ] %>% 
    group_by(state.name, place.name) %>%
    mutate(county.name = strsplit(county.name, ", ")) %>%
    unnest
})

df
   state    state.name place             place.name        county.name
 1    36      New York 51000          New York city       Bronx County
 2    36      New York 51000          New York city       Kings County
 3    36      New York 51000          New York city    New York County
 4    36      New York 51000          New York city      Queens County
 5    36      New York 51000          New York city    Richmond County
 6    36      New York 51011 New York Mills village      Oneida County
 7    25 Massachusetts  7000            Boston city     Suffolk County
 8    25 Massachusetts  7000            Boston city     Suffolk County
 9     6    California 20802   East Los Angeles CDP Los Angeles County
10     6    California 39612   Lake Los Angeles CDP Los Angeles County
11     6    California 44000       Los Angeles city Los Angeles County
12    48         Texas 19000            Dallas city      Collin County
13    48         Texas 19000            Dallas city      Dallas County
14    48         Texas 19000            Dallas city      Denton County
15    48         Texas 19000            Dallas city     Kaufman County
16    48         Texas 19000            Dallas city    Rockwall County
17    48         Texas 40516       Lake Dallas city      Denton County
18     6    California 20956    East Palo Alto city   San Mateo County
19     6    California 55282         Palo Alto city Santa Clara County

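Update 2: If I understand your comments, for cities (really, place names) with more than one county, we want only the county that has the same name as the city (for example, New York County in the case of New York city), or otherwise the first county in the list. The following code selects the county with the same name as the city or, if there isn't one, the first county for that city. You might need to tweak it a bit to make it work for the entire U.S. For example, to make it work for Louisiana, you might need gsub(" County| Parish"... instead of gsub(" County"...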

map_df(strsplit(dat, ", "), function(x) {
  geo.lookup(state = x[2], place = x[1])[-1, ] %>% 
    group_by(state.name, place.name) %>%
    mutate(county.name = strsplit(county.name, ", ")) %>%
    unnest %>% 
    slice(max(1, which(grepl(sub(" [A-Za-z]*$","", place.name), gsub(" County", "", county.name))), na.rm=TRUE))
})
   state    state.name place             place.name        county.name
   <chr>         <chr> <int>                  <chr>              <chr>
 1    36      New York 51000          New York city    New York County
 2    36      New York 51011 New York Mills village      Oneida County
 3    25 Massachusetts  7000            Boston city     Suffolk County
 4     6    California 20802   East Los Angeles CDP Los Angeles County
 5     6    California 39612   Lake Los Angeles CDP Los Angeles County
 6     6    California 44000       Los Angeles city Los Angeles County
 7    48         Texas 19000            Dallas city      Dallas County
 8    48         Texas 40516       Lake Dallas city      Denton County
 9     6    California 20956    East Palo Alto city   San Mateo County
10     6    California 55282         Palo Alto city Santa Clara County
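
The selection rule can also be written as a small standalone function, which may be easier to test and adjust state by state. This is only a sketch: `pick_county` is a hypothetical helper, not part of acs or tigris, and it uses exact equality on the cleaned names rather than the grepl substring match in the pipeline above. The "Parish" alternative is included on the assumption that Louisiana rows need it.

```r
# Hypothetical helper (not from acs or tigris): choose one county for a place.
pick_county <- function(place.name, county.names) {
  # Drop the trailing place type ("city", "village", "CDP", ...).
  place <- sub(" [A-Za-z]+$", "", place.name)
  # Drop the county-level suffix; "Parish" is there for Louisiana.
  bare <- sub(" (County|Parish)$", "", county.names)
  # Prefer the county named like the place; otherwise fall back to the first.
  hit <- which(bare == place)
  if (length(hit) > 0) county.names[hit[1]] else county.names[1]
}

nyc <- pick_county(
  "New York city",
  c("Bronx County", "Kings County", "New York County",
    "Queens County", "Richmond County")
)
lake_dallas <- pick_county("Lake Dallas city", "Denton County")
```

With the inputs above, `nyc` is "New York County" (a name match) and `lake_dallas` is "Denton County" (the fallback to the first entry).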

Answer 1: (score: 1)

Could you prepare your data with something like the code below?

{{1}}

It's a bit messy, and you need the plyr and stringr packages. Once the data is prepared, you should be able to join it.
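
As a rough illustration of that final join, the sketch below uses base R's merge on a one-county-per-row data frame. Both tables are hand-built stand-ins: `long` copies rows from the outputs shown earlier, and `fips` contains only the three FIPS rows that appear in the question's output, with the renamed column names from the question's script. With the real fips_codes table the call should look the same, assuming the columns have been renamed as in the question.

```r
# One county per row (values copied from the outputs earlier on this page).
long <- data.frame(
  state.name  = c("Massachusetts", "California", "California"),
  place.name  = c("Boston city", "Los Angeles city", "East Palo Alto city"),
  county.name = c("Suffolk County", "Los Angeles County", "San Mateo County"),
  stringsAsFactors = FALSE
)

# Stand-in for the renamed fips_codes table; only the rows needed here.
fips <- data.frame(
  state.name  = c("Massachusetts", "California", "California"),
  county.name = c("Suffolk County", "Los Angeles County", "San Mateo County"),
  statefips   = c("25", "06", "06"),
  countyfips  = c("025", "037", "081"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every row of `long`, like dplyr::left_join().
joined <- merge(long, fips, by = c("state.name", "county.name"), all.x = TRUE)

# Concatenate state and county codes into the five-digit county FIPS code.
joined$fips <- paste0(joined$statefips, joined$countyfips)
joined
```

Keeping the codes as character strings preserves the leading zeros, which would be lost if they were stored as numbers.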