Split an address string into street, city, and state in R

Date: 2017-07-24 19:55:52

Tags: r split

I have addresses stored as strings, as shown below:

dat = data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA", 
                               "1626 Aviation Way, Augusta, GA 30906, USA", 
                               "325 Main St, Stratford, CT 06615, USA", 
                               "4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"), stringsAsFactors = FALSE)

I want to split them into 5 columns: street, city, state, zip code, and country. How can I do this in R?

3 answers:

Answer 0: (score: 2)

I solved it with one line of code. It may look a bit naive to regex experts, but it works for the sample data.

library(stringr)

dat = data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA", 
                               "1626 Aviation Way, Augusta, GA 30906, USA", 
                               "325 Main St, Stratford, CT 06615, USA", 
                               "4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"), stringsAsFactors = FALSE)

str_match(dat$Addresses,"(.+), (.+), (.+) (.+), (.+)")[ ,-1]
      [,1]                       [,2]          [,3] [,4]    [,5] 
[1,] "1626 Aviation Way"        "Albuquerque" "NM" "30906" "USA"
[2,] "1626 Aviation Way"        "Augusta"     "GA" "30906" "USA"
[3,] "325 Main St"              "Stratford"   "CT" "06615" "USA"
[4,] "4205 Bessie Coleman Blvd" "Tampa"       "FL" "33607" "USA"
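If you would rather end up with a named data frame and avoid the stringr dependency, the same pattern can be applied with base R. This is only a minimal sketch under the same assumption as above, namely that every address has exactly the form "street, city, state zip, country"; the names `pattern`, `matches`, `parts`, and `addr` are made up for illustration.

```r
# Sample data from the question.
dat <- data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA",
                                "1626 Aviation Way, Augusta, GA 30906, USA",
                                "325 Main St, Stratford, CT 06615, USA",
                                "4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"),
                  stringsAsFactors = FALSE)

# Same five capture groups as the str_match() call, anchored to the whole string.
pattern <- "^(.+), (.+), (.+) (.+), (.+)$"

# regexec() finds the match and its groups; regmatches() extracts them.
# Each list element holds the full match followed by the five groups.
matches <- regmatches(dat$Addresses, regexec(pattern, dat$Addresses))

# Bind the rows into a matrix and drop the full-match column.
# Caveat: an address that does not match yields character(0), which this
# simple rbind() does not handle, so check your data first.
parts <- do.call(rbind, matches)[, -1, drop = FALSE]
colnames(parts) <- c("Street", "City", "State", "Zip", "Country")

# Keep everything as character so zip codes such as "06615" keep the leading zero.
addr <- data.frame(parts, stringsAsFactors = FALSE)
addr
```

Because the groups are greedy, an address with an extra comma (a suite number, say) would push the additional piece into the street column, so this is best treated as a starting point rather than a general parser.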

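Answer 1: (score: 1)

This ended up being a lot of steps. You could probably do it in fewer, but this is how I did it. I also assume your data starts out in a data frame with one address per row.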

dat = data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA", 
                 "1626 Aviation Way, Augusta, GA 30906, USA", 
                 "325 Main St, Stratford, CT 06615, USA", 
                 "4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"), stringsAsFactors = FALSE)

> dat
                                       Addresses
1  1626 Aviation Way, Albuquerque, NM 30906, USA
2      1626 Aviation Way, Augusta, GA 30906, USA
3          325 Main St, Stratford, CT 06615, USA
4 4205 Bessie Coleman Blvd, Tampa, FL 33607, USA

First we split on the commas; the state and zip are separated later. I also trim the extra whitespace left over from splitting on commas.

dat2 = sapply(dat$Addresses, strsplit, ",")
dat2 = lapply(dat2, trimws)

> dat2
$`1626 Aviation Way, Albuquerque, NM 30906, USA`
[1] "1626 Aviation Way" "Albuquerque"       "NM 30906"          "USA"              

$`1626 Aviation Way, Augusta, GA 30906, USA`
[1] "1626 Aviation Way" "Augusta"           "GA 30906"          "USA"              

$`325 Main St, Stratford, CT 06615, USA`
[1] "325 Main St" "Stratford"   "CT 06615"    "USA"        

$`4205 Bessie Coleman Blvd, Tampa, FL 33607, USA`
[1] "4205 Bessie Coleman Blvd" "Tampa"                    "FL 33607"                 "USA"    

Now we need to get this back into a data frame.

dat2 = data.frame(matrix(unlist(dat2), ncol = 4, byrow = TRUE), stringsAsFactors = FALSE)

> dat2
                        X1          X2       X3  X4
1        1626 Aviation Way Albuquerque NM 30906 USA
2        1626 Aviation Way     Augusta GA 30906 USA
3              325 Main St   Stratford CT 06615 USA
4 4205 Bessie Coleman Blvd       Tampa FL 33607 USA

Next, we can split X3 into state and zip, and then drop that column.

dat2$State = sapply(dat2$X3, function(x) strsplit(x, " ")[[1]][1])
dat2$Zip = sapply(dat2$X3, function(x) strsplit(x, " ")[[1]][2])

dat2 = dat2[, -3]

> dat2
                        X1          X2  X4 State   Zip
1        1626 Aviation Way Albuquerque USA    NM 30906
2        1626 Aviation Way     Augusta USA    GA 30906
3              325 Main St   Stratford USA    CT 06615
4 4205 Bessie Coleman Blvd       Tampa USA    FL 33607

Finally, we set the column names and we are done.

colnames(dat2) = c("Street", "City", "Country", "State", "Zip")
> dat2
                    Street        City Country State   Zip
1        1626 Aviation Way Albuquerque     USA    NM 30906
2        1626 Aviation Way     Augusta     USA    GA 30906
3              325 Main St   Stratford     USA    CT 06615
4 4205 Bessie Coleman Blvd       Tampa     USA    FL 33607
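The steps above can be condensed somewhat. The following is a rough sketch, not a replacement for the walkthrough: it assumes every address splits into exactly four comma-separated pieces and that the third piece is always "STATE ZIP" separated by a single space. The names `pieces`, `m`, `state_zip`, and `res` are illustrative.

```r
# Sample data from the question.
dat <- data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA",
                                "1626 Aviation Way, Augusta, GA 30906, USA",
                                "325 Main St, Stratford, CT 06615, USA",
                                "4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"),
                  stringsAsFactors = FALSE)

# Split on a comma plus any following whitespace, so no separate trimws() is needed.
pieces <- strsplit(dat$Addresses, ",\\s*")

# One row per address, four columns: street, city, "state zip", country.
m <- do.call(rbind, pieces)

# Split the third column ("NM 30906") into state and zip.
state_zip <- do.call(rbind, strsplit(m[, 3], " ", fixed = TRUE))

# Assemble the result with names set up front; zip stays character.
res <- data.frame(Street  = m[, 1],
                  City    = m[, 2],
                  State   = state_zip[, 1],
                  Zip     = state_zip[, 2],
                  Country = m[, 4],
                  stringsAsFactors = FALSE)
res
```

Naming the columns when the data frame is built avoids the intermediate X1..X4 names and the later column drop, at the cost of hard-coding the expected number of pieces.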

Answer 2: (score: -1)

Use my package tfwstring.

It automatically handles any address type, even ones with prefixes and suffixes.

if (!require(remotes)) install.packages("remotes")
remotes::install_github("nbarsch/tfwstring")

To parse an address:

tfwstring::parseaddress(address, check_python=TRUE, force_stateabb=FALSE, return="char")

On Mac and Linux, if python and the python module usaddress are missing, the module should be installed automatically via pip3 install usaddress (because unix is obviously better).

On Windows, it is recommended to:

  1. Install python yourself from here: https://www.python.org/downloads/windows/
  2. Install the C++ and python tools for Visual Studio: https://visualstudio.microsoft.com/downloads/
  3. Use PowerShell to run parseaddress()
  4. Run it with check_python=FALSE to avoid problems on Windows.

Example:

> dat = data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA",
+                                "1626 Aviation Way, Augusta, GA 30906, USA",
+                                "325 Main St, Stratford, CT 06615, USA",
+                                "4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"), stringsAsFactors = FALSE)
> parseaddress(dat$Addresses[1])
     AddressNumber         StreetName StreetNamePostType          PlaceName          StateName            ZipCode        CountryName
            "1626"         "Aviation"              "Way"      "Albuquerque"               "NM"            "30906"              "USA"
> parseaddress(dat$Addresses[2])
     AddressNumber         StreetName StreetNamePostType          PlaceName          StateName            ZipCode        CountryName
            "1626"         "Aviation"              "Way"          "Augusta"               "GA"            "30906"              "USA"
> parseaddress(dat$Addresses[3])
     AddressNumber         StreetName StreetNamePostType          PlaceName          StateName            ZipCode        CountryName
             "325"             "Main"               "St"        "Stratford"               "CT"            "06615"              "USA"
> parseaddress(dat$Addresses[4])
     AddressNumber         StreetName StreetNamePostType          PlaceName          StateName            ZipCode        CountryName
            "4205"   "Bessie Coleman"             "Blvd"            "Tampa"               "FL"            "33607"              "USA"