I have addresses stored as strings, as shown below:
dat = data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA",
"1626 Aviation Way, Augusta, GA 30906, USA",
"325 Main St, Stratford, CT 06615, USA",
"4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"), stringsAsFactors = FALSE)
I want to split them into 5 columns: street, city, state, zip code, and country. How can I do this in R?
Answer 0 (score: 2)
I solved it with one line of code. It may look a bit naive to regex experts, but it works for the sample data.
library(stringr)
dat = data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA",
"1626 Aviation Way, Augusta, GA 30906, USA",
"325 Main St, Stratford, CT 06615, USA",
"4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"), stringsAsFactors = FALSE)
str_match(dat$Addresses,"(.+), (.+), (.+) (.+), (.+)")[ ,-1]
     [,1]                       [,2]          [,3] [,4]    [,5]
[1,] "1626 Aviation Way"        "Albuquerque" "NM" "30906" "USA"
[2,] "1626 Aviation Way"        "Augusta"     "GA" "30906" "USA"
[3,] "325 Main St"              "Stratford"   "CT" "06615" "USA"
[4,] "4205 Bessie Coleman Blvd" "Tampa"       "FL" "33607" "USA"
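For readers who prefer to avoid the stringr dependency, the same idea can be sketched in base R. This is only an illustration, not the answerer's code: the stricter pattern (two uppercase letters for the state, five digits for the zip) and the column names are assumptions about the expected layout.

```r
# Sample data from the question.
dat <- data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA",
                                "1626 Aviation Way, Augusta, GA 30906, USA",
                                "325 Main St, Stratford, CT 06615, USA",
                                "4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"),
                  stringsAsFactors = FALSE)

# Assumed layout: "<street>, <city>, <2-letter state> <5-digit zip>, <country>".
pattern <- "^(.+), (.+), ([A-Z]{2}) ([0-9]{5}), (.+)$"

# regexec() locates the full match and each capture group;
# regmatches() extracts them as one character vector per address.
m <- regmatches(dat$Addresses, regexec(pattern, dat$Addresses))

# Drop the full match (element 1) and stack the capture groups row by row.
parts <- do.call(rbind, lapply(m, function(x) x[-1]))
colnames(parts) <- c("Street", "City", "State", "Zip", "Country")
res <- as.data.frame(parts, stringsAsFactors = FALSE)
```

An address that does not fit the assumed layout yields an empty vector in `m`, so it is worth checking `lengths(m)` before stacking the rows on real data.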
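Answer 1 (score: 1)
This ended up being a lot of steps. You could do it in fewer, but this is how I did it. I am also assuming your data starts out in a data frame with one address per row.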
dat = data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA",
"1626 Aviation Way, Augusta, GA 30906, USA",
"325 Main St, Stratford, CT 06615, USA",
"4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"), stringsAsFactors = FALSE)
> dat
Addresses
1 1626 Aviation Way, Albuquerque, NM 30906, USA
2 1626 Aviation Way, Augusta, GA 30906, USA
3 325 Main St, Stratford, CT 06615, USA
4 4205 Bessie Coleman Blvd, Tampa, FL 33607, USA
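Now we need to split on the commas first, and separate the state and zip later. I will also remove the extra whitespace left over from splitting on the commas.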
dat2 = sapply(dat$Addresses, strsplit, ",")
dat2 = lapply(dat2, trimws)
> dat2
$`1626 Aviation Way, Albuquerque, NM 30906, USA`
[1] "1626 Aviation Way" "Albuquerque" "NM 30906" "USA"
$`1626 Aviation Way, Augusta, GA 30906, USA`
[1] "1626 Aviation Way" "Augusta" "GA 30906" "USA"
$`325 Main St, Stratford, CT 06615, USA`
[1] "325 Main St" "Stratford" "CT 06615" "USA"
$`4205 Bessie Coleman Blvd, Tampa, FL 33607, USA`
[1] "4205 Bessie Coleman Blvd" "Tampa" "FL 33607" "USA"
Now we need to get this back into a data frame.
dat2 = data.frame(matrix(unlist(dat2), ncol = 4, byrow = TRUE), stringsAsFactors = FALSE)
> dat2
                        X1          X2       X3  X4
1        1626 Aviation Way Albuquerque NM 30906 USA
2        1626 Aviation Way     Augusta GA 30906 USA
3              325 Main St   Stratford CT 06615 USA
4 4205 Bessie Coleman Blvd       Tampa FL 33607 USA
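The matrix(unlist(...), ncol = 4) step relies on every address splitting into exactly four pieces; an address with an extra comma would shift values into the wrong columns. A small check along the following lines can catch that before the matrix is built. The second address here is a made-up example with a suite number, used only to show a failing case.

```r
# One well-formed address from the question and one hypothetical
# address with an extra comma (the "Suite 2" part is invented).
addresses <- c("1626 Aviation Way, Augusta, GA 30906, USA",
               "325 Main St, Suite 2, Stratford, CT 06615, USA")

# Split on commas and trim the whitespace around each piece.
pieces <- lapply(strsplit(addresses, ","), trimws)

# Count the pieces per address and flag the rows that have exactly four.
n_parts <- lengths(pieces)
ok <- n_parts == 4

# Addresses that would break the four-column layout.
bad_addresses <- addresses[!ok]
```

Rows flagged in `bad_addresses` can then be fixed by hand or handled separately.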
Next, we can split X3 into state and zip, and then remove that column.
dat2$State = sapply(dat2$X3, function(x) strsplit(x, " ")[[1]][1])
dat2$Zip = sapply(dat2$X3, function(x) strsplit(x, " ")[[1]][2])
dat2 = dat2[, -3]
> dat2
                        X1          X2  X4 State   Zip
1        1626 Aviation Way Albuquerque USA    NM 30906
2        1626 Aviation Way     Augusta USA    GA 30906
3              325 Main St   Stratford USA    CT 06615
4 4205 Bessie Coleman Blvd       Tampa USA    FL 33607
Finally, we can set the column names, and we are done.
colnames(dat2) = c("Street", "City", "Country", "State", "Zip")
> dat2
                    Street        City Country State   Zip
1        1626 Aviation Way Albuquerque     USA    NM 30906
2        1626 Aviation Way     Augusta     USA    GA 30906
3              325 Main St   Stratford     USA    CT 06615
4 4205 Bessie Coleman Blvd       Tampa     USA    FL 33607
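As noted at the top of this answer, the same result can be reached in fewer steps. One possible condensed version in base R is sketched below; it assumes, like the steps above, that every address has exactly three commas and that state and zip are separated by a single space. The name res2 is just a placeholder.

```r
# Sample data from the question.
dat <- data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA",
                                "1626 Aviation Way, Augusta, GA 30906, USA",
                                "325 Main St, Stratford, CT 06615, USA",
                                "4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"),
                  stringsAsFactors = FALSE)

# Split on a comma plus any following spaces, then stack into a 4-column matrix.
parts <- do.call(rbind, strsplit(dat$Addresses, ", *"))

# Column 3 holds "STATE ZIP"; split it on the space into a 2-column matrix.
state_zip <- do.call(rbind, strsplit(parts[, 3], " "))

# Assemble the final data frame with the same column order as above.
res2 <- data.frame(Street  = parts[, 1],
                   City    = parts[, 2],
                   Country = parts[, 4],
                   State   = state_zip[, 1],
                   Zip     = state_zip[, 2],
                   stringsAsFactors = FALSE)
```

Keeping Zip as a character column matters here, since converting it to a number would drop the leading zero in "06615".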
Answer 2 (score: -1)
Use my package tfwstring.
It automatically handles any address type, even ones with prefixes and suffixes.
if (!require(remotes)) install.packages("remotes")
remotes::install_github("nbarsch/tfwstring")
tfwstring::parseaddress(address, check_python=TRUE, force_stateabb=FALSE, return="char")
On Mac and Linux, the python module usaddress (pip3 install usaddress) should be installed automatically if python and the module are missing (because Unix is obviously better).
On Windows, it is recommended to use check_python=FALSE when calling parseaddress(), to avoid problems running on the Windows operating system.
> dat = data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA",
+ "1626 Aviation Way, Augusta, GA 30906, USA",
+ "325 Main St, Stratford, CT 06615, USA",
+ "4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"), stringsAsFactors = FALSE)
> parseaddress(dat$Addresses[1])
AddressNumber StreetName StreetNamePostType PlaceName StateName ZipCode CountryName
"1626" "Aviation" "Way" "Albuquerque" "NM" "30906" "USA"
> parseaddress(dat$Addresses[2])
AddressNumber StreetName StreetNamePostType PlaceName StateName ZipCode CountryName
"1626" "Aviation" "Way" "Augusta" "GA" "30906" "USA"
> parseaddress(dat$Addresses[3])
AddressNumber StreetName StreetNamePostType PlaceName StateName ZipCode CountryName
"325" "Main" "St" "Stratford" "CT" "06615" "USA"
> parseaddress(dat$Addresses[4])
AddressNumber StreetName StreetNamePostType PlaceName StateName ZipCode CountryName
"4205" "Bessie Coleman" "Blvd" "Tampa" "FL" "33607" "USA"