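I want to parse (extract) addresses into HouseNumber and Streetname. Later I should be able to write the extracted values into new columns (shops$HouseNumber and shops$Streetname).
So let's say I have a data frame called "shops":
> shops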
  Name                city      street
1 Something           Fakecity  New Street 3
2 SomethingOther      Fakecity  Some-Complicated-Casestreet 1-3
3 SomethingDifferent  Fakecity  Fake Street 14a
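Is there a way to split the street column into two lists, one with the street names and one with the house numbers, including values such as "1-3" and "14a"? In the end, the result should be assigned back to the data frame so that it looks like this: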
> shops
  Name                city      Streetname                   HouseNumber
1 Something           Fakecity  New Street                   3
2 SomethingOther      Fakecity  Some-Complicated-Casestreet  1-3
3 SomethingDifferent  Fakecity  Fake Street                  14a
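Example: Easyfakestreet 5 -> Easyfakestreet, 5
It is a bit more complicated than that, because some of my street strings have hyphenated house numbers and some have non-numeric components.
Examples:
New Street 3 -> ['New Street', '3']
Some-Complicated-Casestreet 1-3 -> ['Some-Complicated-Casestreet', '1-3']
Fake Street 14a -> ['Fake Street', '14a']
I would appreciate some help!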
Answer 0 (score: 8)
Here is a possible tidyr solution:
library(tidyr)
# Split "street" at the first digit: group 1 = non-digits, group 2 = the rest
extract(shops, "street", c("Streetname", "HouseNumber"), "(\\D+)(\\d.*)")
#   Name                city      Streetname                   HouseNumber
# 1 Something           Fakecity  New Street                   3
# 2 SomethingOther      Fakecity  Some-Complicated-Casestreet  1-3
# 3 SomethingDifferent  Fakecity  Fake Street                  14a
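To see what the two capture groups hold, the same pattern can be run in base R with sub(). This is only a sketch: the vector name street and the trimws() step are additions for illustration, not part of the answer above. Because \\D+ also matches the space before the number, the first group ends with a trailing space, so trimming it afterwards may be worthwhile.

```r
# Illustrative input, copied from the question's street column
street <- c("New Street 3", "Some-Complicated-Casestreet 1-3", "Fake Street 14a")

pat <- "(\\D+)(\\d.*)"

# Group 1: everything before the first digit (includes the trailing space)
raw_name <- sub(pat, "\\1", street)

# Group 2: from the first digit to the end of the string
house_number <- sub(pat, "\\2", street)

# Drop the trailing space that \\D+ captured
street_name <- trimws(raw_name)
```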
Answer 1 (score: 5)
You could try:
# Everything before the last space-separated token
shops$Streetname <- gsub("(.+)\\s[^ ]+$","\\1", shops$street)
# The last space-separated token
shops$HouseNumber <- gsub(".+\\s([^ ]+)$","\\1", shops$street)
Data
shops$street
#[1] "New Street 3" "Some-Complicated-Casestreet 1-3" "Fake Street 14a"
Result
shops$Streetname
#[1] "New Street" "Some-Complicated-Casestreet" "Fake Street"
shops$HouseNumber
#[1] "3" "1-3" "14a"
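These patterns split on the last space regardless of what follows it, so "Main Street" would come back as street "Main" and house number "Street", and a string without any space would be returned unchanged in both columns. A minimal sketch of one way to guard against that, assuming a house number always starts with a digit (the has_number helper and the extra test strings are hypothetical, not from the answer):

```r
# Two addresses from the question plus two hypothetical ones without a number
street <- c("New Street 3", "Fake Street 14a", "Broadway", "Main Street")

# TRUE when the last space-separated token starts with a digit
has_number <- grepl("\\s\\d[^ ]*$", street)

# Only split the rows that actually end in a number-like token
street_name  <- ifelse(has_number, sub("(.+)\\s[^ ]+$", "\\1", street), street)
house_number <- ifelse(has_number, sub(".+\\s([^ ]+)$", "\\1", street), NA_character_)
```

Rows without a recognisable number keep the full string as the street name and get NA as the house number.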
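Answer 2 (score: 2)
Create a pattern with back references that match the street and the number, then use sub to replace the whole string with each back reference in turn. No packages are needed: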
pat <- "(.*) (\\d.*)"
transform(shops,
street = sub(pat, "\\1", street),
HouseNumber = sub(pat, "\\2", street)
)
giving:
  Name                city      street                       HouseNumber
1 Something           Fakecity  New Street                   3
2 SomethingOther      Fakecity  Some-Complicated-Casestreet  1-3
3 SomethingDifferent  Fakecity  Fake Street                  14a
Here is pat, written without the R string escaping:
(.*) (\d.*)
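As an alternative to calling sub once per back reference, both groups can be pulled out in a single pass with regexec and regmatches from base R. This is a sketch rather than part of the answer; the names m and parts are made up for illustration, and it assumes every string matches the pattern (a non-matching string yields an empty element, which would break the rbind step).

```r
street <- c("New Street 3", "Some-Complicated-Casestreet 1-3", "Fake Street 14a")
pat <- "(.*) (\\d.*)"

# One list element per string: c(full match, group 1, group 2)
m <- regmatches(street, regexec(pat, street))

# Stack the elements into a character matrix with one row per address
parts <- do.call(rbind, m)

street_name  <- parts[, 2]  # first capture group
house_number <- parts[, 3]  # second capture group
```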
Notes:
1) We used this for shops:
shops <-
structure(list(Name = c("Something", "SomethingOther", "SomethingDifferent"
), city = c("Fakecity", "Fakecity", "Fakecity"), street = c("New Street 3",
"Some-Complicated-Casestreet 1-3", "Fake Street 14a")), .Names = c("Name",
"city", "street"), class = "data.frame", row.names = c(NA, -3L))
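2) David Arenburg's pattern could be used here instead. Just set pat to it. The pattern above has the advantage that it allows street names with digits embedded in them, while David's has the advantage that the space before the house number may be missing.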
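The trade-off in note 2 can be checked on two made-up addresses, one with a digit inside the street name and one with no space before the number. The variable names and the test strings below are hypothetical and only serve the comparison.

```r
pat_space <- "(.*) (\\d.*)"    # pattern from this answer
pat_digit <- "(\\D+)(\\d.*)"   # David Arenburg's pattern

# Hypothetical inputs: embedded digits, and a missing space
x <- c("Highway 66 12", "Fakestreet14a")

space_name   <- sub(pat_space, "\\1", x)  # "Highway 66"  "Fakestreet14a" (2nd: no match, unchanged)
space_number <- sub(pat_space, "\\2", x)  # "12"          "Fakestreet14a" (2nd: no match, unchanged)

digit_name   <- sub(pat_digit, "\\1", x)  # "Highway "    "Fakestreet"
digit_number <- sub(pat_digit, "\\2", x)  # "66 12"       "14a"
```

Neither pattern handles both cases, so the choice depends on which kind of irregularity is more common in the data.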
Answer 3 (score: 0)
You could use the unglue package:
library(unglue)
unglue_unnest(shops, street, "{street} {value=\\d.*}")
#>   Name                city      street                       value
#> 1 Something           Fakecity  New Street                   3
#> 2 SomethingOther      Fakecity  Some-Complicated-Casestreet  1-3
#> 3 SomethingDifferent  Fakecity  Fake Street                  14a
Created on 2019-10-08 by the reprex package (v0.3.0)
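Answer 4 (score: 0)
International addresses are a very complicated problem. Here is a PHP regular expression tried against a range of address formats: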
$re = '/(\d+[\d\/\-\. ,]*[ ,\d\-\w]{0,2} )/m';
$str = '234 Test Road, Testville
456b Tester Road, Testville
789 c Tester Road, Testville
Mystreet 14a
123/3 dsdsdfs
Roobertinkatu 36-40
Flats 1-24 Acacia Avenue
Apartment 9D, 1 Acacia Avenue
Flat 24, 1 Acacia Avenue
Moscow Street, plot,23 building 2
Apartment 5005 no. 7 lane 31 Wuming Rd
Quinta da Redonda Lote 3 - 1 º
102 - 3 Esq
Av 1 Maio 16,2 dt,
Rua de Ceuta Lote 1 Loja 5
11334 Nc Highway 72 E ';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
// Print the entire match result
var_dump($matches);