My data frame has the following column structure (more than 1,000 rows in total):
addressfull
POINT(3.124537653 32.179354012)||DEF_32||molengraaf 20, 1689 GL Utrecht, Netherlands||15||map
POINT(3.124537680 32.179354014)||DEF_32||winkellaan 67, 5788 BG Amsterdam, Netherlands||13||map
POINT(3.124537653 32.179354012)||DEF_32||vermeerstraat 18, 0932 DC Rotterdam, Netherlands||11||map
POINT(2.915206183 24.315583523)||DEF_32||--||13||map
POINT (2.900824999999923 34.3175721)||DEF_84||Zandhorstlaan 122, 0823 GT Ochtrup, Germany||17||map
structure(list(addressfull = structure(c(3L, 5L, 4L, 2L, 1L), .Label = c("POINT (2.900824999999923 34.3175721)||DEF_84||Zandhorstlaan 122, 0823 GT Ochtrup, Germany||17||map",
"POINT(2.915206183 24.315583523)||DEF_32||--||13||map", "POINT(3.124537653 32.179354012)||DEF_32||molengraaf 20, 1689 GL Utrecht, Netherlands||15||map",
"POINT(3.124537653 32.179354012)||DEF_32||vermeerstraat 18, 0932 DC Rotterdam, Netherlands||11||map",
"POINT(3.124537680 32.179354014)||DEF_32||winkellaan 67, 5788 BG Amsterdam, Netherlands||13||map"
), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))
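The column contains the location, street, house number, zip code, city and country. I want to split the addressfull column into multiple columns in R, for example: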
street         house number  zip      city       country
molengraaf     20            1689 GL  Utrecht    Netherlands
winkellaan     67            5788 BG  Amsterdam  Netherlands
vermeerstraat  18            0932 DC  Rotterdam  Netherlands
na             na            na       na         na
Zandhorstlaan  122           0823 GT  Ochtrup    Germany
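I have read the tidyr and stringr documentation. I can see how to split at a given position or on a pattern (here the delimiters would be ")", "||" and ","), but I can't find the right code to split the column into multiple columns.
Can anyone help me?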
Answer 0 (score: 2)
For a base R approach, you can brute-force it with sub:
df$street <- sub("^(\\S+)\\s+.*$", "\\1", df$addressfull)
df$`house number` <- sub("^\\S+\\s+(\\d+).*$", "\\1", df$addressfull)
df$zip <- sub("^\\S+\\s+\\d+,\\s*(\\d+\\s+[A-Z]+).*$", "\\1", df$addressfull)
df$city <- sub("^.*?(\\S+),\\s*\\S+$", "\\1", df$addressfull)
df$country <- sub("^.*,\\s*(\\S+)$", "\\1", df$addressfull)
df
                                  addressfull     street house number     zip
1 molengraaf 20, 1689 GL Utrecht, Netherlands molengraaf           20 1689 GL
     city     country
1 Utrecht Netherlands
Data:
df <- data.frame(addressfull=c("molengraaf 20, 1689 GL Utrecht, Netherlands"),
                 stringsAsFactors=FALSE)
This assumes that the address text has already been isolated from the rest of the string. To do that, consider:
text <- "POINT(3.124537653 32.179354012)||DEF_32||molengraaf 20, 1689 GL Utrecht, Netherlands||15||map"
addressfull <- unlist(strsplit(text, "\\|\\|"))[3]
addressfull
[1] "molengraaf 20, 1689 GL Utrecht, Netherlands"
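The two steps above (isolating the address, then pulling out the pieces) can also be applied to the whole column at once. What follows is a minimal base R sketch, not the answerer's code: the names `raw`, `address`, `pattern`, `parts` and `result` are made up for illustration, and the single pattern assumes one-word street and city names and Dutch-style zip codes (four digits plus two capital letters), as in the sample rows. If the column is a factor, as in the question's dput output, wrap it in as.character() first.

```r
# Sample rows copied from the question
raw <- c(
  "POINT(3.124537653 32.179354012)||DEF_32||molengraaf 20, 1689 GL Utrecht, Netherlands||15||map",
  "POINT(2.915206183 24.315583523)||DEF_32||--||13||map",
  "POINT (2.900824999999923 34.3175721)||DEF_84||Zandhorstlaan 122, 0823 GT Ochtrup, Germany||17||map"
)

# Third "||"-delimited field of every row; fixed = TRUE treats "||" literally
address <- vapply(strsplit(raw, "||", fixed = TRUE), function(p) p[3], character(1))

# One pattern with five capture groups: street, house number, zip, city, country
pattern <- "^(\\S+)\\s+(\\d+),\\s*(\\d{4}\\s+[A-Z]{2})\\s+(\\S+),\\s*(\\S+)$"
matches <- regmatches(address, regexec(pattern, address))

# A matching row gives the full match plus 5 groups; a non-matching row
# (such as "--") gives character(0) and is turned into a row of NA
parts <- t(vapply(
  matches,
  function(m) if (length(m) == 6) m[-1] else rep(NA_character_, 5),
  character(5)
))
colnames(parts) <- c("street", "house_number", "zip", "city", "country")
result <- as.data.frame(parts, stringsAsFactors = FALSE)
result
```

Using one anchored pattern instead of five separate sub calls means a row either yields all five fields or none, so placeholder rows like "--" come out as NA rather than being returned unchanged, which is what sub does when nothing matches.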
Answer 1 (score: 0)
This would be a tidyverse way to solve the problem:
library(tidyverse)
df <- structure(list(addressfull = structure(c(3L, 5L, 4L, 2L, 1L), .Label = c("POINT (2.900824999999923 34.3175721)||DEF_84||Zandhorstlaan 122, 0823 GT Ochtrup, Germany||17||map",
"POINT(2.915206183 24.315583523)||DEF_32||--||13||map", "POINT(3.124537653 32.179354012)||DEF_32||molengraaf 20, 1689 GL Utrecht, Netherlands||15||map",
"POINT(3.124537653 32.179354012)||DEF_32||vermeerstraat 18, 0932 DC Rotterdam, Netherlands||11||map",
"POINT(3.124537680 32.179354014)||DEF_32||winkellaan 67, 5788 BG Amsterdam, Netherlands||13||map"
), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))
df %>% separate(addressfull, sep = "\\|\\|", into = c("Coords", "DEF", "ADDRESS"),extra = "drop") %>%
select(ADDRESS) %>%
separate(ADDRESS, sep = ",", into = c("street", "city", "country")) %>%
separate(street, sep = "(?= \\d)", into = c("street", "house_number")) %>%
separate(city, sep = "(?<=[A-Z][A-Z])", into = c("zip", "city"))
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [4].
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [4].
#> street house_number zip city country
#> 1 molengraaf 20 1689 GL Utrecht Netherlands
#> 2 winkellaan 67 5788 BG Amsterdam Netherlands
#> 3 vermeerstraat 18 0932 DC Rotterdam Netherlands
#> 4 -- <NA> <NA> <NA> <NA>
#> 5 Zandhorstlaan 122 0823 GT Ochtrup Germany
Created on 2020-02-27 by the reprex package (v0.3.0)
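One detail the printed output hides: splitting on "," and on zero-width lookarounds keeps the separating space attached to the following piece, and as far as I know separate() does not trim it, so values such as " Utrecht" or " 20" probably carry a leading space. Below is a small base R sketch of cleaning that up afterwards. The data frame `pieces` is a hypothetical stand-in for the pipeline result, not actual output from the answer.

```r
# Hypothetical stand-in for the pipeline result: note the leading spaces
pieces <- data.frame(
  street = c("molengraaf", "Zandhorstlaan"),
  house_number = c(" 20", " 122"),
  zip = c(" 1689 GL", " 0823 GT"),
  city = c(" Utrecht", " Ochtrup"),
  country = c(" Netherlands", " Germany"),
  stringsAsFactors = FALSE
)

# Trim leading/trailing whitespace in every column; assigning into pieces[]
# keeps the data frame structure instead of returning a plain list
pieces[] <- lapply(pieces, trimws)

# Once trimmed, house numbers can be converted to integers if that is useful
pieces$house_number <- as.integer(pieces$house_number)
pieces
```

Trimming matters as soon as the columns are used for joins or comparisons, where " Utrecht" and "Utrecht" would not be treated as equal.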
Answer 2 (score: 0)
A stringr solution would be:
library(stringr)

addresssplit <- data.frame(
street = str_extract(addressfull$addressfull, "(?<=DEF_\\d{2}\\|\\|)\\w+\\b"),
number = str_extract(addressfull$addressfull, "\\d{1,}(?=,)"),
zip = str_extract(addressfull$addressfull, "(?<=\\s)\\d{4}\\s[A-Z]{2}"),
city = str_extract(addressfull$addressfull, "(?<=\\d{4}\\s[A-Z]{2}\\s)\\w+"),
country = str_extract(addressfull$addressfull, "(?<=[a-z]\\b,\\s)\\w+\\b")
)
Result:
addresssplit
street number zip city country
1 molengraaf 20 1689 GL Utrecht Netherlands
2 winkellaan 67 5788 BG Amsterdam Netherlands
3 vermeerstraat 18 0932 DC Rotterdam Netherlands
4 <NA> <NA> <NA> <NA> <NA>
5 Zandhorstlaan 122 0823 GT Ochtrup Germany
Data:
addressfull <- structure(list(addressfull = structure(c(3L, 5L, 4L, 2L, 1L), .Label = c("POINT (2.900824999999923 34.3175721)||DEF_84||Zandhorstlaan 122, 0823 GT Ochtrup, Germany||17||map",
"POINT(2.915206183 24.315583523)||DEF_32||--||13||map", "POINT(3.124537653 32.179354012)||DEF_32||molengraaf 20, 1689 GL Utrecht, Netherlands||15||map",
"POINT(3.124537653 32.179354012)||DEF_32||vermeerstraat 18, 0932 DC Rotterdam, Netherlands||11||map",
"POINT(3.124537680 32.179354014)||DEF_32||winkellaan 67, 5788 BG Amsterdam, Netherlands||13||map"
), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))
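For readers who would rather not load stringr, the same lookaround patterns should also work in base R with perl = TRUE. The sketch below is only an approximation of what str_extract does: `extract_first` is a hypothetical helper written for this illustration, and `x` holds three rows copied from the question.

```r
# Hypothetical helper: first match of a Perl-style pattern, or NA when absent
extract_first <- function(x, pattern) {
  x <- as.character(x)                      # also handles factor columns
  hit <- regexpr(pattern, x, perl = TRUE)   # -1 marks "no match"
  out <- rep(NA_character_, length(x))
  # regmatches() returns only the matched elements, in order
  out[which(hit > 0)] <- regmatches(x, hit)
  out
}

x <- c(
  "POINT(3.124537653 32.179354012)||DEF_32||molengraaf 20, 1689 GL Utrecht, Netherlands||15||map",
  "POINT(2.915206183 24.315583523)||DEF_32||--||13||map",
  "POINT (2.900824999999923 34.3175721)||DEF_84||Zandhorstlaan 122, 0823 GT Ochtrup, Germany||17||map"
)

# Same patterns as in the stringr answer
street <- extract_first(x, "(?<=DEF_\\d{2}\\|\\|)\\w+\\b")
zip    <- extract_first(x, "(?<=\\s)\\d{4}\\s[A-Z]{2}")
city   <- extract_first(x, "(?<=\\d{4}\\s[A-Z]{2}\\s)\\w+")
```

The lookbehinds anchor each field to its neighbour (for example, the city is whatever word follows a zip code), which is why the "--" row yields NA for every field instead of a partial match.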