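I have been trying to figure out how to clean up and edit a column in a data set.
The data set I am working with is supposed to cover the city of San Francisco only. The column named "city" contains many different spellings of San Francisco, as well as other cities. This is what it looks like:
table(sf$city)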
                                       Brentwood                      CA
                   30401                      18                     370
               DALY CITY             FOSTER CITY                 HAYWARD
                       0                       0                       0
                  Novato                 Oakland                 OAKLAND
                       0                      40                       0
                     S F                    S.F.                 s.F. Ca
                       0                   31428                      12
               SAN BRUNO           SAN FRANCICSO          San Franciisco
                       0                     221                      54
           san francisco           san Francisco           San francisco
                      20                     284                       0
           San Francisco           SAN FRANCISCO        san Francisco CA
                   78050                   16603                       6
          San Francisco,       San Francisco, Ca       San Francisco, CA
                      12                       4                      72
 San Francisco, CA 94132          San Franciscvo           San Francsico
                       0                       0                       2
          San Franicisco          Sand Francisco                      sf
                      41                      30                      17
                      Sf                      SF                 SF , CA
                     214                   81226                       1
             SF CA 94133                  SF, CA            SF, CA 94110
                       0                       9                      38
            SF, CA 94115                     SF.                     SF`
                       4                    1656                      31
       SO. SAN FRANCISCO                 SO.S.F.
                       0                       6
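What I want to do is change sf$city so that it contains only "San Francisco". All the data in sf$city would then fall under the single city San Francisco, so that when I type table(sf$city) it shows only San Francisco.
Can I group them? Something like: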
sf$city = subset(sf, city == "S.F." & "s.F. Ca" & "SAN FRANCICSO" & ...
and list all of the city variants I want? Or would that distort and mess up my data?
Answer 0: (score: 3)
I would try using agrep and grep with regular expressions.
Sample data:
d <- c("Brentwood", "CA", "DALY CITY", "FOSTER CITY", "HAYWARD", "Novato",
"Oakland", "OAKLAND", "S F", "S.F.", "s.F. Ca", "SAN BRUNO",
"SAN FRANCICSO", "San Franciisco", "san francisco", "san Francisco",
"San francisco", "San Francisco", "SAN FRANCISCO", "san Francisco CA",
"San Francisco,", "San Francisco, Ca", "San Francisco, CA", "San Francisco, CA 94132",
"San Franciscvo", "San Francsico", "San Franicisco", "Sand Francisco",
"sf", "Sf", "SF", "SF , CA", "SF CA", "94133", "SF, CA", "SF, CA 94110",
"SF, CA 94115", "SF.", "SF`", "SO. SAN FRANCISCO", "SO.S.F.")
You can target strings like "San Francisco" using agrep; the default max.dist = 0.1 works well here. You can then target just the S.F. variants using grep:
d[agrep("San Francisco", d, ignore.case = TRUE, max.dist = 0.1)] <- "San Francisco"
d[grep("\\bS[. ]?F\\.?\\b", d, ignore.case = TRUE, perl = TRUE)] <- "San Francisco"
# [1] "Brentwood" "CA" "DALY CITY" "FOSTER CITY"
# [5] "HAYWARD" "Novato" "Oakland" "OAKLAND"
# [9] "San Francisco" "San Francisco" "San Francisco" "SAN BRUNO"
#[13] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[17] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[21] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[25] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[29] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[33] "San Francisco" "94133" "San Francisco" "San Francisco"
#[37] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[41] "San Francisco"
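When the values live in a data frame column rather than a bare vector, the same two steps can be applied to the column directly. The following is a minimal sketch, not part of the original answer: the toy data frame `sf` and its contents are made up for illustration, and the `as.character()` conversion is an assumption that guards against `city` being stored as a factor (assigning a value that is not an existing level to a factor yields `NA`).

```r
# Toy stand-in for the real data (hypothetical values, for illustration only)
sf <- data.frame(city = c("San Francisco", "SF", "S.F.", "San Franciscvo",
                          "sf", "Oakland"),
                 stringsAsFactors = FALSE)

# Work on a character vector so new values can be assigned safely
sf$city <- as.character(sf$city)

# Step 1: fuzzy-match spellings close to "San Francisco"
fuzzy <- agrep("San Francisco", sf$city, ignore.case = TRUE, max.dist = 0.1)

# Step 2: match the abbreviated forms (SF, S.F., S F, ...)
abbrev <- grep("\\bS[. ]?F\\.?\\b", sf$city, ignore.case = TRUE, perl = TRUE)

# Inspect what is about to be overwritten before committing to it
hits <- union(fuzzy, abbrev)
print(unique(sf$city[hits]))

sf$city[hits] <- "San Francisco"
print(table(sf$city))
```

Printing the matched values first makes it easier to spot entries that should not be merged before they are overwritten.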
adist is another option for targeting strings like "San Francisco". I found that the following settings work well. They even pick up "San Fran":
d[adist("San Francisco", d, ignore.case = TRUE,
cost = c(del = 0.5, ins = 0.5, sub = 3)) < 3] <- "San Francisco"
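To see how the threshold of 3 behaves, it can help to print the distances themselves. With insertions and deletions at 0.5 each and substitutions at 3, a substitution is never cheaper than a deletion plus an insertion, so a truncated form such as "San Fran" (five characters short) stays under the cutoff. The sketch below uses a few hand-picked strings as illustrative inputs; they are chosen for the example and are not the full data.

```r
candidates <- c("San Fran", "San Franciscvo", "Oakland")

# Same cost settings as above: cheap insertions/deletions, expensive substitutions
dists <- adist("San Francisco", candidates, ignore.case = TRUE,
               costs = c(del = 0.5, ins = 0.5, sub = 3))

# adist() returns a matrix with one row per pattern and one column per candidate
print(dists)

# Logical mask of the candidates that would be recoded
close_enough <- as.vector(dists < 3)
print(candidates[close_enough])
```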
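Answer 1: (score: 1)
To build on @jeta's answer, you can also take the resulting data set and run it through the Google Maps API, as shown here: https://gist.github.com/josecarlosgonz/6417633
Specifically, using the functions available at that link, you can take the grep() output and run
library(plyr)  # provides ldply()
locations <- ldply(d, function(x) geoCode(x))
head(locations, 10)
This gives you the following output: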
# V1 V2 V3 V4
# 1 36.0331164 -86.7827772 APPROXIMATE Brentwood, TN, USA
# 2 36.778261 -119.4179324 APPROXIMATE California, USA
# 3 37.6879241 -122.4702079 APPROXIMATE Daly City, CA, USA
# 4 37.5585465 -122.2710788 APPROXIMATE Foster City, CA, USA
# 5 37.6688205 -122.0807964 APPROXIMATE Hayward, CA, USA
# 6 38.1074198 -122.5697032 APPROXIMATE Novato, CA, USA
# 7 37.8043637 -122.2711137 APPROXIMATE Oakland, CA, USA
# 8 37.8043637 -122.2711137 APPROXIMATE Oakland, CA, USA
# 9 37.7749295 -122.4194155 APPROXIMATE San Francisco, CA, USA
# 10 37.7749295 -122.4194155 APPROXIMATE San Francisco, CA, USA
Since you know that all of your locations are in CA, you may also want to append CA to the end of each element of the vector, like so:
d[grep("CA", d, invert = TRUE)] <- paste0(d[grep("CA", d, invert = TRUE)], ", CA")
locations <- ldply(d, function(x) geoCode(x))
head(locations, 10)
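As shown below, this ensures that Google places Brentwood in CA.
The advantage of this approach is that you end up with normalized cities in V4, which can help with filtering and other tasks.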
# V1 V2 V3 V4
# 1 37.931868 -121.6957863 APPROXIMATE Brentwood, CA 94513, USA
# 2 36.778261 -119.4179324 APPROXIMATE California, USA
# 3 37.6879241 -122.4702079 APPROXIMATE Daly City, CA, USA
# 4 37.5585465 -122.2710788 APPROXIMATE Foster City, CA, USA
# 5 37.6688205 -122.0807964 APPROXIMATE Hayward, CA, USA
# 6 38.1074198 -122.5697032 APPROXIMATE Novato, CA, USA
# 7 37.8043637 -122.2711137 APPROXIMATE Oakland, CA, USA
# 8 37.8043637 -122.2711137 APPROXIMATE Oakland, CA, USA
# 9 37.7749295 -122.4194155 APPROXIMATE San Francisco, CA, USA
# 10 37.7749295 -122.4194155 APPROXIMATE San Francisco, CA, USA
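One way to use the normalized column for filtering is sketched below. The `locations` data frame here is a hand-built stand-in that copies a few rows from the output above, so the snippet runs without calling the geocoding API; keeping rows whose `V4` starts with "San Francisco," is an assumed filtering rule, not something prescribed by the original answer.

```r
# Hand-built stand-in for the geocoding result (rows copied from the output above)
locations <- data.frame(
  V1 = c(37.931868, 37.8043637, 37.7749295),
  V2 = c(-121.6957863, -122.2711137, -122.4194155),
  V3 = "APPROXIMATE",
  V4 = c("Brentwood, CA 94513, USA", "Oakland, CA, USA", "San Francisco, CA, USA"),
  stringsAsFactors = FALSE
)

# Keep only rows whose normalized address begins with "San Francisco,"
sf_only <- locations[grepl("^San Francisco,", locations$V4), ]
print(sf_only)
```

Anchoring the pattern with `^` means an address that merely contains the words, such as one beginning with "South San Francisco", would not be kept.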
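Note: Google enforces rate limits on its API. If you want to avoid signing up for an API key, you will need to run ldply in chunks at a 10-second pace, as suggested in the comments on the GitHub link above.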
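The chunking idea can be sketched in base R as follows. This is only an illustration under stated assumptions: `geocode_in_chunks` is a hypothetical helper, not part of the gist; `chunk_size` and `pause_seconds` are placeholder values to tune to whatever limit applies; and `fake_geocode` stands in for the gist's `geoCode` so the example runs offline.

```r
# Hypothetical helper: apply `fun` to `x` in chunks, pausing between chunks
geocode_in_chunks <- function(x, fun, chunk_size = 10, pause_seconds = 10) {
  # Assign each element to a chunk: 1, 1, ..., 2, 2, ...
  chunk_id <- ceiling(seq_along(x) / chunk_size)
  chunks <- split(x, chunk_id)
  results <- vector("list", length(chunks))
  for (i in seq_along(chunks)) {
    rows <- lapply(chunks[[i]], fun)
    results[[i]] <- do.call(rbind, rows)
    # Wait before the next chunk, but not after the last one
    if (i < length(chunks)) Sys.sleep(pause_seconds)
  }
  do.call(rbind, results)
}

# Offline stand-in for geoCode(); returns one row per input (assumption)
fake_geocode <- function(address) {
  data.frame(query = address, n_chars = nchar(address), stringsAsFactors = FALSE)
}

out <- geocode_in_chunks(c("SF", "Oakland", "Daly City"), fake_geocode,
                         chunk_size = 2, pause_seconds = 0)
print(out)
```

In real use, the second argument would be the geocoding function and `pause_seconds` would be set to a non-zero value.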
Answer 2: (score: 0)
To overwrite sf$city with "San Francisco" for every entry, this is the typical approach:
sf$city <- "San Francisco"
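However, if some of your observations are not San Francisco and you want to discard them, you need to drop those observations first. Here is a start:
# drop non-SF observations (the trailing comma subsets rows rather than columns)
sfReal <- sf[!(tolower(sf$city) %in% c("daly city", "brentwood", "hayward", "oakland")), ]
My geography isn't the best, so I may have missed some. Alternatively, you can use %in% to include only the San Francisco observations. Given the setup you provided above, I suspect that is the case.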
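The inclusion variant might look like the sketch below. The toy data frame and the `sf_spellings` vector are made up for illustration; in practice the list would have to cover every spelling that appears in `table(sf$city)`.

```r
# Toy stand-in for the real data (hypothetical values)
sf <- data.frame(city = c("San Francisco", "SF", "Oakland", "sf", "Daly City"),
                 value = 1:5,
                 stringsAsFactors = FALSE)

# Lower-cased spellings that should count as San Francisco (illustrative, not exhaustive)
sf_spellings <- c("san francisco", "sf", "s.f.")

# Keep only matching rows; the trailing comma selects rows rather than columns
sfReal <- sf[tolower(sf$city) %in% sf_spellings, ]

# Collapse the remaining spellings into one label
sfReal$city <- "San Francisco"
print(table(sfReal$city))
```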
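In the future, if this is a recurring task, you should look into regular expressions and grep. They are a fantastic tool that will pay huge dividends for string-manipulation tasks. @jota offers a good approach in the answer provided.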