假设我列出了具有不同拼写错误数量的县或其他问题,以区别于2010 FIPS dataset(下面创建fips
数据框的代码),但是拼写错误的县的状态驻留输入正确。这是我的完整数据集中的21个随机观察中的sample
:
tomatch <- structure(list(county = c("Beauregard", "De Soto", "Dekalb", "Webster",
"Saint Joseph", "West Feliciana", "Ketchikan Gateway", "Evangeline",
"Richmond City", "Saint Mary", "Saint Louis City", "Mclean",
"Union", "Bienville", "Covington City", "Martinsville City",
"Claiborne", "King And Queen", "Mclean", "Mcminn", "Prince Georges"
), state = c("LA", "LA", "GA", "LA", "IN", "LA", "AK", "LA", "VA",
"LA", "MO", "KY", "LA", "LA", "VA", "VA", "LA", "VA", "ND", "TN",
"MD")), .Names = c("county", "state"), class = c("tbl_df", "data.frame"
), row.names = c(NA, -21L))
county state
1 Beauregard LA
2 De Soto LA
3 Dekalb GA
4 Webster LA
5 Saint Joseph IN
6 West Feliciana LA
7 Ketchikan Gateway AK
8 Evangeline LA
9 Richmond City VA
10 Saint Mary LA
11 Saint Louis City MO
12 Mclean KY
13 Union LA
14 Bienville LA
15 Covington City VA
16 Martinsville City VA
17 Claiborne LA
18 King And Queen VA
19 Mclean ND
20 Mcminn TN
21 Prince Georges MD
我已经使用adist
创建了一个模糊字符串匹配算法,该算法将我县中约80%的县与fips
中的县名相匹配。然而,有时它会匹配两个拼写相似的县,但来自不同的州(例如,#34; Webster,LA&#34;匹配到#34; Webster,GA&#34;而不是&#34; Webster Parrish ,洛杉矶&#34;)。
distance <- adist(tomatch$county,
fips$countyname,
partial = TRUE)
min.name <- apply(distance, 1, min)
matchedcounties <- NULL
for(i in 1:nrow(distance)) {
s2.i <- match(min.name[i], distance[i, ])
s1.i <- i
matchedcounties <- rbind(data.frame(s2.i = s2.i,
s1.i = s1.i,
s1name = tomatch[s1.i, ]$county,
s2name = fips[s2.i, ]$countyname,
adist = min.name[i]),
matchedcounties)
}
因此,我想将县的模糊字符串匹配限制为具有匹配状态的正确拼写版本。
我当前的算法制作了一个大矩阵,用于计算两个源之间的标准Levenshtein距离,然后选择具有最小距离的值。
为了解决我的问题,我猜测我需要创建一个可以应用于每个州的功能&#39;按ddply
分组,但我对如何指示ddply
函数中的组值应与其他数据框匹配感到困惑。使用任何其他软件包的dplyr
解决方案或解决方案也将受到赞赏。
创建FIPS数据集的代码:
download.file('http://www2.census.gov/geo/docs/reference/codes/files/national_county.txt',
'./nationalfips.txt')
fips <- read.csv('./nationalfips.txt',
stringsAsFactors = FALSE, colClasses = 'character', header = FALSE)
names(fips) <- c('state', 'statefips', 'countyfips', 'countyname', 'classfips')
# remove 'County' from countyname
fips$countyname <- sub('County', '', fips$countyname, fixed = TRUE)
fips$countyname <- stringr::str_trim(fips$countyname)
答案 0 :(得分:2)
这是dplyr的一种方式。我首先使用州的FIPS名称加入tomatch
data.frame(仅允许在州内匹配):
require(dplyr)
df <- tomatch %>%
left_join(fips, by="state")
接下来,我注意到很多县没有'圣'但是'圣'在FIPS数据集中。首先清理它可以改善所获得的结果。
df <- df %>%
mutate(county_clean = gsub("Saint", "St.", county))
然后,按县分组此data.frame,并使用adist计算距离:
df <- df %>%
group_by(county_clean) %>% # Calculate the distance per county
mutate(dist = diag(adist(county_clean, countyname, partial=TRUE))) %>%
arrange(county, dist) # Used this for visual inspection.
请注意,我从结果矩阵中取出对角线,因为adist返回一个n x m矩阵,其中n表示x向量,m表示y向量(它计算所有组合)。 或者,您可以添加agrep结果:
df <- df %>%
rowwise() %>% # 'group_by' a single row.
mutate(agrep_result = agrepl(county_clean, countyname, max.distance = 0.3)) %>%
ungroup() # Always a good idea to remove 'groups' after you're done.
然后像过去一样过滤,采取最小距离:
df <- df %>%
group_by(county_clean) %>% # Causes it to calculate the 'min' per group
filter(dist == min(dist)) %>%
ungroup()
请注意,这可能会导致为tomatch
中的每个输入行返回多行
或者,在一次运行中完成所有操作(一旦我确信它正在执行它应该执行的操作,我通常会将代码更改为此格式):
df <- tomatch %>%
# Join on all names in the relevant state and clean 'St.'
left_join(fips, by="state") %>%
mutate(county_clean = gsub("Saint", "St.", county)) %>%
# Calculate the distances, per original county name.
group_by(county_clean) %>%
mutate(dist = diag(adist(county_clean, countyname, partial=TRUE))) %>%
# Append the agrepl result
rowwise() %>%
mutate(string_agrep = agrepl(county_clean, countyname, max.distance = 0.3)) %>%
ungroup() %>%
# Only retain minimum distances
group_by(county_clean) %>%
filter(dist == min(dist))
两种情况下的结果:
county county_clean state countyname dist string_agrep
1 Beauregard Beauregard LA Beauregard Parish 0 TRUE
2 De Soto De Soto LA De Soto Parish 0 TRUE
3 Dekalb Dekalb GA DeKalb 1 TRUE
4 Webster Webster LA Webster Parish 0 TRUE
5 Saint Joseph St. Joseph IN St. Joseph 0 TRUE
6 West Feliciana West Feliciana LA West Feliciana Parish 0 TRUE
7 Ketchikan Gateway Ketchikan Gateway AK Ketchikan Gateway Borough 0 TRUE
8 Evangeline Evangeline LA Evangeline Parish 0 TRUE
9 Richmond City Richmond City VA Richmond city 1 TRUE
10 Saint Mary St. Mary LA St. Mary Parish 0 TRUE
11 Saint Louis City St. Louis City MO St. Louis city 1 TRUE
12 Mclean Mclean KY McLean 1 TRUE
13 Union Union LA Union Parish 0 TRUE
14 Bienville Bienville LA Bienville Parish 0 TRUE
15 Covington City Covington City VA Covington city 1 TRUE
16 Martinsville City Martinsville City VA Martinsville city 1 TRUE
17 Claiborne Claiborne LA Claiborne Parish 0 TRUE
18 King And Queen King And Queen VA King and Queen 1 TRUE
19 Mclean Mclean ND McLean 1 TRUE
20 Mcminn Mcminn TN McMinn 1 TRUE
21 Prince Georges Prince Georges MD Prince George's 1 TRU
答案 1 :(得分:1)
没有示例数据,但尝试使用agrep而不是adist并仅搜索该州的名称
sapply(df_tomatch$county, function(x) agrep(x,df_matchby[df_matchby$state==dj_tomatch[x,'state'],'county'],value=TRUE)
您可以使用max.distance
中的agrep
参数来改变他们需要匹配的距离。此外,设置value=TRUE
将返回匹配字符串的值,而不是匹配位置。