R:使用plyr在两个数据源的匹配子集之间执行模糊字符串匹配

时间:2015-06-17 17:39:59

标签: r plyr dplyr fuzzy-comparison

假设我列出了具有不同拼写错误数量的县或其他问题,以区别于2010 FIPS dataset(下面创建fips数据框的代码),但是拼写错误的县的状态驻留输入正确。这是我的完整数据集中的21个随机观察中的sample

tomatch <- structure(list(county = c("Beauregard", "De Soto", "Dekalb", "Webster",
                                     "Saint Joseph", "West Feliciana", "Ketchikan Gateway", "Evangeline", 
                                     "Richmond City", "Saint Mary", "Saint Louis City", "Mclean", 
                                     "Union", "Bienville", "Covington City", "Martinsville City", 
                                     "Claiborne", "King And Queen", "Mclean", "Mcminn", "Prince Georges"
), state = c("LA", "LA", "GA", "LA", "IN", "LA", "AK", "LA", "VA", 
             "LA", "MO", "KY", "LA", "LA", "VA", "VA", "LA", "VA", "ND", "TN", 
             "MD")), .Names = c("county", "state"), class = c("tbl_df", "data.frame"
             ), row.names = c(NA, -21L))

              county state
1         Beauregard    LA
2            De Soto    LA
3             Dekalb    GA
4            Webster    LA
5       Saint Joseph    IN
6     West Feliciana    LA
7  Ketchikan Gateway    AK
8         Evangeline    LA
9      Richmond City    VA
10        Saint Mary    LA
11  Saint Louis City    MO
12            Mclean    KY
13             Union    LA
14         Bienville    LA
15    Covington City    VA
16 Martinsville City    VA
17         Claiborne    LA
18    King And Queen    VA
19            Mclean    ND
20            Mcminn    TN
21    Prince Georges    MD

我已经使用adist创建了一个模糊字符串匹配算法,该算法将我县中约80%的县与fips中的县名相匹配。然而,有时它会匹配两个拼写相似的县,但来自不同的州(例如,#34; Webster,LA&#34;匹配到#34; Webster,GA&#34;而不是&#34; Webster Parrish ,洛杉矶&#34;)。

distance <- adist(tomatch$county, 
                  fips$countyname, 
                  partial = TRUE)


min.name <- apply(distance, 1, min)

matchedcounties <- NULL  

for(i in 1:nrow(distance)) {

  s2.i <- match(min.name[i], distance[i, ])
  s1.i <- i

  matchedcounties <- rbind(data.frame(s2.i = s2.i,
                                      s1.i = s1.i,
                                      s1name = tomatch[s1.i, ]$county, 
                                      s2name = fips[s2.i, ]$countyname, 
                                      adist = min.name[i]),
                           matchedcounties)

}

因此,我想将县的模糊字符串匹配限制为具有匹配状态的正确拼写版本。

我当前的算法制作了一个大矩阵,用于计算两个源之间的标准Levenshtein距离,然后选择具有最小距离的值。

为了解决我的问题,我猜测我需要创建一个可以应用于每个州的功能&#39;按ddply分组,但我对如何指示ddply函数中的组值应与其他数据框匹配感到困惑。使用任何其他软件包的dplyr解决方案或解决方案也将受到赞赏。

创建FIPS数据集的代码:

download.file('http://www2.census.gov/geo/docs/reference/codes/files/national_county.txt',
              './nationalfips.txt')

fips <- read.csv('./nationalfips.txt', 
                 stringsAsFactors = FALSE, colClasses = 'character', header = FALSE)
names(fips) <- c('state', 'statefips', 'countyfips', 'countyname', 'classfips')

# remove 'County' from countyname
fips$countyname <- sub('County', '', fips$countyname, fixed = TRUE)
fips$countyname <- stringr::str_trim(fips$countyname)

2 个答案:

答案 0 :(得分:2)

这是dplyr的一种方式。我首先使用州的FIPS名称加入tomatch data.frame(仅允许在州内匹配):

require(dplyr)
df <- tomatch %>% 
  left_join(fips, by="state")

接下来,我注意到很多县没有'圣'但是'圣'在FIPS数据集中。首先清理它可以改善所获得的结果。

df <- df %>%
    mutate(county_clean = gsub("Saint", "St.", county))

然后,按县分组此data.frame,并使用adist计算距离:

df <- df %>%
  group_by(county_clean) %>%                # Calculate the distance per county
  mutate(dist = diag(adist(county_clean, countyname, partial=TRUE))) %>%
  arrange(county, dist) # Used this for visual inspection.

请注意,我从结果矩阵中取出对角线,因为adist返回一个n x m矩阵,其中n表示x向量,m表示y向量(它计算所有组合)。 或者,您可以添加agrep结果:

df <- df %>%
  rowwise() %>% # 'group_by' a single row. 
  mutate(agrep_result = agrepl(county_clean, countyname, max.distance = 0.3)) %>%
  ungroup()   # Always a good idea to remove 'groups' after you're done.

然后像过去一样过滤,采取最小距离:

df <- df %>%
  group_by(county_clean) %>%   # Causes it to calculate the 'min' per group
  filter(dist == min(dist)) %>%
  ungroup()

请注意,这可能会导致为tomatch中的每个输入行返回多行 或者,在一次运行中完成所有操作(一旦我确信它正在执行它应该执行的操作,我通常会将代码更改为此格式):

df <- tomatch %>% 
  # Join on all names in the relevant state and clean 'St.'
  left_join(fips, by="state") %>%
  mutate(county_clean = gsub("Saint", "St.", county)) %>% 

  # Calculate the distances, per original county name.
  group_by(county_clean) %>%                
  mutate(dist = diag(adist(county_clean, countyname, partial=TRUE))) %>%

  # Append the agrepl result
  rowwise() %>%
  mutate(string_agrep = agrepl(county_clean, countyname, max.distance = 0.3)) %>%
  ungroup() %>%  

  # Only retain minimum distances
  group_by(county_clean) %>%   
  filter(dist == min(dist))

两种情况下的结果:

              county      county_clean state                countyname dist string_agrep
1         Beauregard        Beauregard    LA         Beauregard Parish    0         TRUE
2            De Soto           De Soto    LA            De Soto Parish    0         TRUE
3             Dekalb            Dekalb    GA                    DeKalb    1         TRUE
4            Webster           Webster    LA            Webster Parish    0         TRUE
5       Saint Joseph        St. Joseph    IN                St. Joseph    0         TRUE
6     West Feliciana    West Feliciana    LA     West Feliciana Parish    0         TRUE
7  Ketchikan Gateway Ketchikan Gateway    AK Ketchikan Gateway Borough    0         TRUE
8         Evangeline        Evangeline    LA         Evangeline Parish    0         TRUE
9      Richmond City     Richmond City    VA             Richmond city    1         TRUE
10        Saint Mary          St. Mary    LA           St. Mary Parish    0         TRUE
11  Saint Louis City    St. Louis City    MO            St. Louis city    1         TRUE
12            Mclean            Mclean    KY                    McLean    1         TRUE
13             Union             Union    LA              Union Parish    0         TRUE
14         Bienville         Bienville    LA          Bienville Parish    0         TRUE
15    Covington City    Covington City    VA            Covington city    1         TRUE
16 Martinsville City Martinsville City    VA         Martinsville city    1         TRUE
17         Claiborne         Claiborne    LA          Claiborne Parish    0         TRUE
18    King And Queen    King And Queen    VA            King and Queen    1         TRUE
19            Mclean            Mclean    ND                    McLean    1         TRUE
20            Mcminn            Mcminn    TN                    McMinn    1         TRUE
21    Prince Georges    Prince Georges    MD           Prince George's    1         TRU  

答案 1 :(得分:1)

没有示例数据,但尝试使用agrep而不是adist并仅搜索该州的名称

sapply(df_tomatch$county, function(x) agrep(x,df_matchby[df_matchby$state==dj_tomatch[x,'state'],'county'],value=TRUE)

您可以使用max.distance中的agrep参数来改变他们需要匹配的距离。此外,设置value=TRUE将返回匹配字符串的值,而不是匹配位置。