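How do you perform fuzzy string matching in R

Time: 2015-01-25 14:05:06

Tags: r string matching

I have two data frames with multiple columns. Below is a shortened version of each data frame, containing only the columns relevant to the question.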

str(DF1)

'data.frame':   10 obs. of  6 variables:
 $ itemid      : int  1 1 1 1 1 1 1 1 1 1
 $ userid      : int  650 635 1 514 250 210 5 72 77 252
 $ rating      : int  3 4 5 5 4 5 4 4 5 5
 $ time        : Date, format: "1998-03-31" "1997-11-07" "1997-09-22" ...
 $ title       : chr  "Toy Story " "Toy Story " "Toy Story " "Toy Story " ...
 $ release_date: chr  "1995" "1995" "1995" "1995" ...

DF1

 itemid userid rating       time      title release_date
1       1    650      3 1998-03-31 Toy Story          1995
2       1    635      4 1997-11-07 Toy Story          1995
3       1      1      5 1997-09-22 Toy Story          1995
4       1    514      5 1997-09-26 Toy Story          1995
5       1    250      4 1997-12-27 Toy Story          1995
6       1    210      5 1998-02-17 Toy Story          1995
7       1      5      4 1997-09-30 Toy Story          1995
8       1     72      4 1997-11-20 Toy Story          1995
9       1     77      5 1998-01-13 Toy Story          1995
10      1    252      5 1998-04-01 Toy Story          1995

str(DF2)

'data.frame':   10 obs. of  6 variables:
 $ itemid      : int  2844 4936 4936 4972 5078 6684 6689 7264 7264 7880
 $ userid      : int  4477 8871 11628 16885 11628 4222 4222 2092 5943 11628
 $ rating      : int  6 8 5 8 4 6 6 8 6 7
 $ time        : Date, format: "2013-03-09" "2013-05-05" "2013-07-06" ...
 $ title       : chr  "Fantômas - À l'ombre de la guillotine " "The Bank " "The Bank " "The Birth of a Nation " ...
 $ release_date: chr  "1913" "1915" "1915" "1915" ...

DF2

 itemid userid rating       time                                    title release_date
1    2844   4477      6 2013-03-09 Fantômas - À l'ombre de la guillotine          1913
2    4936   8871      8 2013-05-05                                The Bank          1915
3    4936  11628      5 2013-07-06                                The Bank          1915
4    4972  16885      8 2013-08-19                   The Birth of a Nation          1915
5    5078  11628      4 2013-08-23                               The Cheat          1915
6    6684   4222      6 2013-08-24                             The Fireman          1916
7    6689   4222      6 2013-08-24                         The Floorwalker          1916
8    7264   2092      8 2013-03-17                                The Rink          1916
9    7264   5943      6 2013-05-12                                The Rink          1916
10   7880  11628      7 2013-07-19                             Easy Street          1917

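I want to match the titles across the two datasets using fuzzy string matching with the Levenshtein distance metric, and also confirm that matched titles have the same 'release_date'. Is there a better way to do this without using loops? I tried calling 'agrep' inside a for loop and ran out of memory. The output should be a data frame containing only the matched movies.

The original data frames have more than 100K rows.

Thanks.

1 answer:

Answer 0: (score: 1)

Try the agrep function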

title <- c("The Bank", "The Cheat", "The Rink", "The Ring", "Toy Story", "Toy Story 2")
for(i in seq_along(title)){
    x <- agrep(title[i], title[-i], value = TRUE)   
    cat("Title :", title[i], " matched to ", x, "\n")
}
Title : The Bank  matched to   
Title : The Cheat  matched to   
Title : The Rink  matched to  The Ring 
Title : The Ring  matched to  The Rink 
Title : Toy Story  matched to  Toy Story 2 
Title : Toy Story 2  matched to  Toy Story
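
The loop above compares titles within a single vector. For the two-data-frame case in the question, one possible approach is to compare titles only among rows that share the same release_date, using base R's adist (generalized Levenshtein distance). The following is a minimal sketch, not a tested solution for 100K rows: the function name fuzzy_match_titles and the max_dist threshold are hypothetical, and it assumes both data frames have character columns title and release_date as shown in the question.

```r
# Hypothetical helper: fuzzy-match titles between two data frames,
# comparing only titles that share the same release_date.
fuzzy_match_titles <- function(df1, df2, max_dist = 2) {
  # Trim trailing spaces, then keep unique (title, release_date) pairs
  # so the distance matrices stay as small as possible.
  a <- unique(data.frame(title = trimws(df1$title),
                         release_date = df1$release_date,
                         stringsAsFactors = FALSE))
  b <- unique(data.frame(title = trimws(df2$title),
                         release_date = df2$release_date,
                         stringsAsFactors = FALSE))

  # Only years present in both data frames can produce a match.
  years <- intersect(a$release_date, b$release_date)

  pieces <- lapply(years, function(y) {
    ta <- a$title[a$release_date == y]
    tb <- b$title[b$release_date == y]
    d <- adist(ta, tb, ignore.case = TRUE)       # one matrix per year
    hit <- which(d <= max_dist, arr.ind = TRUE)  # (row, col) index pairs
    data.frame(title1 = ta[hit[, 1]],
               title2 = tb[hit[, 2]],
               release_date = rep(y, nrow(hit)),
               distance = d[hit],
               stringsAsFactors = FALSE)
  })

  # Returns NULL when the two data frames share no release_date at all.
  do.call(rbind, pieces)
}

# Small made-up example (not the asker's data)
df1 <- data.frame(title = c("Toy Story ", "The Rink "),
                  release_date = c("1995", "1916"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(title = c("The Ring ", "The Bank ", "Toy Story"),
                  release_date = c("1916", "1915", "1995"),
                  stringsAsFactors = FALSE)
res <- fuzzy_match_titles(df1, df2, max_dist = 1)
print(res)
```

Splitting by release_date means each adist call builds a matrix only for the titles of one year, which should need far less memory than comparing every title against every other. A year with very many distinct titles could still produce a large matrix, so the threshold and the grouping may need tuning on the real data. The result can then be joined back to the original data frames with merge if the other columns are needed.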