Question

I have two data frames with multiple columns. Below is a shortened version of each, containing only the columns relevant to the question.

str(DF1)

'data.frame':   10 obs. of  6 variables:
 $ itemid      : int  1 1 1 1 1 1 1 1 1 1
 $ userid      : int  650 635 1 514 250 210 5 72 77 252
 $ rating      : int  3 4 5 5 4 5 4 4 5 5
 $ time        : Date, format: "1998-03-31" "1997-11-07" "1997-09-22" ...
 $ title       : chr  "Toy Story " "Toy Story " "Toy Story " "Toy Story " ...
 $ release_date: chr  "1995" "1995" "1995" "1995" ...

DF1

 itemid userid rating       time      title release_date
1       1    650      3 1998-03-31 Toy Story          1995
2       1    635      4 1997-11-07 Toy Story          1995
3       1      1      5 1997-09-22 Toy Story          1995
4       1    514      5 1997-09-26 Toy Story          1995
5       1    250      4 1997-12-27 Toy Story          1995
6       1    210      5 1998-02-17 Toy Story          1995
7       1      5      4 1997-09-30 Toy Story          1995
8       1     72      4 1997-11-20 Toy Story          1995
9       1     77      5 1998-01-13 Toy Story          1995
10      1    252      5 1998-04-01 Toy Story          1995

str(DF2)

'data.frame':   10 obs. of  6 variables:
 $ itemid      : int  2844 4936 4936 4972 5078 6684 6689 7264 7264 7880
 $ userid      : int  4477 8871 11628 16885 11628 4222 4222 2092 5943 11628
 $ rating      : int  6 8 5 8 4 6 6 8 6 7
 $ time        : Date, format: "2013-03-09" "2013-05-05" "2013-07-06" ...
 $ title       : chr  "FantÃ´mas - Ã€ l'ombre de la guillotine " "The Bank " "The Bank " "The Birth of a Nation " ...
 $ release_date: chr  "1913" "1915" "1915" "1915" ...

DF2

 itemid userid rating       time                                    title release_date
1    2844   4477      6 2013-03-09 FantÃ´mas - Ã€ l'ombre de la guillotine          1913
2    4936   8871      8 2013-05-05                                The Bank          1915
3    4936  11628      5 2013-07-06                                The Bank          1915
4    4972  16885      8 2013-08-19                   The Birth of a Nation          1915
5    5078  11628      4 2013-08-23                               The Cheat          1915
6    6684   4222      6 2013-08-24                             The Fireman          1916
7    6689   4222      6 2013-08-24                         The Floorwalker          1916
8    7264   2092      8 2013-03-17                                The Rink          1916
9    7264   5943      6 2013-05-12                                The Rink          1916
10   7880  11628      7 2013-07-19                             Easy Street          1917

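I want to match the titles across the two data sets using fuzzy string matching with the Levenshtein distance, and to confirm that two titles refer to the same movie by also requiring 'release_date' to match. Is there a better way to do this without loops? I tried a for loop with 'agrep' and ran out of memory. The output should be a data frame containing only the matched movies.

The original data frames have more than 100K rows each.

Thanks.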

Answer 1

Try the agrep function:

title <- c("The Bank", "The Cheat", "The Rink", "The Ring", "Toy Story", "Toy Story 2")
for (i in seq_along(title)) {
    # Approximate-match title[i] against every other title (title[-i] drops itself)
    x <- agrep(title[i], title[-i], value = TRUE)
    cat("Title :", title[i], " matched to ", x, "\n")
}
Title : The Bank  matched to   
Title : The Cheat  matched to   
Title : The Rink  matched to  The Ring 
Title : The Ring  matched to  The Rink 
Title : Toy Story  matched to  Toy Story 2 
Title : Toy Story 2  matched to  Toy Story
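
For the two-data-frame case in the question, one possible approach (a sketch, not a tested solution for the full 100K-row data) is to compare only titles that share a release_date and to compute the Levenshtein distances for each year in one call to adist() from base R. The helper name fuzzy_match_by_year, the max_dist threshold of 2, and the toy df1/df2 below are assumptions made for illustration; they are not from the original post.

```r
# Hypothetical miniature stand-ins for DF1 and DF2 (title and year only).
df1 <- data.frame(
  title = c("Toy Story ", "The Rink ", "The Bank "),
  release_date = c("1995", "1916", "1915"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  title = c("The Ring ", "The Bank ", "Toy Story 2 ", "Easy Street "),
  release_date = c("1916", "1915", "1995", "1917"),
  stringsAsFactors = FALSE
)

# fuzzy_match_by_year() is a hypothetical helper, not a library function.
# max_dist is an assumed cut-off on the Levenshtein distance.
fuzzy_match_by_year <- function(a, b, max_dist = 2) {
  # Keep distinct (title, year) pairs only: rating data repeats each title.
  a <- unique(data.frame(title = trimws(a$title),
                         release_date = a$release_date,
                         stringsAsFactors = FALSE))
  b <- unique(data.frame(title = trimws(b$title),
                         release_date = b$release_date,
                         stringsAsFactors = FALSE))
  # Only years present in both data sets can produce a match.
  years <- intersect(a$release_date, b$release_date)
  pieces <- lapply(years, function(y) {
    ta <- a$title[a$release_date == y]
    tb <- b$title[b$release_date == y]
    # Levenshtein distance matrix: rows index ta, columns index tb.
    d <- adist(ta, tb, ignore.case = TRUE)
    hit <- which(d <= max_dist, arr.ind = TRUE)
    data.frame(title1 = ta[hit[, 1]],
               title2 = tb[hit[, 2]],
               release_date = rep(y, nrow(hit)),
               distance = d[hit],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, pieces)
}

matches <- fuzzy_match_by_year(df1, df2)
print(matches)
```

On the toy data this pairs "Toy Story" with "Toy Story 2" (distance 2), "The Rink" with "The Ring" (distance 1), and "The Bank" with "The Bank" (distance 0). Because each distance matrix covers a single year, it stays much smaller than one matrix over all titles, which may help with the memory problem; the lapply over years is still a loop of sorts, but only over the distinct years. The result can then be merged back onto the original data frames by title and release_date to recover the rating columns.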
