Question

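I have 2 datasets, each with more than 100K rows. I want to merge them based on fuzzy string matching of one column ('movie title') as well as on the release date. Samples of both datasets are given below.

Dataset 1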

itemid userid rating       time                              title release_date
99991    1673    835      3 1998-03-27                             mirage         1995
99992    1674    840      4 1998-03-29                         mamma roma         1962
99993    1675    851      3 1998-01-08                     sunchaser, the         1996
99994    1676    851      2 1997-10-01                   war at home, the         1996
99995    1677    854      3 1997-12-22                      sweet nothing         1995
99996    1678    863      1 1998-03-07                         mat' i syn         1997
99997    1679    863      3 1998-03-07                          b. monkey         1998
99998    1680    863      2 1998-03-07                      sliding doors         1998
99999    1681    896      3 1998-02-11                       you so crazy         1994
100000   1682    916      3 1997-11-29 scream of stone (schrei aus stein)         1991

Dataset 2

itemid userid rating       time                                   title release_date
1    2844   4477      3 2013-03-09 fantã´mas - ã€ l'ombre de la guillotine         1913
2    4936   8871      4 2013-05-05                                the bank         1915
3    4936  11628      3 2013-07-06                                the bank         1915
4    4972  16885      4 2013-08-19                   the birth of a nation         1915
5    5078  11628      2 2013-08-23                               the cheat         1915
6    6684   4222      3 2013-08-24                             the fireman         1916
7    6689   4222      3 2013-08-24                         the floorwalker         1916
8    7264   2092      4 2013-03-17                                the rink         1916
9    7264   5943      3 2013-05-12                                the rink         1916
10   7880  11628      4 2013-07-19                             easy street         1917

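I have looked at 'agrep', but it only matches one string at a time. The 'stringdist' function is good, but you need to run it in a loop, find the minimum distance, and then do further processing, which is very time-consuming given the size of the datasets. The strings can contain typos and special characters, which is why fuzzy matching is required. I looked around and found the 'Levenshtein' and 'Jaro-Winkler' methods. From what I read, the latter is well suited to strings that contain typos.

In this scenario, fuzzy matching alone may not give good results: for example, the movie title 'toy story' in one dataset could be matched to 'toy story 2' in the other, which is wrong. So I need to take the release date into account to make sure the matched movies are unique.

I would like to know whether there is a way to do this without using a loop. In the worst case, if I have to use a loop, how can I make it run as fast and efficiently as possible?

I tried the following code, but it takes a very long time to run.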

for(i in 1:nrow(test))
  for(j in 1:nrow(test1))
  {

    test$title.match <- ifelse(jarowinkler(test$x[i], test1$x[j]) > 0.85,
                      test$title, NA)
  }

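test - contains 1682 unique movie names, converted to lowercase
test1 - contains 11451 unique movie names, converted to lowercase

Is there a way to avoid the for loops and make this run faster?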

Answer 1

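How about this approach to get you moving? You can adjust the match threshold from 0.85 after looking at the results. You could then use dplyr to group by the matched title and summarise by subtracting the release dates. Any zeros would mean the same release date.

library(RecordLinkage)  # provides jarowinkler(); data frames renamed to dataset_1/dataset_2, since dataset-1 is not a valid R name
dataset_1$title.match <- ifelse(jarowinkler(dataset_1$title, dataset_2$title) > 0.85, dataset_1$title, NA)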
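
To compare every title in one dataset against every title in the other without an explicit loop, one option is to build the full similarity matrix in a single call and mask out pairs whose release years differ. The following is only a minimal sketch under several assumptions: the data frames `d1` and `d2` are made-up toy rows, not the real data; it uses base R's `adist()` (Levenshtein edit distance, normalised by the longer title's length) in place of Jaro-Winkler so that it runs without extra packages; and the 0.85 threshold is simply carried over from above.

```r
# Toy stand-ins for the two datasets (hypothetical rows, not the real data).
d1 <- data.frame(
  title = c("toy story", "sliding doors", "the rink"),
  release_date = c(1995, 1998, 1916),
  stringsAsFactors = FALSE
)
d2 <- data.frame(
  title = c("toy story 2", "toy storey", "slidng doors", "the rink"),
  release_date = c(1999, 1995, 1998, 1916),
  stringsAsFactors = FALSE
)

# Edit distance for all pairs at once: rows follow d1, columns follow d2.
dist <- adist(d1$title, d2$title)

# Normalise to a similarity in [0, 1], where 1 means identical strings.
max_len <- outer(nchar(d1$title), nchar(d2$title), pmax)
sim <- 1 - dist / max_len

# Rule out pairs released in different years by giving them a similarity
# below anything a real pair can reach.
same_year <- outer(d1$release_date, d2$release_date, "==")
sim[!same_year] <- -1

# For each d1 row, pick the best-scoring d2 row and keep it only if it
# clears the threshold.
best <- max.col(sim, ties.method = "first")
best_sim <- sim[cbind(seq_len(nrow(d1)), best)]
threshold <- 0.85
d1$title.match <- ifelse(best_sim > threshold, d2$title[best], NA)

print(d1)
```

Here 'toy story' is never paired with 'toy story 2', because the years differ. A full matrix for 1682 x 11451 titles has about 19 million cells, which should fit in memory, but 100K x 100K would not; in that case the same steps could be run one release year at a time. If Jaro-Winkler is preferred, the stringdist package's `stringdistmatrix()` with `method = "jw"` returns a comparable all-pairs matrix of distances.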

Fuzzy string matching in R