I have two data frames with many columns. Below is a shortened version of each, containing only the columns relevant to the question.
str(DF1)
'data.frame': 10 obs. of 6 variables:
$ itemid : int 1 1 1 1 1 1 1 1 1 1
$ userid : int 650 635 1 514 250 210 5 72 77 252
$ rating : int 3 4 5 5 4 5 4 4 5 5
$ time : Date, format: "1998-03-31" "1997-11-07" "1997-09-22" ...
$ title : chr "Toy Story " "Toy Story " "Toy Story " "Toy Story " ...
$ release_date: chr "1995" "1995" "1995" "1995" ...
DF1
itemid userid rating time title release_date
1 1 650 3 1998-03-31 Toy Story 1995
2 1 635 4 1997-11-07 Toy Story 1995
3 1 1 5 1997-09-22 Toy Story 1995
4 1 514 5 1997-09-26 Toy Story 1995
5 1 250 4 1997-12-27 Toy Story 1995
6 1 210 5 1998-02-17 Toy Story 1995
7 1 5 4 1997-09-30 Toy Story 1995
8 1 72 4 1997-11-20 Toy Story 1995
9 1 77 5 1998-01-13 Toy Story 1995
10 1 252 5 1998-04-01 Toy Story 1995
str(DF2)
'data.frame': 10 obs. of 6 variables:
$ itemid : int 2844 4936 4936 4972 5078 6684 6689 7264 7264 7880
$ userid : int 4477 8871 11628 16885 11628 4222 4222 2092 5943 11628
$ rating : int 6 8 5 8 4 6 6 8 6 7
$ time : Date, format: "2013-03-09" "2013-05-05" "2013-07-06" ...
$ title : chr "Fantômas - À l'ombre de la guillotine " "The Bank " "The Bank " "The Birth of a Nation " ...
$ release_date: chr "1913" "1915" "1915" "1915" ...
DF2
itemid userid rating time title release_date
1 2844 4477 6 2013-03-09 Fantômas - À l'ombre de la guillotine 1913
2 4936 8871 8 2013-05-05 The Bank 1915
3 4936 11628 5 2013-07-06 The Bank 1915
4 4972 16885 8 2013-08-19 The Birth of a Nation 1915
5 5078 11628 4 2013-08-23 The Cheat 1915
6 6684 4222 6 2013-08-24 The Fireman 1916
7 6689 4222 6 2013-08-24 The Floorwalker 1916
8 7264 2092 8 2013-03-17 The Rink 1916
9 7264 5943 6 2013-05-12 The Rink 1916
10 7880 11628 7 2013-07-19 Easy Street 1917
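I would like to match the titles across the two data sets using fuzzy string matching with the Levenshtein distance, and also confirm that a title match is genuine by requiring `release_date` to match as well. Is there a better way to do this without loops? I tried a for loop with `agrep`, but I ran out of memory. The output should be a data frame containing only the matched movies.
The original data frames have more than 100K rows.
Thanks.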
Answer 0 (score: 1)
Try the `agrep` function:
title <- c("The Bank", "The Cheat", "The Rink", "The Ring", "Toy Story", "Toy Story 2")
# Compare each title against all the other titles using approximate matching
for (i in seq_along(title)) {
  x <- agrep(title[i], title[-i], value = TRUE)
  cat("Title :", title[i], " matched to ", x, "\n")
}
Title : The Bank matched to
Title : The Cheat matched to
Title : The Rink matched to The Ring
Title : The Ring matched to The Rink
Title : Toy Story matched to Toy Story 2
Title : Toy Story 2 matched to Toy Story
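The loop above compares titles within a single vector. For the two-data-frame case in the question, one possible approach is to group by `release_date` first and compute Levenshtein distances only between titles from the same year, which keeps each distance matrix small. The following is a minimal sketch under stated assumptions, not a tested solution for the full 100K-row data: the sample rows, the threshold `max_dist`, and the helper `match_year` are all hypothetical, and `adist` from base R is used in place of `agrep`.

```r
# Hypothetical stand-ins for DF1 and DF2 (only the two relevant columns).
DF1 <- data.frame(
  title = c("Toy Story ", "The Rink ", "The Bank "),
  release_date = c("1995", "1916", "1915"),
  stringsAsFactors = FALSE
)
DF2 <- data.frame(
  title = c("The Ring ", "The Bank ", "Toy Story 2 ", "Easy Street "),
  release_date = c("1916", "1915", "1999", "1917"),
  stringsAsFactors = FALSE
)

max_dist <- 2  # assumed edit-distance threshold; tune for real data

# Work on unique (title, year) pairs so repeated ratings are compared once.
u1 <- unique(DF1[, c("title", "release_date")])
u2 <- unique(DF2[, c("title", "release_date")])

# Normalise titles: drop surrounding whitespace, ignore case.
u1$key <- tolower(trimws(u1$title))
u2$key <- tolower(trimws(u2$title))

# Hypothetical helper: fuzzy-match titles that share one release year.
match_year <- function(yr) {
  a <- u1[u1$release_date == yr, , drop = FALSE]
  b <- u2[u2$release_date == yr, , drop = FALSE]
  d <- adist(a$key, b$key)                     # Levenshtein distance matrix
  hit <- which(d <= max_dist, arr.ind = TRUE)  # (row, col) pairs within threshold
  data.frame(
    title1 = a$title[hit[, "row"]],
    title2 = b$title[hit[, "col"]],
    release_date = rep(yr, nrow(hit)),
    distance = d[hit],
    stringsAsFactors = FALSE
  )
}

# Only years present in both data frames can produce a match.
years <- intersect(u1$release_date, u2$release_date)
matches <- do.call(rbind, lapply(years, match_year))
print(matches)
```

`lapply` still iterates over years, but the comparison within each year is a single vectorised `adist` call. The resulting `matches` table could then be joined back to the original data frames with `merge` to recover the rating columns for the matched movies only.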