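I have a very large data frame: more than 6 million rows and 28 variables of mixed types (numeric, factor, character). I need to remove duplicated rows. However, the only way to identify a true duplicate is to check one large character variable (roughly 1,000 to 2,000 characters per observation).
I can use the standard duplicated() function without problems, but I am not sure it is the most efficient solution.
Is there a function or package that does this efficiently? Thanks in advance for your suggestions.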
structure(list(city = c("New York", "New York", "New York", "Brussels",
"London", "Arlington"), prodCategory = structure(c(1L, 1L, 1L,
1L, 1L, 1L), .Label = "4", class = "factor"), date = structure(c(16351,
16352, 16351, 16353, 16354, 16355), class = "Date"), userID = c("ABCD",
"XYZZ", "ABCD", "ABCD", "SDFG", "WEDGD"), review = c("in my opinion one of the best pastrami or corned beef sandwiches places in NY (an much more). By the way each sandwich could feed a whole family for days... This establishment is situated close to the theatre district and time square. what a delight it was to see my turkey sandwich arrive. wow it was massive and delicious. ..The celebrity photos were awesome ..highly recommend this place for a true taste treat",
"this is not the usual half-red-lobster place. It is a full experience of super top quality sea food for an amazingly convenient price from basic sandwiches up to fine cuisine each plate is a joy.",
"in my opinion one of the best pastrami or corned beef sandwiches places in NY (an much more). By the way each sandwich could feed a whole family for days... This establishment is situated close to the theatre district and time square. what a delight it was to see my turkey sandwich arrive. wow it was massive and delicious. ..The celebrity photos were awesome ..highly recommend this place for a true taste treat",
"Each time I go to Brussels I stop by this typical brasserie located in the historical heart of Brussels downtown at a walking distance from almost every interesting place. Food is great and the menu is really rich and diversified service is sharp and fast and pricing very reasonable. Dont miss the typical chocolate cake. Actually I should write dont miss... everything included the rich list of Belgian beers",
"That is definitely what I would call great UK pub food --simple tasty not fat/heavy/greasy (... OK not healthy though) well presented service was efficient and overall atmosphere deserves a stop",
"Are you a fan of House of Cards ? Then you have not missed the amazing BBQ place where Frank Underwood loves to go. It looks like Rocklands is right for you. Different atmosphere but same kind of yummy meat"
)), .Names = c("city", "prodCategory", "date", "userID", "review"
), row.names = c(NA, -6L), class = "data.frame")
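For reference, the base R approach mentioned in the question can be restricted to the single column that identifies a duplicate, so that only the long text vector is compared rather than all 28 columns. A minimal sketch on a toy data frame (the name `toy` and its values are made up for illustration, not taken from the real data):

```r
# Toy data: rows 1 and 3 share the same review text (illustrative values only)
toy <- data.frame(
  userID = c("A", "B", "A", "C"),
  review = c("great food", "too salty", "great food", "nice view"),
  stringsAsFactors = FALSE
)

# duplicated() on the single character vector flags repeats of an earlier value
is_dup <- duplicated(toy$review)

# Keep the first occurrence of each review text
deduped <- toy[!is_dup, ]
nrow(deduped)
```

Whether this is fast enough on 6 million long strings would have to be measured on the real data.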
Answer 0 (score: 1)
Try
library(data.table)
setDT(df)  # convert df to a data.table by reference
res <- unique(df, by = "review")  # compare rows on the review column only
dim(res)
#[1] 5 5
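One point worth checking here: as far as I know, since data.table 1.9.8 `unique()` compares all columns by default instead of the key columns, so the column to check has to be passed through `by`. A small sketch of the difference, assuming the data.table package is installed (the table `toy` and its values are hypothetical):

```r
library(data.table)

# Toy table: rows 1 and 3 have the same review but different userID
toy <- data.table(
  userID = c("A", "B", "D", "C"),
  review = c("great food", "too salty", "great food", "nice view")
)

# Default: all columns are compared, so no row counts as a duplicate here
n_all <- nrow(unique(toy))

# by = "review": only the review column is compared; the first row of each
# group of identical reviews is kept
res <- unique(toy, by = "review")
n_by <- nrow(res)
```

In the sample data from the question the two calls give the same result, because the repeated review sits in a row that is identical in every column.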
Answer 1 (score: 1)
Another option, though not necessarily more efficient, is to count the data:
df <- structure(list(city = c("New York", "New York", "New York", "Brussels",
"London", "Arlington"), prodCategory = structure(c(1L, 1L, 1L,
1L, 1L, 1L), .Label = "4", class = "factor"), date = structure(c(16351,
16352, 16351, 16353, 16354, 16355), class = "Date"), userID = c("ABCD",
"XYZZ", "ABCD", "ABCD", "SDFG", "WEDGD"), review = c("in my opinion one of the best pastrami or corned beef sandwiches places in NY (an much more). By the way each sandwich could feed a whole family for days... This establishment is situated close to the theatre district and time square. what a delight it was to see my turkey sandwich arrive. wow it was massive and delicious. ..The celebrity photos were awesome ..highly recommend this place for a true taste treat",
"this is not the usual half-red-lobster place. It is a full experience of super top quality sea food for an amazingly convenient price from basic sandwiches up to fine cuisine each plate is a joy.",
"in my opinion one of the best pastrami or corned beef sandwiches places in NY (an much more). By the way each sandwich could feed a whole family for days... This establishment is situated close to the theatre district and time square. what a delight it was to see my turkey sandwich arrive. wow it was massive and delicious. ..The celebrity photos were awesome ..highly recommend this place for a true taste treat",
"Each time I go to Brussels I stop by this typical brasserie located in the historical heart of Brussels downtown at a walking distance from almost every interesting place. Food is great and the menu is really rich and diversified service is sharp and fast and pricing very reasonable. Dont miss the typical chocolate cake. Actually I should write dont miss... everything included the rich list of Belgian beers",
"That is definitely what I would call great UK pub food --simple tasty not fat/heavy/greasy (... OK not healthy though) well presented service was efficient and overall atmosphere deserves a stop",
"Are you a fan of House of Cards ? Then you have not missed the amazing BBQ place where Frank Underwood loves to go. It looks like Rocklands is right for you. Different atmosphere but same kind of yummy meat"
)), .Names = c("city", "prodCategory", "date", "userID", "review"
), row.names = c(NA, -6L), class = "data.frame")
# do the count
df[with(df, ave(paste(prodCategory, city), userID, FUN=function(x) length(unique(x))))==1,]
city prodCategory date userID
2 New York 4 2014-10-09 XYZZ
5 London 4 2014-10-11 SDFG
6 Arlington 4 2014-10-12 WEDGD
review
2 this is not the usual half-red-lobster place. It is a full experience of super top quality sea food for an amazingly convenient price from basic sandwiches up to fine cuisine each plate is a joy.
5 That is definitely what I would call great UK pub food --simple tasty not fat/heavy/greasy (... OK not healthy though) well presented service was efficient and overall atmosphere deserves a stop
6 Are you a fan of House of Cards ? Then you have not missed the amazing BBQ place where Frank Underwood loves to go. It looks like Rocklands is right for you. Different atmosphere but same kind of yummy meat
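Note what the filter above actually selects: it keeps only users whose rows all share a single prodCategory/city combination, which is why every row of user ABCD is dropped, including the first copy of the repeated review. A reduced sketch of the same counting idea on a toy data frame (names and values are invented for illustration):

```r
# Toy data: user A appears in two different cities, user B in one
toy <- data.frame(
  userID = c("A", "B", "A", "A"),
  city   = c("X", "X", "X", "Y"),
  stringsAsFactors = FALSE
)

# For each row, the number of distinct cities seen for that row's user
n_city <- ave(toy$city, toy$userID, FUN = function(x) length(unique(x)))

# Keep only rows of users with exactly one distinct city
kept <- toy[n_city == 1, ]
kept
```

Here only the row of user B survives, so this is a filter on users rather than a removal of repeated rows.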