I have the MovieLens data in two data frames:
str(u.data)
'data.frame': 100000 obs. of 4 variables:
$ userID : int 196 186 22 244 166 298 115 253 305 6 ...
$ movieID : int 242 302 377 51 346 474 265 465 451 86 ...
$ rating : int 3 3 1 2 1 4 2 5 3 3 ...
$ timestamp: int 881250949 891717742 878887116 880606923 886397596 884182806 881171488 891628467 886324817 883603013 ...
and
str(u.item)
'data.frame': 1681 obs. of 20 variables:
$ unknown : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Action : Factor w/ 2 levels "0","1": 1 2 1 2 1 1 1 1 1 1 ...
$ Adventure : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
$ Animation : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 1 ...
$ Childrens : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 2 1 1 ...
$ Comedy : Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 2 1 1 ...
$ Crime : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
$ Documentary: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Drama : Factor w/ 2 levels "0","1": 1 1 1 2 2 2 2 2 2 2 ...
$ Fantasy : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Film-Noir : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Horror : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Musical : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Mystery : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Romance : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Sci-Fi : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
$ Thriller : Factor w/ 2 levels "0","1": 1 2 2 1 2 1 1 1 1 1 ...
$ War : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
$ Western : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ movieID : int 1 2 3 4 5 6 7 8 9 10 ...
u.data has 100,000 rows:
nrow(u.data)
[1] 100000
and u.item has 1,681 rows:
nrow(u.item)
[1] 1681
Then I want to merge them:
all_data = u.data
all_data = merge(all_data, u.item, by = "movieID")
But the merged data has only 99,999 rows:
nrow(all_data)
[1] 99999
Am I doing something wrong when merging these two data frames?
Answer 0 (score: 0)
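By default, merge() keeps only rows whose movieID appears in both data frames, so this happens whenever u.data contains a movieID that has no match in u.item. That is certainly the case if min(u.data$movieID) < min(u.item$movieID)
or max(u.data$movieID) > max(u.item$movieID)
, but a missing ID anywhere in the range has the same effect. An example where movieID 10 is present in u.data but absent from u.item: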
# max(u.data$movieID) = 10
u.data <- data.frame(movieID = 1:10, NAME = LETTERS[1:10])
dim(u.data)
# [1] 10 2
# max(u.item$movieID) = 11
u.item <- data.frame(movieID = c(1:9,11), name = letters[c(1:9,11)])
dim(u.item)
# [1] 10 2
out <- merge(u.data, u.item, by = "movieID")
dim(out)
# [1] 9 3
# check whether each element of u.data$movieID exists in u.item$movieID
is.element(u.data$movieID, u.item$movieID)
# [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
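To find out which IDs are responsible, rather than only whether any are, the check above can be extended as in the following rough sketch. It reuses the toy data from this answer; missing_ids and dropped_rows are names made up here for illustration.

```r
# Toy data mirroring the example above: movieID 10 is absent from u.item
u.data <- data.frame(movieID = 1:10, NAME = LETTERS[1:10])
u.item <- data.frame(movieID = c(1:9, 11), name = letters[c(1:9, 11)])

# IDs that appear in u.data but have no match in u.item
missing_ids <- setdiff(u.data$movieID, u.item$movieID)
missing_ids
# [1] 10

# Rows of u.data that the default (inner) merge would drop
dropped_rows <- u.data[!(u.data$movieID %in% u.item$movieID), ]
nrow(dropped_rows)
# [1] 1
```

Run against the real data, the same two calls should show which movieID accounts for the single missing row.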
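As Batanichek suggested, use all.x = TRUE to keep every row of u.data, including those without a match in u.item: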
out <- merge(u.data, u.item, by = "movieID", all.x = TRUE)
dim(out)
# [1] 10 3
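With all.x = TRUE, the rows that found no match are kept, and the columns coming from u.item are filled with NA for them. A small sketch on the same toy data shows how those rows can be located afterwards:

```r
# Same toy data as above: movieID 10 has no match in u.item
u.data <- data.frame(movieID = 1:10, NAME = LETTERS[1:10])
u.item <- data.frame(movieID = c(1:9, 11), name = letters[c(1:9, 11)])

# Left join: keep every row of u.data
out <- merge(u.data, u.item, by = "movieID", all.x = TRUE)

# Unmatched rows carry NA in the columns that come from u.item
sum(is.na(out$name))
# [1] 1
out$movieID[is.na(out$name)]
# [1] 10
```

Whether keeping such rows is the right choice depends on the analysis; for the MovieLens data it would leave one rating with NA in every genre column.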