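I have a fairly large dataset of metabolite data. Some of the sets contain unlabeled replicates (there is no column identifying the replicate). Below is a small example.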
a<-structure(list(ABBRC = structure(c(1L, 2L, 2L, 3L, 4L, 4L, 4L
), .Label = c("X1", "X2", "X3", "X4"), class = "factor"), X = 1:7,
Y = 1:7, Year = c(2009L, 2009L, 2009L, 2009L, 2009L, 2009L,
2009L)), .Names = c("ABBRC", "X", "Y", "Year"), class = "data.frame", row.names = c(NA,
-7L))
b<-structure(list(ABBRC = structure(c(1L, 2L, 3L, 4L, 4L, 4L, 4L
), .Label = c("X1", "X2", "X3", "X4"), class = "factor"), Z = c(1L,
2L, 4L, 5L, 6L, 7L, 8L), A = c(1L, 2L, 4L, 5L, 6L, 7L, 8L), Year = c(2009L,
2009L, 2009L, 2009L, 2009L, 2009L, 2009L)), .Names = c("ABBRC",
"Z", "A", "Year"), class = "data.frame", row.names = c(NA, -7L
))
merge(a,b)
ABBRC Year X Y Z A
1 X1 2009 1 1 1 1
2 X2 2009 2 2 2 2
3 X2 2009 3 3 2 2
4 X3 2009 4 4 4 4
5 X4 2009 5 5 5 5
6 X4 2009 5 5 6 6
7 X4 2009 5 5 7 7
8 X4 2009 5 5 8 8
9 X4 2009 6 6 5 5
10 X4 2009 6 6 6 6
11 X4 2009 6 6 7 7
12 X4 2009 6 6 8 8
13 X4 2009 7 7 5 5
14 X4 2009 7 7 6 6
15 X4 2009 7 7 7 7
16 X4 2009 7 7 8 8
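When merging, the output contains every combination of the replicate rows. That is the expected behavior, but it is not what I want. I would like the data merged as if the rows were replicates (which they are). Is there a function that merges this way, or is it easier to label the replicates first and then merge? If labeling is easier, what is a good way to do it?
Desired output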
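To see where the 16 rows come from, here is a minimal sketch that counts rows per key. The vectors key_a and key_b are hand-copied from the ABBRC columns of the example (Year is constant, so it does not change the counts); the variable names are only illustrative.

```r
# ABBRC values as they appear in `a` and `b` in the example
key_a <- c("X1", "X2", "X2", "X3", "X4", "X4", "X4")
key_b <- c("X1", "X2", "X3", "X4", "X4", "X4", "X4")

# Rows per key on each side
counts_a <- table(key_a)
counts_b <- table(key_b)

# An inner join emits (rows in a) * (rows in b) for every shared key
shared <- intersect(names(counts_a), names(counts_b))
rows_per_key <- as.integer(counts_a[shared]) * as.integer(counts_b[shared])
names(rows_per_key) <- shared

print(rows_per_key)       # X4 contributes 3 * 4 = 12 rows
print(sum(rows_per_key))  # total number of rows merge(a, b) returns
```

The per-key products add up to the 16 rows shown above, which is why an extra replicate index is needed to pair rows one to one.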
structure(list(ABBRC = structure(c(1L, 2L, 2L, 3L, 4L, 4L, 4L,
4L), .Label = c("X1", "X2", "X3", "X4"), class = "factor"), X = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, NA), Y = c(1L, 2L, 3L, 4L, 5L, 6L, 7L,
NA), Z = c(1L, 2L, NA, 4L, 5L, 6L, 7L, 8L), A = c(1L, 2L, NA,
4L, 5L, 6L, 7L, 8L), Year = c(2009L, 2009L, 2009L, 2009L, 2009L,
2009L, 2009L, 2009L)), .Names = c("ABBRC", "X", "Y", "Z", "A",
"Year"), class = "data.frame", row.names = c(NA, -8L))
ABBRC X Y Z A Year
1 X1 1 1 1 1 2009
2 X2 2 2 2 2 2009
3 X2 3 3 NA NA 2009
4 X3 4 4 4 4 2009
5 X4 5 5 5 5 2009
6 X4 6 6 6 6 2009
7 X4 7 7 7 7 2009
8 X4 NA NA 8 8 2009
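Answer 0 (score: 2)
Having deleted my painful first attempt, here is another approach, though it is not as good as your own plyr approach. It involves first generating a dummy time variable.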
a$time <- as.numeric(ave(as.character(a$ABBRC), a$ABBRC, a$Year, FUN=seq_along))
b$time <- as.numeric(ave(as.character(b$ABBRC), b$ABBRC, b$Year, FUN=seq_along))
library(reshape2)
ab.long <- rbind(melt(a, id.vars=c("ABBRC", "Year", "time")),
melt(b, id.vars=c("ABBRC", "Year", "time")))
dcast(ab.long, ABBRC + Year + time ~ variable)
# ABBRC Year time X Y Z A
# 1 X1 2009 1 1 1 1 1
# 2 X2 2009 1 2 2 2 2
# 3 X2 2009 2 3 3 NA NA
# 4 X3 2009 1 4 4 4 4
# 5 X4 2009 1 5 5 5 5
# 6 X4 2009 2 6 6 6 6
# 7 X4 2009 3 7 7 7 7
# 8 X4 2009 4 NA NA 8 8
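As a small illustration of what the ave(..., FUN=seq_along) step produces, the sketch below runs it on a toy grouping vector (grp is a hypothetical stand-in for a$ABBRC). Passing an integer vector as the first argument is my own variation: it keeps the result numeric, so the as.character()/as.numeric() round trip is not needed.

```r
# Toy grouping vector, mirroring the ABBRC column of `a`
grp <- c("X1", "X2", "X2", "X3", "X4", "X4", "X4")

# ave() applies FUN within each group and returns the results
# in the original row order, giving a running counter per group
idx <- ave(seq_along(grp), grp, FUN = seq_along)
print(idx)
# 1 1 2 1 1 2 3
```

Each value is the position of the row within its group, which is exactly the role of the time column in the reshape above.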
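Answer 1 (score: 2)
Not sure whether answering your own question is cool, but I figured out how to do this by creating an index variable. Thanks to Hadley for some pointers on plyr / seq_along().
library(plyr)
# Add a within-group replicate counter to each data frame
a <- ddply(a, .(ABBRC), transform, rep = seq_along(ABBRC))
b <- ddply(b, .(ABBRC), transform, rep = seq_along(ABBRC))
# Full outer join on the shared columns (ABBRC, Year, rep)
merge(a, b, all = TRUE)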
ABBRC Year rep X Y Z A
1 X1 2009 1 1 1 1 1
2 X2 2009 1 2 2 2 2
3 X2 2009 2 3 3 NA NA
4 X3 2009 1 4 4 4 4
5 X4 2009 1 5 5 5 5
6 X4 2009 2 6 6 6 6
7 X4 2009 3 7 7 7 7
8 X4 2009 4 NA NA 8 8
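For readers who would rather not load plyr, the same idea can be sketched in base R. This is a substitute for the ddply step, not the code used above: it builds the index with ave(), groups by both ABBRC and Year (an assumption on my part, in case the real data span several years), and rebuilds the example data by hand so the block runs on its own.

```r
# Hand-built copies of the question's `a` and `b`
a <- data.frame(ABBRC = c("X1", "X2", "X2", "X3", "X4", "X4", "X4"),
                X = 1:7, Y = 1:7, Year = 2009L)
b <- data.frame(ABBRC = c("X1", "X2", "X3", "X4", "X4", "X4", "X4"),
                Z = c(1L, 2L, 4L, 5L, 6L, 7L, 8L),
                A = c(1L, 2L, 4L, 5L, 6L, 7L, 8L), Year = 2009L)

# Number the replicates within each ABBRC/Year group
a$rep <- ave(seq_len(nrow(a)), a$ABBRC, a$Year, FUN = seq_along)
b$rep <- ave(seq_len(nrow(b)), b$ABBRC, b$Year, FUN = seq_along)

# Full outer join on the keys plus the new index;
# replicates present on only one side get NA in the other side's columns
merged <- merge(a, b, by = c("ABBRC", "Year", "rep"), all = TRUE)
print(merged)
```

Naming the by columns explicitly guards against merge() silently joining on any other column the two data frames happen to share.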
Answer 2 (score: 0)
There are several ways to tackle this. One is to identify the duplicates before merging:
merge(a, b[!duplicatesFromA, ])
# ABBRC Year X Y Z A
# 1 X4 2009 5 5 8 8
# 2 X4 2009 6 6 8 8
# 3 X4 2009 7 7 8 8
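Of course, there are several ways to find the duplicates. Here is one that uses colSums over nested apply loops; run it before the merge call above so that duplicatesFromA exists.
# TRUE for each row of b that exactly matches some row of a.
# apply() converts each data frame to a matrix, so values are compared by position.
duplicatesFromA <-
  colSums(apply(b, 1, function(row.b) {
    apply(a, 1, function(row.a) {
      all(row.b == row.a)
    })
  })) > 0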
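As one of the other ways alluded to, here is a rough sketch that avoids the nested loops by pasting each row into a single string key and using %in%. It is an alternative I am suggesting, not part of the approach above. Like the apply version it compares columns by position, and it assumes the separator "|" never occurs in the data. The data frames are rebuilt by hand so the block is self-contained.

```r
# Hand-built copies of the question's `a` and `b`
a <- data.frame(ABBRC = c("X1", "X2", "X2", "X3", "X4", "X4", "X4"),
                X = 1:7, Y = 1:7, Year = 2009L)
b <- data.frame(ABBRC = c("X1", "X2", "X3", "X4", "X4", "X4", "X4"),
                Z = c(1L, 2L, 4L, 5L, 6L, 7L, 8L),
                A = c(1L, 2L, 4L, 5L, 6L, 7L, 8L), Year = 2009L)

# One string key per row, built from all columns in order
# (unname() keeps column names from being read as paste() arguments)
key_a <- do.call(paste, c(unname(as.list(a)), sep = "|"))
key_b <- do.call(paste, c(unname(as.list(b)), sep = "|"))

# TRUE where a row of b also appears, value for value, in a
duplicatesFromA <- key_b %in% key_a
print(duplicatesFromA)
```

On the example data only the last row of b (X4 with Z = A = 8) has no counterpart in a, so it is the only row kept by b[!duplicatesFromA, ].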