Consider the following data frames:
> tail(tot.final)
names.id sequence names.reads width.reads names.counts st end flag
819 125546 TAGCTTATATGACTGATGTTGACA 125546-4 24 4 8 31 TRUE
820 218783 TCGCTTATCAGACTGATGTTGAAA 218783-2 24 2 8 31 TRUE
821 272992 CAGCTTATCAGACTGATGTTGAAA 272992-2 24 2 8 31 TRUE
822 135191 TAGCTTATCAGACTGATGTTGAACA 135191-4 25 4 8 32 TRUE
823 278047 TAGCTTATCAGACTGATGTTGAAGA 278047-2 25 2 8 32 TRUE
824 317980 TAGCTTATCAGACTGATGTTGCCCT 317980-2 25 2 8 32 TRUE
> head(plusa)
names.id sequence names.reads width.reads names.counts st end flag
2 28092 ATCAGACTGATGTTGAC 28092-29 17 29 14 30 TRUE
4 65308 TTATCAGACTGATGTTGA 65308-10 18 10 12 29 TRUE
6 71226 TATCAGACTGATGTTGAC 71226-9 18 9 13 30 TRUE
> nrow(tot.final)
[1] 824
> nrow(plusa)
[1] 421
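plusa contains 421 rows (per the nrow output above) and shares the sequence column with tot.final. (not sorted)
I want to update the tot.final$names.counts elements by adding the plusa$names.counts value of the matching plusa$sequence.
Is it possible to merge them this way, treating the "sequence" field as the id?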
Answer 0 (score: 0)
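From what I understand, you want to add the counts from plusa to tot.final, matching rows on sequence.
In that case you can use the plyr library. I've made an example to illustrate, which you should be able to adapt to your data: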
library(plyr)
df.final <- data.frame(sequence=c('A','B','C','D'),
counts=c(100,123,234,200),
stringsAsFactors=FALSE)
# sequence counts
# 1 A 100
# 2 B 123
# 3 C 234
# 4 D 200
df.plusa <- data.frame(sequence=c('A','E','C','F'),
counts=c(10,20,30,40),
stringsAsFactors=FALSE)
# sequence counts
# 1 A 10
# 2 E 20
# 3 C 30
# 4 F 40
# rbind together and do the counts:
df.final.aggregated <- ddply(rbind(df.final,df.plusa),
.(sequence),
summarise,
counts=sum(counts))
# sequence counts
# 1 A 110
# 2 B 123
# 3 C 264
# 4 D 200
# 5 E 20
# 6 F 40
Note that ddply(dataframe, .(sequence), FUNCTION) means:
for each unique seq in dataframe$sequence:
do FUNCTION( dataframe[ dataframe$sequence==seq, ] )
merge them all back into one big dataframe.
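The three steps above can also be written out in base R, which may make it clearer what ddply is doing. This is only a sketch: the data frame df and the names pieces, summed and result are made up for illustration, and this is not how plyr is implemented internally.

```r
# Toy data (hypothetical): the two example data frames already stacked with rbind
df <- data.frame(sequence = c("A", "B", "C", "D", "A", "E", "C", "F"),
                 counts   = c(100, 123, 234, 200, 10, 20, 30, 40),
                 stringsAsFactors = FALSE)

# 1. split: one sub-data-frame per unique sequence
pieces <- split(df, df$sequence)

# 2. apply: run the summarising function on each piece
summed <- lapply(pieces, function(piece) {
  data.frame(sequence = piece$sequence[1],
             counts   = sum(piece$counts),
             stringsAsFactors = FALSE)
})

# 3. combine: stack the per-group results back into one data frame
result <- do.call(rbind, summed)
rownames(result) <- NULL
result
```

For a plain sum, aggregate(counts ~ sequence, data = df, FUN = sum) in base R should give the same table in one call.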
For your particular data, this might work (untested, since I don't have your data):
ddply( rbind(tot.final,plusa), .(sequence), summarise,
names.counts = sum(names.counts) )
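Note that the call above returns only the sequence and names.counts columns, and it also keeps sequences that occur only in plusa. If the goal is instead to update tot.final in place and keep its other columns, something along these lines might be closer. It is a sketch on made-up stand-in data (tot, plus and the values in them are hypothetical), and it assumes sequences must match exactly:

```r
# Hypothetical stand-ins for tot.final and plusa (only a few columns)
tot <- data.frame(sequence     = c("A", "B", "C", "D"),
                  names.counts = c(100, 123, 234, 200),
                  width.reads  = c(24, 24, 25, 25),
                  stringsAsFactors = FALSE)
plus <- data.frame(sequence     = c("A", "E", "C", "A"),
                   names.counts = c(10, 20, 30, 5),
                   stringsAsFactors = FALSE)

# Sum the plus counts per sequence, in case a sequence occurs more than once
plus.sums <- tapply(plus$names.counts, plus$sequence, sum)

# Position of each tot sequence in the summed table (NA when it is absent)
idx <- match(tot$sequence, names(plus.sums))
hit <- !is.na(idx)

# Add the matching sums; rows without a match keep their original count
updated <- tot
updated$names.counts[hit] <- tot$names.counts[hit] + as.vector(plus.sums[idx[hit]])
updated
```

Sequences present only in plus ("E" here) are ignored, and every other column of tot is left untouched, so a column such as names.reads, which appears to embed the count, would need to be rebuilt separately.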