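I am trying to compute the values of a new column for a data.table `dt`. Part of the calculation comes from a data.frame `df` (it could also be a data.table; so far I have not needed it to be one).
If the factor level (here: `sample`) matches, how can I compute a new column using values from two different objects? I used to merge the two objects and compute row by row, but that produces a lot of redundant data.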
Here is the data.frame, which has only 10 rows:
df
sample scaling_factor
A1 A1 111956565
A2 A2 89869320
A3 A3 120925219
A4 A4 111757559
A5 A5 77319341
A6 A6 89403194
A7 A7 150214981
B8 B8 133885925
B9 B9 86536587
B10 B10 123574939
df <- structure(list(sample = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L,
9L, 10L, 8L), .Label = c("A1", "A2", "A3", "A4", "A5", "A6",
"A7", "B10", "B8", "B9"), class = "factor"), scaling_factor = c(111956565.427018,
89869319.9348599, 120925219.4453, 111757558.886234, 77319340.5841949,
89403194.1170576, 150214980.784589, 133885925.080984, 86536586.7136393,
123574939.026597)), .Names = c("sample", "scaling_factor"), class = "data.frame", row.names = c("A1",
"A2", "A3", "A4", "A5", "A6", "A7", "B8", "B9", "B10"))
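Here is the data.table, which has several hundred thousand rows per sample (I ran into problems with `<` in the output when dumping it, so a reproducible version is not provided here):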
setDT(dt)
sample contig_id product_reads_rpk
1: A1 contig_10 2000.00000
2: A1 contig_100 24.27184
3: A1 contig_1000 1713.90374
4: A1 contig_10000 2900.66225
5: A1 contig_100003 1713.94231
6: A1 contig_100004 8575.23511
7: A1 contig_100004 11059.32203
8: A2 contig_100009 6923.67400
9: A2 contig_100010 1285.30259
10: A2 contig_100015 84.74576
dt[,product_rpm := product_reads_rpk/(df$scaling_factor/1000000), by = sample]
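I am trying to create a new column `product_rpm` in `dt`, based on the corresponding value for each sample in `df`. How do I do that? I get `longer object length is not a multiple of shorter object length`, but the shorter object has length 1, e.g. the `df` value for A1, right?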
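Answer 0: (score: 1)
I don't know how to achieve this without actually merging the two datasets, but if you merge them the data.table way, you can avoid creating redundant columns.
So, in your case, it is simply: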
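df <- data.table(df)
# Update join: look up scaling_factor by sample and add product_rpm to dt by reference.
# The parentheses keep the formula from the question: rpk / (scaling_factor / 1e6).
dt[df, product_rpm := product_reads_rpk / (scaling_factor / 1000000), on = "sample"]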
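To make the update join concrete, here is a minimal sketch on a few rows taken from the question. The two small tables are toy stand-ins (values rounded as printed in the post), not the real data, and the check at the end is only an illustration.

```r
library(data.table)

# Toy stand-ins for the question's objects (assumption: rounded values from the post)
df <- data.table(sample = c("A1", "A2"),
                 scaling_factor = c(111956565, 89869320))
dt <- data.table(sample = c("A1", "A1", "A2"),
                 contig_id = c("contig_10", "contig_100", "contig_100009"),
                 product_reads_rpk = c(2000, 24.27184, 6923.674))

# Update join: each row of dt takes scaling_factor from the df row with the
# same sample; product_rpm is added to dt by reference, no merged copy is kept
dt[df, product_rpm := product_reads_rpk / (scaling_factor / 1e6), on = "sample"]

print(dt)
```

After the join, `dt` has gained only `product_rpm`; `scaling_factor` is used during the calculation but is not stored as a column in `dt`.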
A simple example:
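library(data.table)
set.seed(1)  # make the random example reproducible
# dt1: one row per id (the small lookup table)
dt1 <- data.table(id = sample(1000:9999, size = 100),
                  size = sample(10000:99999, size = 100))
# dt2: ten rows per id (the large table)
dt2 <- data.table(id = rep(dt1$id, 10),
                  group = rep(LETTERS[1:5], 200),
                  value = sample(1000:9999, size = 100 * 10, replace = TRUE))
# := adds metric to dt2 by reference; dt3 refers to the same table, not a copy
dt3 <- dt2[dt1, metric := (value / size), on = "id"]
head(dt3)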
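For comparison with the join, here is a sketch of a plain lookup with base R `match()`, which is a different technique from the one in this answer. It also suggests why the attempt in the question warns: `df$scaling_factor` is always the full length-10 vector, whatever `by = sample` selects, so it is recycled against each group instead of being reduced to one value. The tables below are toy stand-ins built from values printed in the question.

```r
library(data.table)

# Toy stand-ins (assumption: a few rounded values from the post)
df <- data.frame(sample = factor(c("A1", "A2")),
                 scaling_factor = c(111956565, 89869320))
dt <- data.table(sample = c("A1", "A1", "A2", "A2"),
                 product_reads_rpk = c(2000, 24.27184, 6923.674, 1285.30259))

# match() gives, for each row of dt, the row of df holding the same sample;
# as.character() avoids comparing factor codes with strings
idx <- match(as.character(dt$sample), as.character(df$sample))

# Indexing by idx yields one scaling factor per row of dt, so nothing is recycled
dt[, product_rpm := product_reads_rpk / (df$scaling_factor[idx] / 1e6)]

print(dt)
```

Samples in `dt` that are missing from `df` would get `NA` here, the same as rows left unmatched by the update join.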