Row-wise pairwise differences in R

Time: 2015-07-06 14:23:21

Tags: r

I would like to know the best way to do this.

Starting from data in this format:

gene Sample1 Sample2 Sample3 ....
A   0 2 0
B   1 1 3
C 1.32 3.21 3.33
....

I want to get to this result:

gene Sample1 Sample2 Sample3 ....
A-B -1 1 -3 
A-C -1.32 -1.21 -3.33
A-D
...
B-C
...

Any suggestions would be appreciated. Thanks!

The data set is very large, but I want to process it quickly.

2 answers:

Answer 0 (score: 1)

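expand.grid() is well suited to building a data frame that contains every combination of several vectors. In this case, you want every combination of a vector with itself. The code below does what you are asking for, although there may be faster ways. For N = 1,000 it takes about 7 seconds on my machine.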

set.seed(1)
N <- 5
d <- data.frame(gene = 1:N, sample.1 = sample(N), sample.2 = sample(N))
head(d)
# Output from the original answer; sample() draws differ in R >= 3.6.0
#   gene sample.1 sample.2
# 1    1        2        5
# 2    2        5        4
# 3    3        4        2
# 4    4        3        3
# 5    5        1        1

df <- expand.grid(list(d$gene, d$gene))
df <- merge(df, d, by.x = "Var1", by.y = "gene")
df <- merge(df, d, by.x = "Var2", by.y = "gene")
df$gene.diff <- paste(df$Var1, "-", df$Var2)
df$sample.1.diff <- df$sample.1.x - df$sample.1.y
df$sample.2.diff <- df$sample.2.x - df$sample.2.y

# Only need one difference between each pair of genes:
df <- df[df$Var1 > df$Var2, ]

df <- df[, names(df) %in% c("gene.diff", grep("diff", names(df), value = TRUE))]
head(df, n = 8)

#    gene.diff sample.1.diff sample.2.diff
# 2      2 - 1             3            -1
# 3      3 - 1             2            -3
# 4      4 - 1             1            -2
# 5      5 - 1            -1            -4
# 8      3 - 2            -1            -2
# 9      4 - 2            -2            -1
# 10     5 - 2            -4            -3
# 14     4 - 3            -1             1
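
As a small illustration of what expand.grid() and the Var1 > Var2 filter contribute in the code above, here is a sketch on a hypothetical four-element vector. The names genes, all_pairs and unique_pairs are made up for this example and do not appear in the answer.

```r
# Hypothetical gene identifiers, for illustration only
genes <- 1:4

# expand.grid() returns every ordered pair, including self-pairs such as (1, 1)
all_pairs <- expand.grid(Var1 = genes, Var2 = genes)

# Keeping Var1 > Var2 drops self-pairs and keeps one ordering of each pair
unique_pairs <- all_pairs[all_pairs$Var1 > all_pairs$Var2, ]

nrow(all_pairs)     # 16, i.e. N^2
nrow(unique_pairs)  # 6, i.e. N * (N - 1) / 2
```

The filter is applied only after the full N^2 table has been built and merged, which is one plausible reason the approach slows down as N grows.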

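Answer 1 (score: 1)

This solution eliminates the two keyed merge steps, using a single Cartesian join instead, and takes better advantage of vectorized, matrix-style arithmetic.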

#reproducible example!
data <- data.frame(gene=LETTERS[1:3], Sample1=c(0,1,1.321), 
                     Sample2 = c(2,1,3.21), Sample3=c(0,3,3.33))

# hooray for cartesian join
combos <- subset(merge(data,data,by=NULL, suffixes=c(".1",".2")), gene.1 != gene.2)

gene1_vals <- combos[,2:ncol(data)]
gene2_vals <- combos[,(ncol(data)+2):(2*ncol(data))]
gene_diff_txt <- paste(combos[,1], combos[,ncol(data)+1],sep="-")

gene_diffs <- data.frame(gene1_vals - gene2_vals)
names(gene_diffs) <- paste0("Sample",1:ncol(gene1_vals))
data.frame(gene=gene_diff_txt, gene_diffs)[order(combos$gene.1, combos$gene.2),]
#   gene Sample1 Sample2 Sample3
# 4  A-B  -1.000    1.00   -3.00
# 7  A-C  -1.321   -1.21   -3.33
# 2  B-A   1.000   -1.00    3.00
# 8  B-C  -0.321   -2.21   -0.33
# 3  C-A   1.321    1.21    3.33
# 6  C-B   0.321    2.21    0.33
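
If only one difference per unordered pair is needed (A-B but not B-A, as in the question), one possible alternative is to index a numeric matrix with the pair indices returned by combn(). This is a sketch rather than part of either answer. It assumes the first column holds the gene names and every other column is numeric; the names m, pairs, i, j, diffs and result are illustrative.

```r
# Example data taken from the question
data <- data.frame(gene = c("A", "B", "C"),
                   Sample1 = c(0, 1, 1.32),
                   Sample2 = c(2, 1, 3.21),
                   Sample3 = c(0, 3, 3.33))

# Numeric matrix of sample values, one row per gene
m <- as.matrix(data[, -1])

# Each column of `pairs` holds the row indices (i, j) of one pair with i < j
pairs <- combn(nrow(m), 2)
i <- pairs[1, ]
j <- pairs[2, ]

# Subtract the second gene's row from the first gene's row for all pairs at once
diffs <- m[i, , drop = FALSE] - m[j, , drop = FALSE]

result <- data.frame(gene = paste(data$gene[i], data$gene[j], sep = "-"),
                     diffs, row.names = NULL)
print(result)
```

For the question's data this yields the rows A-B, A-C and B-C. The result still has N(N-1)/2 rows, so memory use grows quadratically with the number of genes however the pairs are generated.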