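Merging two tables on two columns, regardless of the order of the values

Asked: 2016-11-09 11:07:48

Tags: r merge

I have two tables with different numbers of rows. I want to merge them based on the contents of two columns. However, I don't want the order of the two values to matter when merging. For example:

Table 1: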

Gene1  Gene2  p-value
ARID1A  TP53  0.0007
ATM     ATR   0.004

Table 2: same columns, but a gene pair may appear in the opposite order (for example TP53, ARID1A).

I tried:

merge(Table1, Table2, by = c("Gene1", "Gene2"), all.x = TRUE)

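But the problem is that this only merges 'ATM' and 'ATR', not 'TP53' and 'ARID1A', because the two genes are not listed in the same order.

Is there a way to merge the two tables regardless of which column each gene is in?

2 Answers:

Answer 0 (score: 3)

Using sqldf: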

library(sqldf)

sqldf("
SELECT df1.*, 
       df2.`p.value` 
FROM   df1, df2 
WHERE (df1.Gene1 = df2.Gene1 AND
       df1.Gene2 = df2.Gene2) OR
      (df1.Gene1 = df2.Gene2 AND
       df1.Gene2 = df2.Gene1)")

#   Gene1  Gene2 p.value p.value
# 1  TP53 ARID1A   1e-03   7e-04
# 2   ATM    ATR   5e-04   4e-03

Answer 1 (score: 1)

We can sort the gene names within each row and then merge on the sorted keys:

#sort gene names
df1$GeneMin <- pmin(df1$Gene1, df1$Gene2)
df1$GeneMax <- pmax(df1$Gene1, df1$Gene2)

df2$GeneMin <- pmin(df2$Gene1, df2$Gene2)
df2$GeneMax <- pmax(df2$Gene1, df2$Gene2)

# then merge
merge(df1, df2, by = c("GeneMin", "GeneMax"))
#   GeneMin GeneMax Gene1.x Gene2.x p.value.x Gene1.y Gene2.y p.value.y
# 1  ARID1A    TP53    TP53  ARID1A     1e-03  ARID1A    TP53     7e-04
# 2     ATM     ATR     ATM     ATR     5e-04     ATM     ATR     4e-03

# tidy up columns, column names
#....
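
The tidy-up step elided above could look something like the following. This is only a minimal sketch, assuming that just the p-value from df2 is wanted and that df1's original Gene1/Gene2 columns should be kept; the suffixes `.1`/`.2` and the name `res` are illustrative choices, not part of the original answer.

```r
# Example data, same as in the data section of this answer
df1 <- read.table(text = "
Gene1 Gene2   p-value
TP53  ARID1A  0.001
ATM   ATR     0.0005", header = TRUE, as.is = TRUE)

df2 <- read.table(text = "
Gene1  Gene2  p-value
ARID1A  TP53  0.0007
ATM     ATR   0.004", header = TRUE, as.is = TRUE)

# Order-independent keys: alphabetically smaller / larger gene name per row
df1$GeneMin <- pmin(df1$Gene1, df1$Gene2)
df1$GeneMax <- pmax(df1$Gene1, df1$Gene2)
df2$GeneMin <- pmin(df2$Gene1, df2$Gene2)
df2$GeneMax <- pmax(df2$Gene1, df2$Gene2)

# Bring over only the keys and the p-value from df2, so Gene1/Gene2
# are not duplicated; all.x = TRUE keeps unmatched rows of df1
res <- merge(df1, df2[, c("GeneMin", "GeneMax", "p.value")],
             by = c("GeneMin", "GeneMax"), all.x = TRUE,
             suffixes = c(".1", ".2"))

# Drop the helper keys and keep a readable column order
res <- res[, c("Gene1", "Gene2", "p.value.1", "p.value.2")]
res
```

Here `p.value.1` comes from df1 and `p.value.2` from df2; rows of df1 without a partner in df2 would get `NA` in `p.value.2`.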

Or we can merge twice, once for each orientation of the pair, and then rbind the results:

# double merge, this might cause unexpected results
rbind(
  merge(df1, df2, by = c("Gene1", "Gene2")),
  merge(df1, df2, by.x = c("Gene1", "Gene2"), by.y = c("Gene2", "Gene1"))
  )
#   Gene1  Gene2 p.value.x p.value.y
# 1   ATM    ATR     5e-04     4e-03
# 2  TP53 ARID1A     1e-03     7e-04
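
One case where the double merge can give unexpected results is a pair that matches in both orientations, for instance a row where Gene1 and Gene2 are the same gene. The sketch below uses made-up data (the data frames `a` and `b` are hypothetical, not from the question) to show the duplicate and one possible way to remove it with `unique()`.

```r
# Hypothetical data: a self-pair that is present in both tables
a <- data.frame(Gene1 = "TP53", Gene2 = "TP53", p.value = 0.01,
                stringsAsFactors = FALSE)
b <- data.frame(Gene1 = "TP53", Gene2 = "TP53", p.value = 0.02,
                stringsAsFactors = FALSE)

# Same double merge as above
both <- rbind(
  merge(a, b, by = c("Gene1", "Gene2")),
  merge(a, b, by.x = c("Gene1", "Gene2"), by.y = c("Gene2", "Gene1"))
)
nrow(both)     # 2: the same match is returned by both merges

# Dropping exact duplicate rows restores a single match
deduped <- unique(both)
nrow(deduped)  # 1
```

Note that `unique()` only removes rows that are identical in every column, so it would not collapse a pair that df2 lists twice with different p-values.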

Data

# data
df1 <- read.table(text = "
Gene1 Gene2   p-value
TP53  ARID1A  0.001
ATM   ATR     0.0005", header = TRUE, as.is = TRUE)

df2 <- read.table(text = "
Gene1  Gene2  p-value
ARID1A  TP53  0.0007
ATM     ATR   0.004", header = TRUE, as.is = TRUE)