Replacing cells in an R data frame with matching IDs

Time: 2015-07-29 23:26:57

Tags: r dataframe match matching

I have data in a table:

Input_SNP       Set_1     Set_2     Set_3     Set_4     Set_5    Set_6     Set_7
rs70812    4:12309   7:189029   2:2134   17:43232  12:51123  11:15123  19:4312
rs34812    5:61233   2:571022  1:57012   3:537012  14:57123  4:57129   1:61507
rs15602    1:571209  12:34120  9:41236   12:32417  3:57120   9:34123   3:41235
rs90143    7:83541   9:659123  5:23412   16:98234  18:472351 20:12357  1:13421
rs70823    14:89023  13:42081  8:32098   5:431332  9:234134  13:7831   2:74012
rs100980   11:51003  1:100098  10:409123 12:412309 13:34123  16:431098 3:58023
rs10341    18:90312  15:609123 1:70923   2:102358  5:019824  17:120394 9:80123

I actually have 10,000 sets and about 4,000 rows, but this is a good sample. I also have a second file:

set snpID     rsMatch
1   4:12309   rs241984
2   7:189029  rs104141
3   2:2134    rs485506
4   17:43232  rs345180
5   12:51123  rs129819
6   11:15123  rs757492
7   19:4312   rs711403
1   5:61233   rs341098
2   2:571022  rs512309
3   1:57012   rs120394
4   3:537012  rs510293
5   14:571234 rs234098
6   4:57129   rs71302
7   1:61507   rs234109
1   1:571209  rs09384
... ...       ...

I want to replace the numeric format in my Set_1, Set_2, Set_3, etc. columns with the corresponding rsMatch format, like this:

    Input_SNP  Set_1     Set_2     Set_3     Set_4     Set_5     Set_6     Set_7
    rs70812    rs241984  rs104141  rs485506  rs345180  rs129819  rs757492  rs711403
    rs34812    rs341098  rs512309  rs120394  rs510293  rs234098  rs71302   rs234109
    rs15602    rs098384  ...       ...       ...       ...       ...       ...
    ...        ...       ...       ...       ...       ...       ...       ...

Do you have any suggestions on how to do this? I was thinking of an R data frame, but I am happy with anything...

4 answers:

Answer 0: (score: 2)

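You should work on a copy, but I live dangerously and work on the original. First, we need to match the values in the Set_n columns against the second input data frame: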

 sapply(inp1[-1], match, inp2$snpID)
     Set_1 Set_2 Set_3 Set_4 Set_5 Set_6 Set_7
[1,]     1     2     3     4     5     6     7
[2,]     8     9    10    11    NA    13    14
[3,]    15    NA    NA    NA    NA    NA    NA
[4,]    NA    NA    NA    NA    NA    NA    NA
[5,]    NA    NA    NA    NA    NA    NA    NA
[6,]    NA    NA    NA    NA    NA    NA    NA
[7,]    NA    NA    NA    NA    NA    NA    NA

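You did not give us all the required values, but the NAs will be needed as placeholders. These values are index positions in the second data frame. Note that it is transposed (which is easy to fix with t()):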

The next step is to replace the items with the lookup values from the rsMatch column:

inp1[-1][] <- inp2$rsMatch[ t(sapply(inp1[-1], match, inp2$snpID)) ]
#----------------
> inp1
  Input_SNP    Set_1    Set_2   Set_3 Set_4 Set_5 Set_6 Set_7
1   rs70812 rs241984 rs341098 rs09384  <NA>  <NA>  <NA>  <NA>
2   rs34812 rs104141 rs512309    <NA>  <NA>  <NA>  <NA>  <NA>
3   rs15602 rs485506 rs120394    <NA>  <NA>  <NA>  <NA>  <NA>
4   rs90143 rs345180 rs510293    <NA>  <NA>  <NA>  <NA>  <NA>
5   rs70823 rs129819     <NA>    <NA>  <NA>  <NA>  <NA>  <NA>
6  rs100980 rs757492  rs71302    <NA>  <NA>  <NA>  <NA>  <NA>
7   rs10341 rs711403 rs234109    <NA>  <NA>  <NA>  <NA>  <NA>
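In the output above the replacements run down each column rather than across each row, which suggests that the t() call may not be needed here. The following is a minimal sketch, not the answer's own code, that looks up each Set_n column separately with match(). The small inp1 and inp2 data frames are cut-down stand-ins for the question's sample, and lapply() is used in place of sapply() as an assumption so that the result keeps one element per column even for a single-row input.

```r
# Cut-down stand-ins for the question's two tables (assumed to be character columns)
inp1 <- data.frame(
  Input_SNP = c("rs70812", "rs34812"),
  Set_1 = c("4:12309", "5:61233"),
  Set_2 = c("7:189029", "2:571022"),
  Set_3 = c("2:2134", "1:57012"),
  stringsAsFactors = FALSE
)
inp2 <- data.frame(
  set     = c(1, 2, 3, 1, 2, 3),
  snpID   = c("4:12309", "7:189029", "2:2134", "5:61233", "2:571022", "1:57012"),
  rsMatch = c("rs241984", "rs104141", "rs485506", "rs341098", "rs512309", "rs120394"),
  stringsAsFactors = FALSE
)

# Work on a copy; replace every Set_n column with the rsMatch value whose
# snpID matches. IDs without a match become NA.
out <- inp1
out[-1] <- lapply(inp1[-1], function(ids) inp2$rsMatch[match(ids, inp2$snpID)])

print(out)
```

Because each column is looked up on its own, the row order of inp1 is preserved and no transpose is involved.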

Second attempt: the index could be 'cbind(1.1 + (.9:nrow(inp2)) %/% 7, inp2$set + 1)', which does succeed, but the rep()-based version shown below is a bit more robust.

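   # Rows of inp2 are assumed to come in blocks of 7, one block per row of inp1.
   # The column index is the set number + 1, because column 1 is Input_SNP.
   out1 <- inp1
   row_idx <- rep(1:nrow(inp2), each = 7, length.out = nrow(inp2))
   out1[cbind(row_idx, inp2$set + 1)] <- inp2$rsMatch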

> out1
  Input_SNP    Set_1     Set_2     Set_3     Set_4     Set_5     Set_6    Set_7
1   rs70812 rs241984  rs104141  rs485506  rs345180  rs129819  rs757492 rs711403
2   rs34812 rs341098  rs512309  rs120394  rs510293  rs234098   rs71302 rs234109
3   rs15602  rs09384  12:34120   9:41236  12:32417   3:57120   9:34123  3:41235
4   rs90143  7:83541  9:659123   5:23412  16:98234 18:472351  20:12357  1:13421
5   rs70823 14:89023  13:42081   8:32098  5:431332  9:234134   13:7831  2:74012
6  rs100980 11:51003  1:100098 10:409123 12:412309  13:34123 16:431098  3:58023
7   rs10341 18:90312 15:609123   1:70923  2:102358  5:019824 17:120394  9:80123

It seems to me that the request does not actually use the Input_SNP values in the matching.
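The positional fill above depends on inp2 being ordered in blocks of seven. As a hedged alternative, here is a minimal sketch that matches on the set number and the snpID together, so the row order of inp2 does not matter. The toy data and the names out2, lookup_k and hit are assumptions made for illustration; unmatched cells keep their original value here instead of becoming NA.

```r
inp1 <- data.frame(
  Input_SNP = c("rs70812", "rs34812"),
  Set_1 = c("4:12309", "5:61233"),
  Set_2 = c("7:189029", "2:571022"),
  stringsAsFactors = FALSE
)
# Lookup rows deliberately shuffled; "2:571022" has no entry for set 2
inp2 <- data.frame(
  set     = c(2, 1, 1),
  snpID   = c("7:189029", "5:61233", "4:12309"),
  rsMatch = c("rs104141", "rs341098", "rs241984"),
  stringsAsFactors = FALSE
)

out2 <- inp1
for (k in unique(inp2$set)) {
  col <- paste0("Set_", k)              # column of inp1 that belongs to set k
  lookup_k <- inp2[inp2$set == k, ]     # lookup rows for this set only
  hit <- match(inp1[[col]], lookup_k$snpID)
  # Keep the original ID where no rsMatch was found
  out2[[col]] <- ifelse(is.na(hit), inp1[[col]], lookup_k$rsMatch[hit])
}

print(out2)
```

Restricting the lookup to one set at a time also avoids picking up an rsMatch from a different set if the same snpID ever appears in more than one set.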

Answer 1: (score: 2)

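You can solve this with merge after a suitable reshaping. I used library(reshape2) to get the data into the right shape for merging and then back into the output format.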

#read in files
df1<-read.table("file1",header=TRUE,stringsAsFactors=FALSE)   
df2<-read.table("file2",header=TRUE,stringsAsFactors=FALSE)

library(reshape2)
m1<-melt(df1,id.vars="Input_SNP")
m2<-transform(df2,variable=paste0("Set_",set),value=snpID)
m<-merge(m1,m2)
out<-dcast(m,Input_SNP~variable,value.var="rsMatch")

print(out)

  Input_SNP    Set_1    Set_2    Set_3    Set_4    Set_5    Set_6    Set_7
1   rs15602  rs09384     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>
2   rs34812 rs341098 rs512309 rs120394 rs510293     <NA>  rs71302 rs234109
3   rs70812 rs241984 rs104141 rs485506 rs345180 rs129819 rs757492 rs711403
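The output above contains only three rows because merge() performs an inner join by default, so cells without a lookup entry drop out. The following base-R sketch (no reshape2, with toy data and the names long and lookup introduced purely for illustration) shows the difference between the default and all.x = TRUE:

```r
df1 <- data.frame(
  Input_SNP = c("rs70812", "rs90143"),
  Set_1 = c("4:12309", "7:83541"),
  Set_2 = c("7:189029", "9:659123"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  set     = c(1, 2),
  snpID   = c("4:12309", "7:189029"),
  rsMatch = c("rs241984", "rs104141"),
  stringsAsFactors = FALSE
)

# Long format: one row per (Input_SNP, set column, cell value)
set_cols <- names(df1)[-1]
long <- data.frame(
  Input_SNP = rep(df1$Input_SNP, times = length(set_cols)),
  variable  = rep(set_cols, each = nrow(df1)),
  value     = unlist(df1[set_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Lookup table with the same key columns as the long table
lookup <- data.frame(
  variable = paste0("Set_", df2$set),
  value    = df2$snpID,
  rsMatch  = df2$rsMatch,
  stringsAsFactors = FALSE
)

inner <- merge(long, lookup)                # unmatched cells are dropped
left  <- merge(long, lookup, all.x = TRUE)  # unmatched cells kept, rsMatch is NA

print(inner)
print(left)
```

With all.x = TRUE every Input_SNP survives the merge, so a subsequent reshape back to wide format would keep all rows of the original table.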

Answer 2: (score: 1)

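Forgive me in advance, but I see Excel and SQL solutions here, since you are relating two different datasets (i.e. database tables or worksheets). Either solution can still be integrated as a data preparation step before importing into R. This may be more for future readers than for the OP.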

Excel solution

A simple VLookup or Index/Match (see the example using worksheets named RsmatchWide and RsmatchLong). IFERROR() is used to remove #N/A errors.

=IFERROR(INDEX(RsmatchLong!$C$2:$C$16, 
         MATCH(RsmatchWide!B2,RsmatchLong!$B$2:$B$16, FALSE)), "")

=IFERROR(VLOOKUP(RsmatchWide!B2,RsmatchLong!$B$2:$C$16,2,FALSE),"")

[Screenshot: RsMatch in Excel]

Once it is ready, save the worksheet as csv and import it into R:

df <- read.csv("C:/Path/To/RsMatchDataset.csv")

SQL solution

Run a select query with one subquery per set (the example below uses MS Access, but it should work in any SQL dialect, including SQLite, MySQL, SQL Server, etc.):

SELECT rFinal.Input_SNP,

  (SELECT RsmatchLong.rsMatch
   FROM RsmatchLong INNER JOIN RsmatchWide r1 ON RsmatchLong.snpID = r1.Set_1
   WHERE r1.Input_SNP = rFinal.Input_SNP) As Set_1,

  (SELECT RsmatchLong.rsMatch
   FROM RsmatchLong INNER JOIN RsmatchWide r2 ON RsmatchLong.snpID = r2.Set_2
   WHERE r2.Input_SNP = rFinal.Input_SNP) As Set_2,

  (SELECT RsmatchLong.rsMatch
   FROM RsmatchLong INNER JOIN RsmatchWide r3 ON RsmatchLong.snpID = r3.Set_3
   WHERE r3.Input_SNP = rFinal.Input_SNP) As Set_3,

  (SELECT RsmatchLong.rsMatch
   FROM RsmatchLong INNER JOIN RsmatchWide r4 ON RsmatchLong.snpID = r4.Set_4
   WHERE r4.Input_SNP = rFinal.Input_SNP) As Set_4,

  (SELECT RsmatchLong.rsMatch
   FROM RsmatchLong INNER JOIN RsmatchWide r5 ON RsmatchLong.snpID = r5.Set_5
   WHERE r5.Input_SNP = rFinal.Input_SNP) As Set_5,

  (SELECT RsmatchLong.rsMatch
   FROM RsmatchLong INNER JOIN RsmatchWide r6 ON RsmatchLong.snpID = r6.Set_6
   WHERE r6.Input_SNP = rFinal.Input_SNP) As Set_6,

  (SELECT RsmatchLong.rsMatch
   FROM RsmatchLong INNER JOIN RsmatchWide r7 ON RsmatchLong.snpID = r7.Set_7
   WHERE r7.Input_SNP = rFinal.Input_SNP) As Set_7

FROM RsMatchWide rFinal

[Screenshot: RsMatch in SQL]

R can even create the underlying tables and then run the query using RODBC:

library(RODBC) 

conn <-odbcDriverConnect('driver={Microsoft Access Driver (*.mdb, *.accdb)};
                          DBQ=C:\\PathTo\\Database.accdb')

# SAVING DATA FRAMES AS NEW DB TABLES
sqlSave(conn, RsMatchWide, append=FALSE, rownames=TRUE)
sqlSave(conn, RsMatchLong, append=FALSE, rownames=TRUE)

# CREATING DATA FRAME FROM QUERY, 
# QUERY STRING, strSQL, WILL BE SQL SELECT STATEMENT ABOVE
newdf <- sqlQuery(conn, strSQL)

close(conn) 

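The only challenge I foresee is scaling this to 10,000 sets. Excel has a column limit, and so do the various SQL databases. Consider splitting and merging in R.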
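Writing one subquery per set by hand quickly becomes impractical. As a rough sketch, the query string could be generated in R for however many sets fit within the database's column limit. The variable names n_sets, ids and subqueries are illustrative assumptions; the table and column names follow the query shown earlier in this answer.

```r
n_sets <- 7L
ids <- seq_len(n_sets)

# One correlated subquery per set; %1$d reuses the same set number throughout
subqueries <- sprintf(
  paste0(
    "  (SELECT RsmatchLong.rsMatch\n",
    "   FROM RsmatchLong INNER JOIN RsmatchWide r%1$d ",
    "ON RsmatchLong.snpID = r%1$d.Set_%1$d\n",
    "   WHERE r%1$d.Input_SNP = rFinal.Input_SNP) AS Set_%1$d"
  ),
  ids
)

# Assemble the full SELECT statement
strSQL <- paste0(
  "SELECT rFinal.Input_SNP,\n",
  paste(subqueries, collapse = ",\n"),
  "\nFROM RsMatchWide rFinal"
)

cat(strSQL, "\n")
```

The resulting strSQL can then be passed to sqlQuery() as in the RODBC snippet; for very wide data the sets would still need to be processed in batches.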

Answer 3: (score: 1)

Using data.table v1.9.5+ (see its installation instructions):

require(data.table) # v1.9.5+
setDT(dt)
setDT(key)
ids  = seq_len(7L) # or 10000L in your case
cols = paste("Set", ids, sep="_")
on   = "snpID"
for (i in ids) {
    names(on) = cols[i]
    dt[key[set == i], cols[i] := rsMatch, on = on]
}
dt[]

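Subsetting with key[set == i] should be very fast, because it uses binary search through automatic indexing on the set column. For each subset corresponding to i, we join the snpID column of the subsetted data.table with the corresponding Set_* column of dt, and update by reference (cols[i] := rsMatch) the column corresponding to cols[i].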

This should be both fast and memory efficient.