I have a table with the following data:
Input_SNP Set_1 Set_2 Set_3 Set_4 Set_5 Set_6 Set_7
rs70812 4:12309 7:189029 2:2134 17:43232 12:51123 11:15123 19:4312
rs34812 5:61233 2:571022 1:57012 3:537012 14:57123 4:57129 1:61507
rs15602 1:571209 12:34120 9:41236 12:32417 3:57120 9:34123 3:41235
rs90143 7:83541 9:659123 5:23412 16:98234 18:472351 20:12357 1:13421
rs70823 14:89023 13:42081 8:32098 5:431332 9:234134 13:7831 2:74012
rs100980 11:51003 1:100098 10:409123 12:412309 13:34123 16:431098 3:58023
rs10341 18:90312 15:609123 1:70923 2:102358 5:019824 17:120394 9:80123
I actually have 10,000 sets and about 4,000 rows, but this is a good sample. I also have a second file:
set snpID rsMatch
1 4:12309 rs241984
2 7:189029 rs104141
3 2:2134 rs485506
4 17:43232 rs345180
5 12:51123 rs129819
6 11:15123 rs757492
7 19:4312 rs711403
1 5:61233 rs341098
2 2:571022 rs512309
3 1:57012 rs120394
4 3:537012 rs510293
5 14:571234 rs234098
6 4:57129 rs71302
7 1:61507 rs234109
1 1:571209 rs09384
... ... ...
I would like to replace the chromosome:position values in my Set_1, Set_2, Set_3, etc. columns with their rsMatch values, like this:
Input_SNP Set_1 Set_2 Set_3 Set_4 Set_5 Set_6 Set_7
rs70812 rs241984 rs104141 rs485506 rs345180 rs129819 rs757492 rs711403
rs34812 rs341098 rs512309 rs120394 rs510293 rs234098 rs71302 rs234109
rs15602 rs098384 ... ... ... ... ...
... ... ... ... ... ... ...
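Do you have any suggestions for how to do this? I was thinking of R data frames, but I am happy with anything...
Answer 0 (score: 2)
You should work on a copy, but I live dangerously and worked on the original. First, we need to match the values in the Set_n columns against the second input data frame: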
sapply(inp1[-1], match, inp2$snpID)
Set_1 Set_2 Set_3 Set_4 Set_5 Set_6 Set_7
[1,] 1 2 3 4 5 6 7
[2,] 8 9 10 11 NA 13 14
[3,] 15 NA NA NA NA NA NA
[4,] NA NA NA NA NA NA NA
[5,] NA NA NA NA NA NA NA
[6,] NA NA NA NA NA NA NA
[7,] NA NA NA NA NA NA NA
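You did not give us all the required values, so the NAs act as placeholders. These values are the index positions in the second data frame.
The next step is to replace each item with the lookup value from the rsMatch column. Note that the t() call below transposes the index matrix, which is why the result that follows comes out transposed: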
inp1[-1][] <- inp2$rsMatch[ t(sapply(inp1[-1], match, inp2$snpID)) ]
#----------------
> inp1
Input_SNP Set_1 Set_2 Set_3 Set_4 Set_5 Set_6 Set_7
1 rs70812 rs241984 rs341098 rs09384 <NA> <NA> <NA> <NA>
2 rs34812 rs104141 rs512309 <NA> <NA> <NA> <NA> <NA>
3 rs15602 rs485506 rs120394 <NA> <NA> <NA> <NA> <NA>
4 rs90143 rs345180 rs510293 <NA> <NA> <NA> <NA> <NA>
5 rs70823 rs129819 <NA> <NA> <NA> <NA> <NA> <NA>
6 rs100980 rs757492 rs71302 <NA> <NA> <NA> <NA> <NA>
7 rs10341 rs711403 rs234109 <NA> <NA> <NA> <NA> <NA>
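For comparison, here is a minimal sketch that skips the transposition and fills each Set_n column directly. It assumes inp1 and inp2 hold the two tables from the question as character columns; the sample rows are copied from the question, and the name fixed is only illustrative.

```r
# Sample rows copied from the question (first two rows, first three sets)
inp1 <- data.frame(
  Input_SNP = c("rs70812", "rs34812"),
  Set_1 = c("4:12309", "5:61233"),
  Set_2 = c("7:189029", "2:571022"),
  Set_3 = c("2:2134", "1:57012"),
  stringsAsFactors = FALSE
)
inp2 <- data.frame(
  set = c(1, 2, 3, 1, 2, 3),
  snpID = c("4:12309", "7:189029", "2:2134", "5:61233", "2:571022", "1:57012"),
  rsMatch = c("rs241984", "rs104141", "rs485506", "rs341098", "rs512309", "rs120394"),
  stringsAsFactors = FALSE
)

# Work on a copy and look up each Set_n column separately,
# so the result keeps the row/column orientation of inp1
fixed <- inp1
fixed[-1] <- lapply(inp1[-1], function(col) inp2$rsMatch[match(col, inp2$snpID)])
fixed
```

Values without a match become NA. Like the code above, this looks up snpID alone and does not consult the set column.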
Second attempt: the index could also be cbind(1.1 + (.9:nrow(inp2)) %/% 7, inp2$set + 1), which did succeed, but the approach shown below is a bit more robust.
out1 <- inp1
out1[cbind(rep(1:nrow(inp2), length = nrow(inp2), each = 7), inp2$set + 1)] <- inp2$rsMatch
> out1
Input_SNP Set_1 Set_2 Set_3 Set_4 Set_5 Set_6 Set_7
1 rs70812 rs241984 rs104141 rs485506 rs345180 rs129819 rs757492 rs711403
2 rs34812 rs341098 rs512309 rs120394 rs510293 rs234098 rs71302 rs234109
3 rs15602 rs09384 12:34120 9:41236 12:32417 3:57120 9:34123 3:41235
4 rs90143 7:83541 9:659123 5:23412 16:98234 18:472351 20:12357 1:13421
5 rs70823 14:89023 13:42081 8:32098 5:431332 9:234134 13:7831 2:74012
6 rs100980 11:51003 1:100098 10:409123 12:412309 13:34123 16:431098 3:58023
7 rs10341 18:90312 15:609123 1:70923 2:102358 5:019824 17:120394 9:80123
It seems to me that the request does not actually use the Input_SNP values in the matching.
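The first attempt looks up snpID without regard to the set column, and the second relies on inp2 being stored in blocks of seven rows. If the same snpID can occur in more than one set, one option is to match on the pair (set, snpID). The following is only a sketch under that assumption: the toy data is made up for illustration and is not from the question, and lookup_key and out are hypothetical names.

```r
# Made-up toy data: "1:100" occurs in both sets with different rsMatch values
inp1 <- data.frame(
  Input_SNP = c("rsA", "rsB"),
  Set_1 = c("1:100", "2:200"),
  Set_2 = c("1:100", "3:300"),
  stringsAsFactors = FALSE
)
inp2 <- data.frame(
  set = c(1, 1, 2, 2),
  snpID = c("1:100", "2:200", "1:100", "3:300"),
  rsMatch = c("rs111", "rs222", "rs333", "rs444"),
  stringsAsFactors = FALSE
)

# Composite key "set snpID" for every row of the lookup table
lookup_key <- paste(inp2$set, inp2$snpID)

out <- inp1
for (name in names(inp1)[-1]) {
  # Derive the set number from the column name, e.g. "Set_2" -> "2"
  set_no <- sub("^Set_", "", name)
  out[[name]] <- inp2$rsMatch[match(paste(set_no, inp1[[name]]), lookup_key)]
}
out
```

With this data, a lookup on snpID alone would return rs111 for the first cell of Set_2, whereas the composite key returns rs333.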
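Answer 1 (score: 2)
You can solve this with merge after a suitable transformation. I use library(reshape2) to get the data into the right shape for the merge and then to cast it back for the output.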
#read in files
df1<-read.table("file1",header=TRUE,stringsAsFactors=FALSE)
df2<-read.table("file2",header=TRUE,stringsAsFactors=FALSE)
library(reshape2)
m1<-melt(df1,id.vars="Input_SNP")
m2<-transform(df2,variable=paste0("Set_",set),value=snpID)
m<-merge(m1,m2)
out<-dcast(m,Input_SNP~variable,value.var="rsMatch")
print(out)
Input_SNP Set_1 Set_2 Set_3 Set_4 Set_5 Set_6 Set_7
1 rs15602 rs09384 <NA> <NA> <NA> <NA> <NA> <NA>
2 rs34812 rs341098 rs512309 rs120394 rs510293 <NA> rs71302 rs234109
3 rs70812 rs241984 rs104141 rs485506 rs345180 rs129819 rs757492 rs711403
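The output above contains only three rows because merge() performs an inner join by default: cells without a match are dropped, and an Input_SNP disappears entirely when none of its cells match. The sketch below keeps every row by passing all.x = TRUE. It is an illustration rather than the answer's own method: it uses base R's reshape() in place of reshape2 so it runs without extra packages, and the toy data and the names long, key2, and wide are assumptions.

```r
# Toy data: the second input row has no matches in the lookup table
df1 <- data.frame(
  Input_SNP = c("rs70812", "rs90143"),
  Set_1 = c("4:12309", "7:83541"),
  Set_2 = c("7:189029", "9:659123"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  set = c(1, 2),
  snpID = c("4:12309", "7:189029"),
  rsMatch = c("rs241984", "rs104141"),
  stringsAsFactors = FALSE
)

# Wide -> long by hand (equivalent in spirit to melt())
long <- data.frame(
  Input_SNP = rep(df1$Input_SNP, times = ncol(df1) - 1),
  variable = rep(names(df1)[-1], each = nrow(df1)),
  value = unlist(df1[-1], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Lookup table keyed by the same two columns
key2 <- data.frame(
  variable = paste0("Set_", df2$set),
  value = df2$snpID,
  rsMatch = df2$rsMatch,
  stringsAsFactors = FALSE
)

# Left join: unmatched cells are kept with rsMatch = NA
m <- merge(long, key2, by = c("variable", "value"), all.x = TRUE)

# Long -> wide, then restore the original column names and row order
wide <- reshape(m[c("Input_SNP", "variable", "rsMatch")],
                idvar = "Input_SNP", timevar = "variable", direction = "wide")
names(wide) <- sub("^rsMatch\\.", "", names(wide))
wide <- wide[match(df1$Input_SNP, wide$Input_SNP), names(df1)]
rownames(wide) <- NULL
wide
```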
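Answer 2 (score: 1)
Forgive me in advance, but I see Excel and SQL solutions here, since you are relating two different data sets (i.e., database tables or worksheets). Either solution can still be integrated as a data preparation step before importing into R. This is probably more for future readers than for the OP.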
Excel solution
A simple VLookup or Index/Match works (see the examples below, which use worksheets named RsmatchWide and RsmatchLong). IFERROR() is used to suppress #N/A errors.
=IFERROR(INDEX(RsmatchLong!$C$2:$C$16,
MATCH(RsmatchWide!B2,RsmatchLong!$B$2:$B$16, FALSE)), "")
=IFERROR(VLOOKUP(RsmatchWide!B2,RsmatchLong!$B$2:$C$16,2,FALSE),"")
Once ready, save the worksheet as csv and import it into R:
df <- read.csv("C:/Path/To/RsMatchDataset.csv")
SQL solution
Run a select query with one subquery per set (the example below uses MS Access, but it should work in any SQL dialect, including SQLite, MySQL, SQL Server, etc.):
SELECT rFinal.Input_SNP,
(SELECT RsmatchLong.rsMatch
FROM RsmatchLong INNER JOIN RsmatchWide r1 ON RsmatchLong.snpID = r1.Set_1
WHERE r1.Input_SNP = rFinal.Input_SNP) As Set_1,
(SELECT RsmatchLong.rsMatch
FROM RsmatchLong INNER JOIN RsmatchWide r2 ON RsmatchLong.snpID = r2.Set_2
WHERE r2.Input_SNP = rFinal.Input_SNP) As Set_2,
(SELECT RsmatchLong.rsMatch
FROM RsmatchLong INNER JOIN RsmatchWide r3 ON RsmatchLong.snpID = r3.Set_3
WHERE r3.Input_SNP = rFinal.Input_SNP) As Set_3,
(SELECT RsmatchLong.rsMatch
FROM RsmatchLong INNER JOIN RsmatchWide r4 ON RsmatchLong.snpID = r4.Set_4
WHERE r4.Input_SNP = rFinal.Input_SNP) As Set_4,
(SELECT RsmatchLong.rsMatch
FROM RsmatchLong INNER JOIN RsmatchWide r5 ON RsmatchLong.snpID = r5.Set_5
WHERE r5.Input_SNP = rFinal.Input_SNP) As Set_5,
(SELECT RsmatchLong.rsMatch
FROM RsmatchLong INNER JOIN RsmatchWide r6 ON RsmatchLong.snpID = r6.Set_6
WHERE r6.Input_SNP = rFinal.Input_SNP) As Set_6,
(SELECT RsmatchLong.rsMatch
FROM RsmatchLong INNER JOIN RsmatchWide r7 ON RsmatchLong.snpID = r7.Set_7
WHERE r7.Input_SNP = rFinal.Input_SNP) As Set_7
FROM RsMatchWide rFinal
R itself can even create the underlying tables and then run the query using RODBC:
library(RODBC)
conn <-odbcDriverConnect('driver={Microsoft Access Driver (*.mdb, *.accdb)};
DBQ=C:\\PathTo\\Database.accdb')
# SAVING DATA FRAMES AS NEW DB TABLES
sqlSave(conn, RsMatchWide, append=FALSE, rownames=TRUE)
sqlSave(conn, RsMatchLong, append=FALSE, rownames=TRUE)
# CREATING DATA FRAME FROM QUERY,
# QUERY STRING, strSQL, WILL BE SQL SELECT STATEMENT ABOVE
newdf <- sqlQuery(conn, strSQL)
close(conn)
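The only challenge I foresee is scaling this to 10,000 sets. Excel has a column limit (16,384 per worksheet in .xlsx), and so do SQL databases (MS Access allows 255 fields per table). Consider splitting and merging in R.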
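For a moderate number of sets, the query string strSQL does not have to be typed by hand. As a rough sketch, it can be generated in R from the pattern of the query above; n_sets and subqueries are illustrative names, and the table and column names are taken from that query.

```r
# Number of Set_n columns to cover (assumption: 7, as in the sample data)
n_sets <- 7

# One correlated subquery per set, following the pattern of the query above
subqueries <- sprintf(
  paste0(
    "(SELECT RsmatchLong.rsMatch ",
    "FROM RsmatchLong INNER JOIN RsmatchWide r%1$d ",
    "ON RsmatchLong.snpID = r%1$d.Set_%1$d ",
    "WHERE r%1$d.Input_SNP = rFinal.Input_SNP) AS Set_%1$d"
  ),
  seq_len(n_sets)
)

# Assemble the full SELECT statement
strSQL <- paste0(
  "SELECT rFinal.Input_SNP,\n",
  paste(subqueries, collapse = ",\n"),
  "\nFROM RsmatchWide rFinal"
)
cat(strSQL)
```

This does not lift the column limits of the database, so for thousands of sets the query would still have to be run in chunks.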
Answer 3 (score: 1)
Using data.table v1.9.5 (installation instructions here):
require(data.table) # v1.9.5+
setDT(dt)
setDT(key)
ids = seq_len(7L) # or 10000L in your case
cols = paste("Set", ids, sep="_")
on = "snpID"
for (i in ids) {
names(on) = cols[i]
dt[key[set == i], cols[i] := rsMatch, on = on]
}
dt[]
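The key[set == i] subset should be very fast, because it uses binary search through automatic indexing on the set column. For each subset corresponding to i, we join its snpID column against the corresponding Set_* column of dt and update that column by reference for the matching rows (cols[i] := rsMatch).
This should be both fast and memory efficient.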
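To try the loop on concrete data, here is a minimal self-contained sketch using the first two rows and two sets from the question. The table name lookup (instead of key) and the loop variable k are illustrative choices, and it assumes a data.table version that supports the on= argument (v1.9.6 or later on CRAN).

```r
library(data.table)

# First two rows and two sets from the question
dt <- data.table(
  Input_SNP = c("rs70812", "rs34812"),
  Set_1 = c("4:12309", "5:61233"),
  Set_2 = c("7:189029", "2:571022")
)
lookup <- data.table(
  set = c(1L, 2L, 1L, 2L),
  snpID = c("4:12309", "7:189029", "5:61233", "2:571022"),
  rsMatch = c("rs241984", "rs104141", "rs341098", "rs512309")
)

for (k in 1:2) {
  col <- paste0("Set_", k)
  # Join dt[[col]] to lookup$snpID within set k and overwrite matched rows in place
  dt[lookup[set == k], (col) := i.rsMatch, on = setNames("snpID", col)]
}
dt[]
```

Rows of dt without a match in the subset keep their original chromosome:position value, because the update only touches matched rows.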