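I have a dataset that I'm trying to sort by the duplicated IDs in one column (the rssnp1 column), but online I can only find the duplicated function for removing duplicates.
My data looks like this: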
Chr Start End rssnp1 Type gene
1 1244733 1244734 rs2286773 LD_SNP ACE
1 1257536 1257436 rs301159 LD_SNP CPEB4
1 1252336 1252336 rs2286773 Sentinel CPEB4
1 1252343 1252343 rs301159 LD_SNP CPEB4
1 1254841 1254841 rs301159 LD_SNP CPEB4
1 1256703 1267404 rs301159 LD_SNP CPEB4
1 1269246 1269246 rs301159 LD_SNP CPEB4
1 1370168 1370168 rs301159 LD_SNP GLUPA1
1 1371824 1371824 rs301159 LD_SNP GLUPA1
1 1372591 1372591 rs301159 LD_SNP GLUPA1
My desired output is:
Chr Start End rssnp1 Type gene
1 1244733 1244734 rs2286773 LD_SNP ACE
1 1252336 1252336 rs2286773 Sentinel CPEB4
1 1257536 1257436 rs301159 LD_SNP CPEB4
1 1252343 1252343 rs301159 LD_SNP CPEB4
1 1254841 1254841 rs301159 LD_SNP CPEB4
1 1256703 1267404 rs301159 LD_SNP CPEB4
1 1269246 1269246 rs301159 LD_SNP CPEB4
1 1370168 1370168 rs301159 LD_SNP GLUPA1
1 1371824 1371824 rs301159 LD_SNP GLUPA1
1 1372591 1372591 rs301159 LD_SNP GLUPA1
To reproduce the data, use:
df <- structure(list(Chr = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), Start = c(1244733,
1257536, 1252336, 1252343, 1254841, 1256703, 1269246, 1370168,
1371824, 1372591), End = c(1244734, 1257436, 1252336, 1252343,
1254841, 1267404, 1269246, 1370168, 1371824, 1372591), rssnp1 = c("rs2286773",
"rs301159", "rs2286773", "rs301159", "rs301159", "rs301159",
"rs301159", "rs301159", "rs301159", "rs301159"), Type = c("LD_SNP",
"LD_SNP", "Sentinel", "LD_SNP", "LD_SNP", "LD_SNP", "LD_SNP",
"LD_SNP", "LD_SNP", "LD_SNP"), gene = c("ACE", "CPEB4", "CPEB4",
"CPEB4", "CPEB4", "CPEB4", "CPEB4", "GLUPA1", "GLUPA1", "GLUPA1"
)), .Names = c("Chr", "Start", "End", "rssnp1", "Type", "gene"
), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))
I have tried:
target_order <- c("a", "b", "c")
df[order(match(df$rssnp1)), target_order]
with target_order holding each unique value instead of c("a", "b", "c"), so I have something like c("rs2286773", "rs301159", ...) that goes on for the hundreds of IDs I have. But this results in an error:
Error in `[.data.frame`(df, order(match(df$rssnp1)), target_order) :
undefined columns selected
Is there another way to do this?
Edit:
target_order needed to be in a different part of the code:
df[order(match(df$rssnp1, target_order)), ]
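However, this is still a tedious way of doing it for me. Is there a more efficient way to sort by duplicates?
Answer 0 (score: 1)
From what I understand of your description, you want the result to follow a specific order, target_order, that is computed elsewhere. This can be done with a merge operation.
Assume you have the following order.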
target_order <- c("rs301159", "rs2286773")
dt <- structure(list(Chr = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), Start = c(1244733,
1257536, 1252336, 1252343, 1254841, 1256703, 1269246, 1370168,
1371824, 1372591), End = c(1244734, 1257436, 1252336, 1252343,
1254841, 1267404, 1269246, 1370168, 1371824, 1372591), rssnp1 = c("rs2286773",
"rs301159", "rs2286773", "rs301159", "rs301159", "rs301159",
"rs301159", "rs301159", "rs301159", "rs301159"), Type = c("LD_SNP",
"LD_SNP", "Sentinel", "LD_SNP", "LD_SNP", "LD_SNP", "LD_SNP",
"LD_SNP", "LD_SNP", "LD_SNP"), gene = c("ACE", "CPEB4", "CPEB4",
"CPEB4", "CPEB4", "CPEB4", "CPEB4", "GLUPA1", "GLUPA1", "GLUPA1"
)), .Names = c("Chr", "Start", "End", "rssnp1", "Type", "gene"
), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))
The code below should produce what you need.
library(data.table)
setDT(dt)
# Setting sort=FALSE to persist the order in target_order
merge(as.data.table(target_order), dt, by.y="rssnp1", by.x="target_order", sort=FALSE)
# target_order Chr Start End Type gene
# 1: rs301159 1 1257536 1257436 LD_SNP CPEB4
# 2: rs301159 1 1252343 1252343 LD_SNP CPEB4
# 3: rs301159 1 1254841 1254841 LD_SNP CPEB4
# 4: rs301159 1 1256703 1267404 LD_SNP CPEB4
# 5: rs301159 1 1269246 1269246 LD_SNP CPEB4
# 6: rs301159 1 1370168 1370168 LD_SNP GLUPA1
# 7: rs301159 1 1371824 1371824 LD_SNP GLUPA1
# 8: rs301159 1 1372591 1372591 LD_SNP GLUPA1
# 9: rs2286773 1 1244733 1244734 LD_SNP ACE
# 10: rs2286773 1 1252336 1252336 Sentinel CPEB4
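If typing out target_order by hand for hundreds of IDs is the tedious part, it can probably be derived from the data itself. The following is a minimal base R sketch, not part of the answer above: it uses a cut-down stand-in for the question's data, and the names sorted, by_freq and sorted_by_freq are made up for illustration. It assumes the wanted order is "each ID in order of first appearance", which is what the desired output in the question looks like.

```r
# Small stand-in for the question's data (same rssnp1 values, fewer rows).
df <- data.frame(
  Start  = c(1244733, 1257536, 1252336, 1252343, 1370168),
  rssnp1 = c("rs2286773", "rs301159", "rs2286773", "rs301159", "rs301159"),
  Type   = c("LD_SNP", "LD_SNP", "Sentinel", "LD_SNP", "LD_SNP"),
  stringsAsFactors = FALSE
)

# Assumption: the target order is each ID in order of first appearance.
target_order <- unique(df$rssnp1)

# match() maps every row to the position of its ID in target_order.
# order() leaves tied rows in their original order, so rows that share
# an ID stay in the sequence they had in df.
sorted <- df[order(match(df$rssnp1, target_order)), ]

# Variant (also an assumption about what is wanted): most frequent ID first.
by_freq <- names(sort(table(df$rssnp1), decreasing = TRUE))
sorted_by_freq <- df[order(match(df$rssnp1, by_freq)), ]
```

Unlike the merge above, this keeps the rssnp1 column name unchanged and does not need data.table.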