I have three gene variables:
The first is c("DNAH8","F2RL2","F5","FAM3B","HOXC4")
;
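The second is c("DNAH8","F2RL2","F5","FAM3B")
;
The third is c("F5", "GRIN3A","HMGCS2","HOXC4","KLK12")
;
How can I combine them into a data frame with four variables, where the first variable contains all genes and the other three variables are matched against it, as the transformed format?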
Answer 0 (score: 1)
Try this:
# all vectors in a list
l <- list(a1 = c("DNAH8","F2RL2","F5","FAM3B"),
          b1 = c("F5", "GRIN3A","HMGCS2","HOXC4","KLK12"),
          c1 = c("DNAH8","F2RL2","F5","FAM3B","HOXC4"))
# get unique genes
genes <- unique(unlist(l))
# compare genes with each vector in a list
cbind(genes, data.frame(lapply(l, function(i) as.numeric(genes %in% i))))
#    genes a1 b1 c1
# 1  DNAH8  1  0  1
# 2  F2RL2  1  0  1
# 3     F5  1  1  1
# 4  FAM3B  1  0  1
# 5 GRIN3A  0  1  0
# 6 HMGCS2  0  1  0
# 7  HOXC4  0  1  1
# 8  KLK12  0  1  0
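To see why this works, it may help to look at the `%in%` step on its own. The snippet below is a minimal sketch using a small illustrative subset of the genes; the names `genes_demo` and `set_demo` are made up for this example and are not part of the answer above.

```r
# Illustrative subset of genes and one hypothetical gene set
genes_demo <- c("DNAH8", "F2RL2", "F5", "GRIN3A")
set_demo   <- c("F5", "GRIN3A")

# %in% returns one TRUE/FALSE per element of the left-hand vector
in_set <- genes_demo %in% set_demo
in_set
# [1] FALSE FALSE  TRUE  TRUE

# as.numeric() turns the logical vector into a 0/1 indicator column
indicator <- as.numeric(in_set)
indicator
# [1] 0 0 1 1
```

Applying this to every vector in the list with `lapply` gives one indicator column per gene set.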
Alternatively, if we want to display the gene names, try this:
# convert list of vectors into list of data.frames then merge
res <- Reduce(function(...) merge(..., by = "gene", all = TRUE),
              lapply(l, function(i) data.frame(gene = i, var = i)))
# update col names
colnames(res) <- c("allgene", paste0("geneset", seq_along(l)))
res
#   allgene geneset1 geneset2 geneset3
# 1   DNAH8    DNAH8     <NA>    DNAH8
# 2   F2RL2    F2RL2     <NA>    F2RL2
# 3      F5       F5       F5       F5
# 4   FAM3B    FAM3B     <NA>    FAM3B
# 5  GRIN3A     <NA>   GRIN3A     <NA>
# 6  HMGCS2     <NA>   HMGCS2     <NA>
# 7   HOXC4     <NA>    HOXC4    HOXC4
# 8   KLK12     <NA>    KLK12     <NA>
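If the repeated `merge` feels heavy, the same layout can probably be built directly from the `%in%` test. This is only a sketch of an alternative, not what the answer above does; `res2` is a name introduced here for illustration, and the gene list is sorted to mimic the row order that `merge` produces.

```r
# Same list of gene vectors as in the answer
l <- list(a1 = c("DNAH8","F2RL2","F5","FAM3B"),
          b1 = c("F5", "GRIN3A","HMGCS2","HOXC4","KLK12"),
          c1 = c("DNAH8","F2RL2","F5","FAM3B","HOXC4"))

# All distinct genes, sorted alphabetically
genes <- sort(unique(unlist(l)))

# For each gene set, keep the gene name where it is present, NA otherwise
res2 <- data.frame(allgene = genes,
                   lapply(l, function(i) ifelse(genes %in% i, genes, NA)),
                   stringsAsFactors = FALSE)

# Rename the per-set columns to geneset1, geneset2, ...
names(res2)[-1] <- paste0("geneset", seq_along(l))
res2
```

This avoids the intermediate data frames, at the cost of assuming every vector contains only genes that should appear in the first column.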