Question

I have three gene variables:

The first is c("DNAH8","F2RL2","F5","FAM3B","HOXC4");
The second is c("DNAH8","F2RL2","F5","FAM3B");
The third is c("F5", "GRIN3A","HMGCS2","HOXC4","KLK12");

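How can I merge them into one data frame with four variables, where the first variable contains all genes and the other three are matched against it row by row?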

Answer 1

Try this:

# all vectors in a list
l <- list(a1 = c("DNAH8","F2RL2","F5","FAM3B"),
          b1 = c("F5", "GRIN3A","HMGCS2","HOXC4","KLK12"),
          c1 = c("DNAH8","F2RL2","F5","FAM3B","HOXC4"))

# get unique genes
genes <- unique(unlist(l))

# compare genes with each vector in a list
cbind(genes, data.frame(lapply(l, function(i) as.numeric(genes %in% i))))
#    genes a1 b1 c1
# 1  DNAH8  1  0  1
# 2  F2RL2  1  0  1
# 3     F5  1  1  1
# 4  FAM3B  1  0  1
# 5 GRIN3A  0  1  0
# 6 HMGCS2  0  1  0
# 7  HOXC4  0  1  1
# 8  KLK12  0  1  0
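
As a cross-check on the 0/1 table above, the same membership matrix can be sketched with base R's stack() and table(). This is not part of the original answer. It assumes the list is named as above and that no gene is repeated within a vector (a repeated gene would give a count above 1). The names long, tab and wide are illustrative only.

```r
# same named list of gene vectors as above
l <- list(a1 = c("DNAH8","F2RL2","F5","FAM3B"),
          b1 = c("F5", "GRIN3A","HMGCS2","HOXC4","KLK12"),
          c1 = c("DNAH8","F2RL2","F5","FAM3B","HOXC4"))

# long format: one row per (gene, vector name) pair
long <- stack(l)

# cross-tabulate gene against vector name; each cell counts occurrences
tab <- table(gene = long$values, set = long$ind)

# plain data frame: genes as row names, one column per vector
wide <- as.data.frame.matrix(tab)
wide
```

Here the genes end up as row names rather than as a separate column, and rows are sorted by gene name rather than by first appearance.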

Or, if we want to show the gene names instead, try this:

# convert list of vectors into list of data.frames then merge
res <- Reduce(function(...) merge(..., by = "gene", all = TRUE),
              lapply(l, function(i) data.frame(gene = i, var = i)))

# update col names
colnames(res) <- c("allgene", paste0("geneset", seq_along(l)))

res
#    allgene geneset1 geneset2 geneset3
# 1    DNAH8    DNAH8     <NA>    DNAH8
# 2    F2RL2    F2RL2     <NA>    F2RL2
# 3       F5       F5       F5       F5
# 4    FAM3B    FAM3B     <NA>    FAM3B
# 5   GRIN3A     <NA>   GRIN3A     <NA>
# 6   HMGCS2     <NA>   HMGCS2     <NA>
# 7    HOXC4     <NA>    HOXC4    HOXC4
# 8    KLK12     <NA>    KLK12     <NA>
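
The same name-or-NA layout can also be sketched without merge(), by reusing the genes %in% i test from the first approach. This is an alternative under stated assumptions, not the original answer's method. The names matched and res2 are illustrative. Unlike merge(), it keeps genes in order of first appearance and does not duplicate rows if a gene is repeated within a vector.

```r
# same named list of gene vectors as above
l <- list(a1 = c("DNAH8","F2RL2","F5","FAM3B"),
          b1 = c("F5", "GRIN3A","HMGCS2","HOXC4","KLK12"),
          c1 = c("DNAH8","F2RL2","F5","FAM3B","HOXC4"))

# all distinct genes, in order of first appearance
genes <- unique(unlist(l))

# for each vector: the gene name where it is present, NA otherwise
matched <- lapply(l, function(i) ifelse(genes %in% i, genes, NA_character_))

# assemble the data frame and rename columns as in the merge() version
res2 <- data.frame(allgene = genes, matched, stringsAsFactors = FALSE)
colnames(res2) <- c("allgene", paste0("geneset", seq_along(l)))
res2
```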
