Question

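I have this matrix with one column of GO terms, one column of genes enriched for each term, and one column with the log2 fold change of each gene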

GO_term      Gene_Name  Log2FC
cell adhesion   IGFBP7  1.38
cell adhesion   PVRL4   -1.40
cell adhesion   NCAM1   -1.35
cell-matrix adhesion    ITGA7   -1.20
cell-matrix adhesion    ITGA4   0.75
positive regulation of cell migration   ITGA5   -1.36
positive regulation of cell migration   RRAS2   -0.59
cellular oxidant detoxification FABP1   2.35
cellular oxidant detoxification LTC4S   -0.59
muscle contraction  ACTA2   -1.21
muscle contraction  VCL -1.06

How can I convert this matrix into something like the following?

> head(chord)
      cell adhesion cell-matrix adhesion positive regulation of cell migration cellular oxidant detoxification
PTK2                  0               1                       1
GNA13                 0               0                       1
LEPR                  0               0                       1
APOE                  0               0                       1
CXCR4                 0               0                       1
RECK                  0               0                       1
      muscle contraction      logFC
PTK2                1 -0.6527904
GNA13               1  0.3711599
LEPR                1  2.6539788
APOE                1  0.8698346
CXCR4               1 -2.5647537
RECK                1  3.6926860
>

That is, a binary matrix indicating which genes belong to each GO term, together with the corresponding logFC of each gene.

Answer 1

Here is some data

df = data.frame(
    row = sample(letters), col = sample(letters),
    stringsAsFactors = FALSE
)

Construct a matrix with the appropriate dimensions and dimnames

nrow = length(unique(df$row))
ncol = length(unique(df$col))
m = matrix(0, nrow, ncol, dimnames=list(unique(df$row), unique(df$col)))

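Then use the fact that indexing a matrix with a two-column matrix treats the two columns as row and column indices, and update the matching values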

m[as.matrix(df)] = 1
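
As a small illustration of the indexing rule used above, here is a minimal sketch on a hypothetical 2 x 3 matrix (the names `a`, `b`, `x`, `y`, `z` and the object `idx` are made up for the example). Each row of the two-column character matrix is read as one (row name, column name) pair.

```r
# A small zero matrix with named rows and columns (hypothetical names)
m <- matrix(0, nrow = 2, ncol = 3,
            dimnames = list(c("a", "b"), c("x", "y", "z")))

# Each row of idx is one (row name, column name) pair
idx <- cbind(c("a", "b", "b"),
             c("x", "x", "z"))

# Set exactly the three addressed cells to 1; all others stay 0
m[idx] <- 1
m
```

Only the cells named in `idx` are changed, which is why the approach works directly on a two-column data frame converted with `as.matrix()`.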

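It is not clear what you want to do with the log FC, since there may be more than one value per row and you have not described how you would like them to be summarized.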
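
For what it is worth, here is a sketch of how the same idea might be applied to the data in the question. Taking the mean of the log fold changes per gene is purely an assumption on my part (the question does not say how duplicates should be combined), and the object names `gene`, `genes`, `terms` and `chord` are just illustrative.

```r
# The data from the question, typed in by hand
gene <- data.frame(
  GO_term = c(rep("cell adhesion", 3),
              rep("cell-matrix adhesion", 2),
              rep("positive regulation of cell migration", 2),
              rep("cellular oxidant detoxification", 2),
              rep("muscle contraction", 2)),
  Gene_Name = c("IGFBP7", "PVRL4", "NCAM1", "ITGA7", "ITGA4", "ITGA5",
                "RRAS2", "FABP1", "LTC4S", "ACTA2", "VCL"),
  Log2FC = c(1.38, -1.40, -1.35, -1.20, 0.75, -1.36,
             -0.59, 2.35, -0.59, -1.21, -1.06),
  stringsAsFactors = FALSE
)

genes <- unique(gene$Gene_Name)
terms <- unique(gene$GO_term)

# Zero matrix: one row per gene, one column per GO term
m <- matrix(0, length(genes), length(terms), dimnames = list(genes, terms))

# Two-column character matrix of (gene, term) pairs marks the memberships
m[as.matrix(gene[, c("Gene_Name", "GO_term")])] <- 1

# ASSUMPTION: a gene listed under several terms gets the mean of its Log2FC
logFC <- sapply(split(gene$Log2FC, gene$Gene_Name), mean)

# Append the summary as a last column, in the same row order as m
chord <- cbind(m, logFC = logFC[genes])
head(chord)
```

If a different summary is wanted (first value, maximum absolute value, and so on), only the function passed to `sapply()` needs to change.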

Answer 2

Suppose you have a data file gene.txt like this

GO_term,Gene_Name,Log2FC
cell adhesion,IGFBP7,1.38
cell adhesion,PVRL4,-1.40
cell adhesion,NCAM1,-1.35
cell-matrix adhesion,ITGA7,-1.20
cell-matrix adhesion,ITGA4,0.75
positive regulation of cell migration,ITGA5,-1.36
positive regulation of cell migration,RRAS2,-0.59
cellular oxidant detoxification,FABP1,2.35
cellular oxidant detoxification,LTC4S,-0.59
muscle contraction,ACTA2,-1.21
muscle contraction,VCL,-1.06

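# stringsAsFactors = TRUE is needed on R >= 4.0, where read.csv no longer
# creates factors by default and levels() would return NULL
gene = read.csv("gene.txt", stringsAsFactors = TRUE)
golevels = levels(gene$GO_term)
genelevels = levels(gene$Gene_Name)
# one row per gene, one 0-filled column per GO term, plus a Log2FC column
ndf = data.frame(Gene_Name=genelevels)
for (g in golevels){
  ndf[[g]] = 0
}
ndf$Log2FC = 0
index = 1
nc = ncol(ndf)
for (gg in genelevels){
  # 1 where the gene is annotated with the GO term, 0 otherwise
  temp = as.integer(golevels %in% gene[gene$Gene_Name == gg,"GO_term"])
  ndf[index, -c(1,nc)] = temp
  # assuming each gene has a single Log2FC value; only the first one is kept
  ndf[index, "Log2FC"] = gene[gene$Gene_Name == gg, "Log2FC"][1]
  index = index + 1
}
# transform to matrix, using the gene names as row names
ndf$Gene_Name = NULL
m = as.matrix(ndf)
row.names(m) = genelevels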
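
A more compact variant of the same idea, shown only as a sketch: `table()` counts gene/term co-occurrences, and comparing the counts with 0 turns them into a 0/1 matrix. The inline data is a shortened copy of the file above so that the snippet runs without gene.txt, and the names `counts` and `first_fc` are illustrative. As in the loop, it assumes the first Log2FC of each gene is the one to keep.

```r
# Shortened copy of gene.txt, read from a string instead of a file
gene <- read.csv(text = "GO_term,Gene_Name,Log2FC
cell adhesion,IGFBP7,1.38
cell adhesion,PVRL4,-1.40
cell-matrix adhesion,ITGA7,-1.20
cell-matrix adhesion,ITGA4,0.75
muscle contraction,ACTA2,-1.21", stringsAsFactors = FALSE)

# Count how often each gene appears under each GO term
counts <- table(gene$Gene_Name, gene$GO_term)

# Clamp the counts to 0/1 and drop the table class to get a plain matrix
m <- (unclass(counts) > 0) * 1

# ASSUMPTION: keep the first Log2FC found for each gene
first_fc <- gene$Log2FC[match(rownames(m), gene$Gene_Name)]
m <- cbind(m, Log2FC = first_fc)
m
```

Rows and columns come out in alphabetical order here, which matches the ordering produced by `levels()` in the loop version.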
