Reshaping data with missing values

Time: 2014-09-29 15:36:30

Tags: r

I have a dataset similar to this:

Id  Disease     Gene    Mutation    Expression
101 Disease_X   Gene_A  R273G       Normal
101 Disease_X   GENE_B  G12D        Normal
102 Disease_Y   GENE_C  L858R       High

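I would like to reshape it so that every Id and gene pair is represented for both Mutation and Expression, even when no value exists.

For example, each Id would have 6 rows (3 genes for Mutation, 3 for Expression). If the original table has no Mutation or Expression value for a pair, the output would supply a standard placeholder for that row (for example, "No Mutation Data"). The output table would look like this: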

Id  Disease     Type        Gene    Value
101 Disease_X   Mutation    Gene_A  R273G
101 Disease_X   Mutation    GENE_B  G12D
101 Disease_X   Mutation    GENE_C  No Mutation Data
101 Disease_X   Expression  Gene_A  Normal
101 Disease_X   Expression  GENE_B  Normal
101 Disease_X   Expression  GENE_C  No Expression Data
102 Disease_Y   Mutation    Gene_A  No Mutation Data
102 Disease_Y   Mutation    GENE_B  No Mutation Data
102 Disease_Y   Mutation    GENE_C  L858R
102 Disease_Y   Expression  Gene_A  No Expression Data
102 Disease_Y   Expression  GENE_B  No Expression Data
102 Disease_Y   Expression  GENE_C  High

I know there is an easy way to do this (using merge or melt?), but I haven't come up with anything simple and straightforward.

1 Answer:

Answer 0 (score: 3)

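You need to take a few extra steps to get what you are looking for.

In the following, I first make all combinations of "Id", "Type" (the column named "variable" in the code), and "Gene", merge that with the "long" form of the dataset, and then fix the "Disease" column.

I have left the missing values as NA, because that seems to make more sense to me if you need to keep working with the data.

This assumes you are starting with a dataset named "mydf".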
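If you want to follow along without the real data, here is a minimal sketch that rebuilds "mydf" from the three sample rows in the question. The column types (numeric Id, character for everything else) are an assumption, since the question only shows a printed table.

```r
# Hypothetical reconstruction of the question's sample table.
# Column types are assumed; the real data may differ.
mydf <- data.frame(
  Id         = c(101, 101, 102),
  Disease    = c("Disease_X", "Disease_X", "Disease_Y"),
  Gene       = c("Gene_A", "GENE_B", "GENE_C"),
  Mutation   = c("R273G", "G12D", "L858R"),
  Expression = c("Normal", "Normal", "High"),
  stringsAsFactors = FALSE  # keep text columns as character
)

str(mydf)
```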

library(data.table)
library(reshape2)

DT <- as.data.table(mydf)                                ## Convert to data.table
DTL <- melt(DT, id.vars = c("Id", "Disease", "Gene"))    ## Make it long
groups <- c("Id", "Gene", "variable")                    ## Save some typing
toMerge <- do.call(CJ, lapply(DTL[, groups,              ## Generate the combos
                                  with = FALSE], unique))
merged <- merge(DTL, toMerge, by = groups, all = TRUE)   ## merge
merged[, Disease := unique(na.omit(Disease)), by = Id][] ## Fill in Disease
#      Id   Gene   variable   Disease  value
#  1: 101 GENE_B   Mutation Disease_X   G12D
#  2: 101 GENE_B Expression Disease_X Normal
#  3: 101 GENE_C   Mutation Disease_X     NA
#  4: 101 GENE_C Expression Disease_X     NA
#  5: 101 Gene_A   Mutation Disease_X  R273G
#  6: 101 Gene_A Expression Disease_X Normal
#  7: 102 GENE_B   Mutation Disease_Y     NA
#  8: 102 GENE_B Expression Disease_Y     NA
#  9: 102 GENE_C   Mutation Disease_Y  L858R
# 10: 102 GENE_C Expression Disease_Y   High
# 11: 102 Gene_A   Mutation Disease_Y     NA
# 12: 102 Gene_A Expression Disease_Y     NA
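
If you do want the placeholder text from the question instead of NA, one possible follow-up is sketched below. It repeats the same melt, cross join, and merge steps so that it runs on its own, using data.table's own melt method rather than reshape2. The sample "mydf", the column names "Type" and "Value", and the exact placeholder wording are assumptions taken from the question's desired output, not part of the answer above.

```r
library(data.table)

# Sample data copied from the question (an assumption about the real `mydf`)
mydf <- data.frame(
  Id         = c(101, 101, 102),
  Disease    = c("Disease_X", "Disease_X", "Disease_Y"),
  Gene       = c("Gene_A", "GENE_B", "GENE_C"),
  Mutation   = c("R273G", "G12D", "L858R"),
  Expression = c("Normal", "Normal", "High"),
  stringsAsFactors = FALSE
)

DT <- as.data.table(mydf)

# Long form, naming the melted columns to match the desired output
DTL <- melt(DT, id.vars = c("Id", "Disease", "Gene"),
            variable.name = "Type", value.name = "Value",
            variable.factor = FALSE)

# All Id x Type x Gene combinations
combos <- CJ(Id = unique(DTL$Id),
             Type = unique(DTL$Type),
             Gene = unique(DTL$Gene))

# Full outer merge, then fill in Disease within each Id
merged <- merge(DTL, combos, by = c("Id", "Type", "Gene"), all = TRUE)
merged[, Disease := unique(na.omit(Disease)), by = Id]

# Replace each NA with a placeholder built from its Type
merged[is.na(Value), Value := paste("No", Type, "Data")]

# Reorder columns to match the layout requested in the question
setcolorder(merged, c("Id", "Disease", "Type", "Gene", "Value"))
print(merged)
```

Keeping NA until the very last step means you can still use is.na() for filtering or counting, and only swap in the text when you are ready to present the table.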