Cast using one variable as the column names and another as the source of values in R

Time: 2014-02-13 14:57:35

Tags: r reshape2

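I have this dataset, which I would like to reshape so that there is one row per ID.name. Canonical_Hugo_Symbol should supply the column names and Canonical_Protein_Change the cell values. It would be great if the empty cells contained 0 instead of NA.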

mydata.df <- data.frame(ID.name = c("1000", "1000", "1000", "1001","1001","1001","1002","1002" ), Canonical_Protein_Change = c("p.Y1467H", "p.R1466W", "p.*427Q", "p.V320fs","p.S5383fs","p.D519V","p.S51A", "p.K183_splice" ), Canonical_Hugo_Symbol = c("gene1", "gene3", "gene1", "gene1","gene3","gene4","gene1", "gene2" ))

I melted it:

ff.melt <- melt(mydata.df, id.var = c("ID.name", "Canonical_Hugo_Symbol"))

ff.melt
 ID.name Canonical_Hugo_Symbol                 variable         value
1    1000                 gene1 Canonical_Protein_Change      p.Y1467H
2    1000                 gene3 Canonical_Protein_Change      p.R1466W
3    1000                 gene1 Canonical_Protein_Change       p.*427Q
4    1001                 gene1 Canonical_Protein_Change      p.V320fs
5    1001                 gene3 Canonical_Protein_Change     p.S5383fs
6    1001                 gene4 Canonical_Protein_Change       p.D519V
7    1002                 gene1 Canonical_Protein_Change        p.S51A
8    1002                 gene2 Canonical_Protein_Change p.K183_splice

Then I recast it:

ff.cast <- dcast(ff.melt, ID.name ~ Canonical_Hugo_Symbol + value)

and got this data frame:

ff.cast
  ID.name gene1_p.*427Q gene1_p.S51A gene1_p.V320fs gene1_p.Y1467H gene2_p.K183_splice gene3_p.R1466W gene3_p.S5383fs
1    1000       p.*427Q         <NA>           <NA>       p.Y1467H                <NA>       p.R1466W            <NA>
2    1001          <NA>         <NA>       p.V320fs           <NA>                <NA>           <NA>       p.S5383fs
3    1002          <NA>       p.S51A           <NA>           <NA>       p.K183_splice           <NA>            <NA>
  gene4_p.D519V
1          <NA>
2       p.D519V
3          <NA>

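This is close to what I want, but each gene now has many columns with different names. For example, I would like gene1_p.*427Q, gene1_p.S51A, gene1_p.V320fs, and gene1_p.Y1467H to all go into a single column.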

I also tried:

dcast(mydata.df, ID.name ~ Canonical_Hugo_Symbol, value_var = "Canonical_Protein_Change" )

but I got this error message:

Error in .fun(.value[0], ...) : 2 arguments passed to 'length' which requires 1

I would like to get this table, or something similar. Thanks!

  ID.name    gene1  gene2     gene3   gene4
1    1000  p.*427Q      0  p.R1466W       0
2    1001 p.V320fs      0 p.S5383fs p.D519V
3    1002   p.S51A p.K183         0       0

I get closer when I try this, but the column names are wrong:

mydata.reshape <- reshape(mydata.df, direction = 'wide', idvar = 'ID.name', timevar = 'Canonical_Hugo_Symbol')

I fixed the names:

colnames(mydata.reshape) <- sub("Canonical_Protein_Change.(.*?)","\\1",  colnames(mydata.reshape))

but the NAs are still there.

1 answer:

Answer 0: (score: 2)

You can try this:

library(reshape2)

# concatenate values in cells with more than one value
dcast(mydata.df, ID.name ~ Canonical_Hugo_Symbol, value.var = "Canonical_Protein_Change",
      fun.aggregate = function(x) paste(x, collapse = "; "), fill = "0")

#   ID.name             gene1         gene2     gene3   gene4
# 1    1000 p.Y1467H; p.*427Q             0  p.R1466W       0
# 2    1001          p.V320fs             0 p.S5383fs p.D519V
# 3    1002            p.S51A p.K183_splice         0       0

# ...or pick the first value in cells with more than one value
dcast(mydata.df, ID.name ~ Canonical_Hugo_Symbol, value.var = "Canonical_Protein_Change",
      fun.aggregate = head, 1, fill = "0")
#   ID.name    gene1         gene2     gene3   gene4
# 1    1000 p.Y1467H             0  p.R1466W       0
# 2    1001 p.V320fs             0 p.S5383fs p.D519V
# 3    1002   p.S51A p.K183_splice         0       0
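
For comparison, the concatenating variant can also be sketched in base R, without reshape2. This is a swapped-in technique (tapply over the two grouping columns), not part of the answer above. It assumes the columns are plain character vectors, and the names `wide` and `result` are made up for illustration.

```r
# Same data as in the question; stringsAsFactors = FALSE keeps the columns
# as character vectors regardless of the R version.
mydata.df <- data.frame(
  ID.name = c("1000", "1000", "1000", "1001", "1001", "1001", "1002", "1002"),
  Canonical_Protein_Change = c("p.Y1467H", "p.R1466W", "p.*427Q", "p.V320fs",
                               "p.S5383fs", "p.D519V", "p.S51A", "p.K183_splice"),
  Canonical_Hugo_Symbol = c("gene1", "gene3", "gene1", "gene1",
                            "gene3", "gene4", "gene1", "gene2"),
  stringsAsFactors = FALSE
)

# Build an ID x gene character matrix; cells with several values are
# collapsed into one "; "-separated string.
wide <- with(mydata.df,
             tapply(Canonical_Protein_Change,
                    list(ID.name, Canonical_Hugo_Symbol),
                    function(x) paste(x, collapse = "; ")))

# ID/gene combinations that never occur come back as NA; replace them with "0".
wide[is.na(wide)] <- "0"

# Turn the matrix into a data frame with ID.name as an ordinary column.
result <- data.frame(ID.name = rownames(wide), wide,
                     row.names = NULL, stringsAsFactors = FALSE)
result
```

Note that the dcast argument is spelled `value.var`. The `value_var` spelling used in the question is not a dcast argument, which is likely why that call ended in the `length` error.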