Reorganizing data in R

Time: 2016-08-30 13:28:56

Tags: r

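I have a problem that I cannot solve on my own. My data can be downloaded from http://www.filedropper.com/data_31 (a small txt file with information about Pathway, Seqs in Pathway, Enzyme, Enzyme ID, Seqs of Enzyme, and Seqs Pathway ID).
I want to reshape/reorganize my data so that it looks like this: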

NODE_1114.... map00592 alpha-Linolenic acid metabolism map01040 Biosynthesis of unsaturated fatty acids  
NODE_11280... map00592 alpha-Linolenic acid metabolism NA       NA  
NODE_1307.... NA       NA                              map01040 Biosynthesis of unsaturated fatty acids  

The problem is that I don't know how to reorganize my data from this:

NODE_12982_length_530_cov_49.8358_ID_25963,NODE_24530_length_385_cov_7.38485_ID_49059,NODE_44451_length_263_cov_34.6298_ID_88901,NODE_19986_length_437_cov_5.82461_ID_39971,
NODE_28195_length_354_cov_77.194_ID_56389

to this:

NODE_12982_length_530_cov_49.8358_ID_25963
NODE_24530_length_385_cov_7.38485_ID_49059
NODE_44451_length_263_cov_34.6298_ID_88901
NODE_19986_length_437_cov_5.82461_ID_39971
NODE_28195_length_354_cov_77.194_ID_56389

and how to add the additional Pathway and Pathway ID information to each of the Seqs (NODE ...).

Thanks for your help!

EDITED

Thanks Imo & nilsole for your answers, but you missed the point. Here is the code for my data:

Pathway<-rep(c("alpha-Linolenic acid metabolism","Biosynthesis of unsaturated fatty acids"), each=5)
Seq<-c("NODE_12982_length_530_cov_49.8358_ID_25963, NODE_24530_length_385_cov_7.38485_ID_49059, NODE_44451_length_263_cov_34.6298_ID_88901, NODE_19986_length_437_cov_5.82461_ID_39971, NODE_28195_length_354_cov_77.194_ID_56389","NODE_8410_length_627_cov_229.406_ID_16819, NODE_3911_length_812_cov_32.037_ID_7821, NODE_13098_length_528_cov_13.4376_ID_26195, NODE_956_length_1151_cov_11.6797_ID_1911, NODE_4501_length_777_cov_61.2355_ID_9001, NODE_60851_length_208_cov_61.9935_ID_121701, NODE_50593_length_239_cov_608.397_ID_101185, NODE_29294_length_345_cov_1.22069_ID_58587, NODE_57887_length_216_cov_22.6087_ID_115773, NODE_14782_length_501_cov_3.03139_ID_29563, NODE_18662_length_451_cov_798.495_ID_37323, NODE_26461_length_368_cov_3.02556_ID_52921, NODE_56026_length_221_cov_2.91566_ID_112051, NODE_12405_length_540_cov_270.652_ID_24809, NODE_2990_length_874_cov_45.3675_ID_5979, NODE_4753_length_763_cov_7.11864_ID_9505, NODE_17275_length_467_cov_4.0267_ID_34549, NODE_21751_length_416_cov_41.4155_ID_43501, NODE_53355_length_230_cov_19.48_ID_106709, NODE_49191_length_244_cov_1.51852_ID_98381"
       ,"NODE_61001_length_208_cov_76.3987_ID_122001, NODE_14350_length_507_cov_66.9845_ID_28699, NODE_16148_length_482_cov_189.293_ID_32295, NODE_42206_length_273_cov_135.404_ID_84411, NODE_11280_length_561_cov_335.174_ID_22559, NODE_21858_length_415_cov_31.0306_ID_43715, NODE_824_length_1186_cov_6.48364_ID_1647, NODE_41473_length_276_cov_2.73303_ID_82945, NODE_46025_length_257_cov_166.455_ID_92049",
       "NODE_32320_length_325_cov_56.6037_ID_64639, NODE_38741_length_289_cov_27.1795_ID_77481, NODE_9047_length_611_cov_12.6511_ID_18093, NODE_1114_length_1113_cov_24.6059_ID_2227, NODE_47802_length_250_cov_40.8513_ID_95603, NODE_60092_length_210_cov_142.471_ID_120183, NODE_28312_length_353_cov_8.38926_ID_56623",
       "NODE_4925_length_754_cov_2.56509_ID_9849, NODE_16010_length_484_cov_322.536_ID_32019, NODE_51261_length_237_cov_33.9011_ID_102521, NODE_19986_length_437_cov_5.82461_ID_39971, NODE_1384_length_1058_cov_1.86939_ID_2767",
       "NODE_12982_length_530_cov_49.8358_ID_25963, NODE_24530_length_385_cov_7.38485_ID_49059, NODE_44451_length_263_cov_34.6298_ID_88901, NODE_19986_length_437_cov_5.82461_ID_39971, NODE_28195_length_354_cov_77.194_ID_56389",
       "NODE_32320_length_325_cov_56.6037_ID_64639, NODE_38741_length_289_cov_27.1795_ID_77481, NODE_9047_length_611_cov_12.6511_ID_18093, NODE_1114_length_1113_cov_24.6059_ID_2227, NODE_47802_length_250_cov_40.8513_ID_95603, NODE_60092_length_210_cov_142.471_ID_120183, NODE_28312_length_353_cov_8.38926_ID_56623",
       "NODE_1114_length_1113_cov_24.6059_ID_2227, NODE_28195_length_354_cov_77.194_ID_56389",
       "NODE_1307_length_1072_cov_19.1504_ID_2613, NODE_3418_length_843_cov_15.3959_ID_6835","NODE_4925_length_754_cov_2.56509_ID_9849, NODE_16010_length_484_cov_322.536_ID_32019, NODE_51261_length_237_cov_33.9011_ID_102521, NODE_19986_length_437_cov_5.82461_ID_39971, NODE_1384_length_1058_cov_1.86939_ID_2767")
Pathway_ID<-rep(c("map00592","map01040"),each=5)
df<-data.frame(Pathway,Seq,Pathway_ID)

The data looks like this:

Pathway                                 Seq                       Pathway_ID  
alpha-Linolenic acid metabolism         NODE_12982...,NODE_8410.. map00592
alpha-Linolenic acid metabolism         NODE....,NODE...          map00592
alpha-Linolenic acid metabolism         NODE....,NODE...          map00592
alpha-Linolenic acid metabolism         NODE....,NODE...          map00592
alpha-Linolenic acid metabolism         NODE....,NODE...          map00592
Biosynthesis of unsaturated fatty acids NODE....,NODE...          map01040
Biosynthesis of unsaturated fatty acids NODE....,NODE...          map01040
Biosynthesis of unsaturated fatty acids NODE....,NODE...          map01040
Biosynthesis of unsaturated fatty acids NODE....,NODE...          map01040
Biosynthesis of unsaturated fatty acids NODE....,NODE...          map01040

I want it to look like this:

NODE_1114.... map00592 alpha-Linolenic acid metabolism map01040 Biosynthesis of unsaturated fatty acids  
NODE_11280... map00592 alpha-Linolenic acid metabolism NA       NA  
NODE_1307.... NA       NA                              map01040 Biosynthesis of unsaturated fatty acids  

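With the strsplit function I lose the information about which Pathway and Pathway_ID each NODE... belongs to. The Seq column holds a different number of NODEs in each row, and in this case some NODEs belong to both pathways. I want both pathways to be assigned to such a NODE, like this: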

NODE_1114.... map00592 alpha-Linolenic acid metabolism map01040 Biosynthesis of unsaturated fatty acids

I hope you can help me! Thanks!

2 Answers:

Answer 0 (score: 2)

I found a solution with the help of the reshape2 package and base functions such as cbind, unique, and subset.

df<-data.frame(Pathway=rep(c("acid_metabolism", "fatty acids biosynthesis"), each=5),
               # each string stays on one line so no newline ends up inside a value
               Seq=c("Contig_A, Contig_B, Contig_C, Contig_D",
                     "Contig_C, Contig_E, Contig_F,",
                     "Contig_D, Contig_F, Contig_G, Contig_H, Contig_I, Contig_J, Contig_K",
                     "Contig_C, Contig_D",
                     "Contig_H, Contig_I, Contig_J",
                     "Contig_H, Contig_I, Contig_L, Contig_M",
                     "Contig_C",
                     "Contig_F, Contig_G, Contig_N",
                     "Contig_E, Contig_F, Contig_D",
                     "Contig_N, Contig_O"),
               Path_ID=rep(c("map_A","map_B"),each=5))
> head(df)
                   Pathway
1          acid_metabolism
2          acid_metabolism
3          acid_metabolism
4          acid_metabolism
5          acid_metabolism
6 fatty acids biosynthesis
                                                                   Seq Path_ID
1                               Contig_A, Contig_B, Contig_C, Contig_D   map_A
2                                        Contig_C, Contig_E, Contig_F,   map_A
3 Contig_D, Contig_F, Contig_G, Contig_H, Contig_I, Contig_J, Contig_K   map_A
4                                                   Contig_C, Contig_D   map_A
5                                         Contig_H, Contig_I, Contig_J   map_A
6                               Contig_H, Contig_I, Contig_L, Contig_M   map_B

Step 1: Split the data from one column...

                                      Seq 
1  Contig_A, Contig_B, Contig_C, Contig_D   

...into multiple columns...

         V1        V2        V3        V4
1  Contig_A  Contig_B  Contig_C  Contig_D  

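The difficulty with my data is that the strings contain different numbers of column entries. The answer from @G. Grothendieck to the question "Split strings into columns in R where each string has a potentially different number of column entries" helped me here.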

df2<-cbind(df, read.table(text = as.character(df$Seq), sep = ",", fill = TRUE, as.is = TRUE))
> head(df2)
                   Pathway
1          acid_metabolism
2          acid_metabolism
3          acid_metabolism
4          acid_metabolism
5          acid_metabolism
6 fatty acids biosynthesis
                                                                   Seq Path_ID       V1
1                               Contig_A, Contig_B, Contig_C, Contig_D   map_A Contig_A
2                                        Contig_C, Contig_E, Contig_F,   map_A Contig_C
3 Contig_D, Contig_F, Contig_G, Contig_H, Contig_I, Contig_J, Contig_K   map_A Contig_D
4                                                   Contig_C, Contig_D   map_A Contig_C
5                                         Contig_H, Contig_I, Contig_J   map_A Contig_H
6                               Contig_H, Contig_I, Contig_L, Contig_M   map_B Contig_H
         V2        V3        V4        V5        V6        V7
1  Contig_B  Contig_C  Contig_D                              
2  Contig_E  Contig_F                                        
3  Contig_F  Contig_G  Contig_H  Contig_I  Contig_J  Contig_K
4  Contig_D                                                  
5  Contig_I  Contig_J                                        
6  Contig_I  Contig_L  Contig_M
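
One detail in the output above is easy to miss: because the entries are separated by a comma followed by a space, every column after V1 keeps a leading space (" Contig_B"), so the same contig can end up under two different spellings. The following is a minimal sketch, assuming the same read.table approach, of how strip.white = TRUE removes that whitespace. The names seq_strings, n_cols and cols are illustrative and not part of the original answer.

```r
# Illustrative input: three ragged, comma-separated strings (same layout as df$Seq).
seq_strings <- c("Contig_A, Contig_B, Contig_C, Contig_D",
                 "Contig_C, Contig_E, Contig_F,",
                 "Contig_C, Contig_D")

# Widest row decides how many columns are needed; passing col.names explicitly
# avoids relying on read.table's guess from the first few lines.
n_cols <- max(lengths(strsplit(seq_strings, ",")))

# Same call as in the answer, plus strip.white = TRUE to trim blanks around fields.
cols <- read.table(text = seq_strings, sep = ",", fill = TRUE, as.is = TRUE,
                   strip.white = TRUE, col.names = paste0("V", seq_len(n_cols)))

# Padding cells come back as empty strings; turn them into NA.
cols[cols == ""] <- NA
```

With the whitespace stripped and the padding converted to NA, the empty cells can be dropped after melting (for example with na.rm = TRUE), and identical contigs compare equal regardless of which column they came from.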

Step 2: Reshape the data so that each Contig gets its own row, together with its Pathway and Path_ID. The melt function from the reshape2 package solved this.

library(reshape2)
df2.m<-melt(df2, id.var = c("Pathway","Path_ID"))
> df2.m[11:20,]
                    Pathway Path_ID variable    value
11          acid_metabolism   map_A       V1 Contig_A
12          acid_metabolism   map_A       V1 Contig_C
13          acid_metabolism   map_A       V1 Contig_D
14          acid_metabolism   map_A       V1 Contig_C
15          acid_metabolism   map_A       V1 Contig_H
16 fatty acids biosynthesis   map_B       V1 Contig_H
17 fatty acids biosynthesis   map_B       V1 Contig_C
18 fatty acids biosynthesis   map_B       V1 Contig_F
19 fatty acids biosynthesis   map_B       V1 Contig_E
20 fatty acids biosynthesis   map_B       V1 Contig_N

Step 3: Subset the data to remove unnecessary columns (variables) and paste Pathway and Path_ID together.

df2.m.subset<-subset(df2.m, select=c("Pathway","Path_ID","value"))
df2.m.subset2<-data.frame(df2.m.subset$value, KEGG=paste(df2.m.subset$Pathway,df2.m.subset$Path_ID,sep="; "))
> df2.m.subset2[11:20,]
   df2.m.subset.value                            KEGG
11           Contig_A          acid_metabolism; map_A
12           Contig_C          acid_metabolism; map_A
13           Contig_D          acid_metabolism; map_A
14           Contig_C          acid_metabolism; map_A
15           Contig_H          acid_metabolism; map_A
16           Contig_H fatty acids biosynthesis; map_B
17           Contig_C fatty acids biosynthesis; map_B
18           Contig_F fatty acids biosynthesis; map_B
19           Contig_E fatty acids biosynthesis; map_B
20           Contig_N fatty acids biosynthesis; map_B

Step 4: Reshape the data so that each Contig is assigned all of its KEGG pathways. For this I used the dcast function from the reshape2 package. dcast needs unique values, so I first used unique from the base package...

df2.u<-unique(df2.m.subset2)

...and then dcast.
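
The dcast call itself is not shown above. As a rough sketch of what such a final step could look like (assuming reshape2 is installed), the example below casts a small long-format table to one row per contig. The data frame long, its column names, and the formula are illustrative assumptions, not the answerer's actual code.

```r
library(reshape2)

# Illustrative long-format data: one row per (contig, pathway) pair.
# unique() drops the repeated Contig_C / map_A row, since dcast needs
# unique combinations to avoid falling back to counting.
long <- unique(data.frame(
  Contig = c("Contig_A", "Contig_C", "Contig_C", "Contig_H", "Contig_C"),
  KEGG   = c("acid_metabolism; map_A", "acid_metabolism; map_A",
             "fatty acids biosynthesis; map_B", "fatty acids biosynthesis; map_B",
             "acid_metabolism; map_A"),
  stringsAsFactors = FALSE
))

# One row per contig, one column per pathway. A cell holds the pathway label
# if the contig belongs to that pathway and NA otherwise.
wide <- dcast(long, Contig ~ KEGG, value.var = "KEGG")
```

In this layout a contig that occurs in both pathways (Contig_C here) has both columns filled, while the others have NA in the pathway they do not belong to, which matches the shape requested in the question.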

Answer 1 (score: 0)

# the string is kept on one line so that no newline ends up inside a node name
stringSource <- "NODE_12982_length_530_cov_49.8358_ID_25963,NODE_24530_length_385_cov_7.38485_ID_49059,NODE_44451_length_263_cov_34.6298_ID_88901,NODE_19986_length_437_cov_5.82461_ID_39971,NODE_28195_length_354_cov_77.194_ID_56389"

# create df
df <- data.frame(
  "nodes" = unlist(strsplit(x = stringSource,","))
)

df$furtherColumn <- NA

The last line is a hint: now that you have a proper data frame, you can work with the object however you like.
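
For a data frame in which each row holds a pathway plus a comma-separated list of nodes, as in the edited question, the same strsplit idea can be combined with rep() so that the pathway information is not lost. The following is only a sketch under that assumption, using base R. The toy data frame toy (with shortened, made-up node names) and the objects nodes, long and per_node are hypothetical.

```r
# Toy data in the same shape as the question's df (node names are made up).
toy <- data.frame(
  Pathway    = c("alpha-Linolenic acid metabolism",
                 "Biosynthesis of unsaturated fatty acids"),
  Seq        = c("NODE_1, NODE_2", "NODE_2, NODE_3"),
  Pathway_ID = c("map00592", "map01040"),
  stringsAsFactors = FALSE
)

# Split each Seq string on commas, swallowing whitespace around them.
nodes <- strsplit(toy$Seq, "\\s*,\\s*")

# Repeat each row's pathway info once per node it contains: long format,
# one row per (node, pathway) pair, duplicates removed.
long <- unique(data.frame(
  Node       = unlist(nodes),
  Pathway_ID = rep(toy$Pathway_ID, lengths(nodes)),
  Pathway    = rep(toy$Pathway, lengths(nodes)),
  stringsAsFactors = FALSE
))

# Collapse to one row per node, listing every pathway it belongs to.
per_node <- aggregate(long[c("Pathway_ID", "Pathway")],
                      by = list(Node = long$Node),
                      FUN = paste, collapse = "; ")
```

Here per_node lists all pathways of a node in a single "; "-separated cell rather than in separate NA-padded columns; the long table is the natural input if a wide layout with one column per pathway is preferred.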