我有一个问题,我自己无法解决。在这里http://www.filedropper.com/data_31您可以下载我的数据。它是一个小的txt文件,包含有关Pathway,Seqs in Pathway,Enzyme,Enzyme ID,Seqs of Enzyme,Seqs Pathway ID的信息。
我想重塑/重新组织我的数据,所以它看起来像这样:
NODE_1114.... map00592 alpha-Linolenic acid metabolism map01040 Biosynthesis of unsaturated fatty acids
NODE_11280... map00592 alpha-Linolenic acid metabolism NA NA
NODE_1307.... NA NA map01040 Biosynthesis of unsaturated fatty acids
问题是我不知道如何从这个
重新组织我的数据NODE_12982_length_530_cov_49.8358_ID_25963,NODE_24530_length_385_cov_7.38485_ID_49059,NODE_44451_length_263_cov_34.6298_ID_88901,NODE_19986_length_437_cov_5.82461_ID_39971,
NODE_28195_length_354_cov_77.194_ID_56389
到这个
NODE_12982_length_530_cov_49.8358_ID_25963
NODE_24530_length_385_cov_7.38485_ID_49059
NODE_44451_length_263_cov_34.6298_ID_88901
NODE_19986_length_437_cov_5.82461_ID_39971
NODE_28195_length_354_cov_77.194_ID_56389
以及如何向每个Seqs(NODE ...)添加有关Pathway和Pathway ID的其他信息。
感谢您的帮助!
感谢Imo& nilsole为你的答案,但你错过了这一点 这是我的数据代码:
Pathway<-rep(c("alpha-Linolenic acid metabolism","Biosynthesis of unsaturated fatty acids"), each=5)
Seq<-c("NODE_12982_length_530_cov_49.8358_ID_25963, NODE_24530_length_385_cov_7.38485_ID_49059, NODE_44451_length_263_cov_34.6298_ID_88901, NODE_19986_length_437_cov_5.82461_ID_39971, NODE_28195_length_354_cov_77.194_ID_56389","NODE_8410_length_627_cov_229.406_ID_16819, NODE_3911_length_812_cov_32.037_ID_7821, NODE_13098_length_528_cov_13.4376_ID_26195, NODE_956_length_1151_cov_11.6797_ID_1911, NODE_4501_length_777_cov_61.2355_ID_9001, NODE_60851_length_208_cov_61.9935_ID_121701, NODE_50593_length_239_cov_608.397_ID_101185, NODE_29294_length_345_cov_1.22069_ID_58587, NODE_57887_length_216_cov_22.6087_ID_115773, NODE_14782_length_501_cov_3.03139_ID_29563, NODE_18662_length_451_cov_798.495_ID_37323, NODE_26461_length_368_cov_3.02556_ID_52921, NODE_56026_length_221_cov_2.91566_ID_112051, NODE_12405_length_540_cov_270.652_ID_24809, NODE_2990_length_874_cov_45.3675_ID_5979, NODE_4753_length_763_cov_7.11864_ID_9505, NODE_17275_length_467_cov_4.0267_ID_34549, NODE_21751_length_416_cov_41.4155_ID_43501, NODE_53355_length_230_cov_19.48_ID_106709, NODE_49191_length_244_cov_1.51852_ID_98381"
,"NODE_61001_length_208_cov_76.3987_ID_122001, NODE_14350_length_507_cov_66.9845_ID_28699, NODE_16148_length_482_cov_189.293_ID_32295, NODE_42206_length_273_cov_135.404_ID_84411, NODE_11280_length_561_cov_335.174_ID_22559, NODE_21858_length_415_cov_31.0306_ID_43715, NODE_824_length_1186_cov_6.48364_ID_1647, NODE_41473_length_276_cov_2.73303_ID_82945, NODE_46025_length_257_cov_166.455_ID_92049",
"NODE_32320_length_325_cov_56.6037_ID_64639, NODE_38741_length_289_cov_27.1795_ID_77481, NODE_9047_length_611_cov_12.6511_ID_18093, NODE_1114_length_1113_cov_24.6059_ID_2227, NODE_47802_length_250_cov_40.8513_ID_95603, NODE_60092_length_210_cov_142.471_ID_120183, NODE_28312_length_353_cov_8.38926_ID_56623",
"NODE_4925_length_754_cov_2.56509_ID_9849, NODE_16010_length_484_cov_322.536_ID_32019, NODE_51261_length_237_cov_33.9011_ID_102521, NODE_19986_length_437_cov_5.82461_ID_39971, NODE_1384_length_1058_cov_1.86939_ID_2767",
"NODE_12982_length_530_cov_49.8358_ID_25963, NODE_24530_length_385_cov_7.38485_ID_49059, NODE_44451_length_263_cov_34.6298_ID_88901, NODE_19986_length_437_cov_5.82461_ID_39971, NODE_28195_length_354_cov_77.194_ID_56389",
"NODE_32320_length_325_cov_56.6037_ID_64639, NODE_38741_length_289_cov_27.1795_ID_77481, NODE_9047_length_611_cov_12.6511_ID_18093, NODE_1114_length_1113_cov_24.6059_ID_2227, NODE_47802_length_250_cov_40.8513_ID_95603, NODE_60092_length_210_cov_142.471_ID_120183, NODE_28312_length_353_cov_8.38926_ID_56623",
"NODE_1114_length_1113_cov_24.6059_ID_2227, NODE_28195_length_354_cov_77.194_ID_56389",
"NODE_1307_length_1072_cov_19.1504_ID_2613, NODE_3418_length_843_cov_15.3959_ID_6835","NODE_4925_length_754_cov_2.56509_ID_9849, NODE_16010_length_484_cov_322.536_ID_32019, NODE_51261_length_237_cov_33.9011_ID_102521, NODE_19986_length_437_cov_5.82461_ID_39971, NODE_1384_length_1058_cov_1.86939_ID_2767")
Pathway_ID<-rep(c("map00592","map01040"),each=5)
df<-data.frame(Pathway,Seq,Pathway_ID)
数据如下所示:
Pathway Seq Pathway_ID
aplha-Linolenic acid metabolism NODE_12982...,NODE_8410.. map00592
aplha-Linolenic acid metabolism NODE....,NODE... map00592
aplha-Linolenic acid metabolism NODE....,NODE... map00592
aplha-Linolenic acid metabolism NODE....,NODE... map00592
aplha-Linolenic acid metabolism NODE....,NODE... map00592
Biosynthesis of unsaturated fatty acids NODE....,NODE... map01040
Biosynthesis of unsaturated fatty acids NODE....,NODE... map01040
Biosynthesis of unsaturated fatty acids NODE....,NODE... map01040
Biosynthesis of unsaturated fatty acids NODE....,NODE... map01040
Biosynthesis of unsaturated fatty acids NODE....,NODE... map01040
我想看起来像这样:
NODE_1114.... map00592 alpha-Linolenic acid metabolism map01040 Biosynthesis of unsaturated fatty acids
NODE_11280... map00592 alpha-Linolenic acid metabolism NA NA
NODE_1307.... NA NA map01040 Biosynthesis of unsaturated fatty acids
使用strsplit
功能,我丢失了NODE...
每个Pathway
和Pathway_ID
所属的信息。在Seq列中有NODEs
的不同数量,在这种情况下,一些NODEs
也属于这两个路径我希望将两个路径分配给特定的NODE
;像这样
NODE_1114.... map00592 alpha-Linolenic acid metabolism map01040 Biosynthesis of unsaturated fatty acids
我希望你能帮助我!谢谢!
答案 0 :(得分:2)
我在reshape2
包的帮助下以及cbind
,unique
和subset
等基本元素的帮助下找到了解决方案。
df<-data.frame(Pathway=rep(c("acid_metabolism", "fatty acids biosynthesis"), each=5),
Seq=c("Contig_A, Contig_B, Contig_C, Contig_D", "Contig_C, Contig_E,
Contig_F,","Contig_D, Contig_F, Contig_G, Contig_H, Contig_I,
Contig_J, Contig_K","Contig_C, Contig_D","Contig_H, Contig_I,
Contig_J","Contig_H, Contig_I, Contig_L, Contig_M","Contig_C",
"Contig_F, Contig_G, Contig_N","Contig_E, Contig_F, Contig_D",
"Contig_N, Contig_O"),Path_ID=rep(c("map_A","map_B"),each=5))
> head(df)
Pathway
1 acid_metabolism
2 acid_metabolism
3 acid_metabolism
4 acid_metabolism
5 acid_metabolism
6 fatty acids biosynthesis
Seq Path_ID
1 Contig_A, Contig_B, Contig_C, Contig_D map_A
2 Contig_C, Contig_E, Contig_F, map_A
3 Contig_D, Contig_F, Contig_G, Contig_H, Contig_I, Contig_J, Contig_K map_A
4 Contig_C, Contig_D map_A
5 Contig_H, Contig_I, Contig_J map_A
6 Contig_H, Contig_I, Contig_L, Contig_M map_B
第1步:从一列中分割数据......
Seq
1 Contig_A, Contig_B, Contig_C, Contig_D
到多列...
V1 V2 V3 V4
1 Contig_A Contig_B Contig_C Contig_D
我的数据存在的问题是列条目数不同的字符串。来自@G的回答。格洛腾迪克以Split strings into columns in R where each string has a potentially different number of column entries的问题帮助了我。
df2<-cbind(df, read.table(text = as.character(df$Seq), sep = ",", fill = TRUE, as.is = TRUE))
> head(df2)
Pathway
1 acid_metabolism
2 acid_metabolism
3 acid_metabolism
4 acid_metabolism
5 acid_metabolism
6 fatty acids biosynthesis
Seq Path_ID V1
1 Contig_A, Contig_B, Contig_C, Contig_D map_A Contig_A
2 Contig_C, Contig_E, Contig_F, map_A Contig_C
3 Contig_D, Contig_F, Contig_G, Contig_H, Contig_I, Contig_J, Contig_K map_A Contig_D
4 Contig_C, Contig_D map_A Contig_C
5 Contig_H, Contig_I, Contig_J map_A Contig_H
6 Contig_H, Contig_I, Contig_L, Contig_M map_B Contig_H
V2 V3 V4 V5 V6 V7
1 Contig_B Contig_C Contig_D
2 Contig_E Contig_F
3 Contig_F Contig_G Contig_H Contig_I Contig_J Contig_K
4 Contig_D
5 Contig_I Contig_J
6 Contig_I Contig_L Contig_M
步骤2:重塑数据,使每个Contig成为一个字符串,其中包含Pathway和Path_ID的附加信息。来自melt
的{{1}}函数解决了这个问题。
reshape2
步骤3:将数据分组以除去不必要的列(变量)并将Pathway和Path_ID一起调和。
df2.m<-melt(df2, id.var = c("Pathway","Path_ID"))
> df2.m[11:20,]
Pathway Path_ID variable value
11 acid_metabolism map_A V1 Contig_A
12 acid_metabolism map_A V1 Contig_C
13 acid_metabolism map_A V1 Contig_D
14 acid_metabolism map_A V1 Contig_C
15 acid_metabolism map_A V1 Contig_H
16 fatty acids biosynthesis map_B V1 Contig_H
17 fatty acids biosynthesis map_B V1 Contig_C
18 fatty acids biosynthesis map_B V1 Contig_F
19 fatty acids biosynthesis map_B V1 Contig_E
20 fatty acids biosynthesis map_B V1 Contig_N
步骤4:以每个Contig将分配所有KEGG途径的方式重塑数据。为此,我使用了df2.m.subset<-subset(df2.m, select=c("Pathway","Path_ID","value"))
df2.m.subset2<-data.frame(df2.m.subset$value,
KEGG=paste(df2.m.subset$Pathway,df2.m.subset$Path_ID,sep="; "))
> df2.m.subset2[11:20,]
df2.m.subset.value KEGG
11 Contig_A acid_metabolism; map_A
12 Contig_C acid_metabolism; map_A
13 Contig_D acid_metabolism; map_A
14 Contig_C acid_metabolism; map_A
15 Contig_H acid_metabolism; map_A
16 Contig_H fatty acids biosynthesis; map_B
17 Contig_C fatty acids biosynthesis; map_B
18 Contig_F fatty acids biosynthesis; map_B
19 Contig_E fatty acids biosynthesis; map_B
20 Contig_N fatty acids biosynthesis; map_B
包中的dcast
函数。 reshape2
需要唯一值。所以我使用了基础包中的dcast
...
unique
...然后df2.u<-unique(df2.m.subset2)
dcast
答案 1 :(得分:0)
stringSource <- "NODE_12982_length_530_cov_49.8358_ID_25963,NODE_24530_length_385_cov_7.38485_ID_49059,NODE_44451_length_263_cov_34.6298_ID_88901,NODE_19986_length_437_cov_5.82461_ID_39971,
NODE_28195_length_354_cov_77.194_ID_56389"
# create df
df <- data.frame(
"nodes" = unlist(strsplit(x = stringSource,","))
)
df$furtherColumn <- NA
最后一行暗示:现在你有了一个合适的数据框,随时可以使用该对象。