R: (Inexplicable) incorrect subsetting of data frame columns

Time: 2016-06-23 18:00:23

Tags: r dataframe subset

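I have a peculiar problem with a file in R. In short, although I subset the data frame, when I write the subset to a file, some rows contain many more columns than they should (there should be only 6).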

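The input file consists of 806 lines (including the header) and has 39 tab-delimited columns (I verified with awk that every line has 39 columns). Values from column 4 through the last column may or may not be present (so wherever you see blank space, there are tabs (^I)). A small portion of the input file is shown here: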

Gene_Name   Accession   Identified Proteins 201-L-W1_Inj_1  201-L-W1_Inj_2  201-L-W1_Inj_3  201-L-W2_Inj_1  201-L-W2_Inj_2  201-L-W2_Inj_3  201-L-W3_Inj_1  201-L-W3_Inj_2  201-L-W3_Inj_3  201-M9-W1_Inj_1 201-M9-W1_Inj_2 201-M9-W1_Inj_3 201-M9-W2-Inj-1 201-M9-W2_Inj_2 201-M9-W2_Inj_3 201-M9-W3_Inj_1 201-M9-W3_Inj_2 201-M9-W3_Inj_3 241-L-W1-Inj-1  241-L-W1-Inj-2  241-L-W1-Inj-3  241-L-W2_Inj_1  241-L-W2_Inj_2  241-L-W2_Inj_3  241-L-W3_Inj_1  241-L-W3_Inj_2  241-L-W3_Inj_3  241-M9-W1-Inj-1 241-M9-W1-Inj-2 241-M9-W1-Inj-3 241-M9-W2_Inj_1 241-M9-W2_Inj_2 241-M9-W2_Inj_3 241-M9-W3_Inj_1 241-M9-W3_Inj_2 241-M9-W3_Inj_3
pyrD    P0A7E2  PYRD_ECO57 Dihydroorotate dehydrogenase (quinone) OS=Escherichia coli O157:H7 GN=pyrD PE=3 SV=1                                                                                                 2.7779  1.397   0                                   
ECs3310 Q8XBI1  Q8XBI1_ECO57 Putative uncharacterized protein ECs3310 OS=Escherichia coli O157:H7 GN=ECs3310 PE=4 SV=1                          1.4577  0   12.44                                                                                                           
ECs1643 Q8X306  Q8X306_ECO57 Tail length tape measure protein OS=Escherichia coli O157:H7 GN=ECs1643 PE=4 SV=1                                                                          4.8211  7.3495  4.2218                                                          
kdsA    Q8XDE7  KDSA_ECO57 2-dehydro-3-deoxyphosphooctonate aldolase OS=Escherichia coli O157:H7 GN=kdsA PE=3 SV=1  0   3.0417  1.9614              6.1146  0   7.0412  2.5376  4.128   3.892   11.617  6.4643  9.7451  3.9381  9.3383  3.769   7.3208  6.2054  8.5019              7.4705  6.5698  0   0   6.6558  3.3947              3.4406  2.202   2.4065
accB    P0ABE0  BCCP_ECO57 Biotin carboxyl carrier protein of acetyl-CoA carboxylase OS=Escherichia coli O157:H7 GN=accB PE=3 SV=1  33.051  26.177  33.725  72.514  76.632  69.373  28.365  24.361  28.925  18.539  26.286  45.222  17.288  15.371  14.752              51.929  71.73   83.253  14.222  6.3663  14.639  0   15.463  15.532  40.591  46.665  6.1286  7.8726  6.3663  6.0564  3.9755  3.1308  5.1279
rpsH P0A7W9 RS8_ECO57 30S ribosomal protein S8 OS=Escherichia coli O157:H7 GN=rpsH PE=3 SV=2 25.085 22.069 22.847 13.212 12.468 17.123 0 13.804 11.955 15.179 12.011 12.65 41.011 40.82 39.526 26.52 9.9107 25.237 37.181 31.671 35.152 22.441 20.259 35.828 10.233 9.9107 9.154 11.521 10.518 10.781

Here is some code:

myDF <- read.table('corpus.txt', header = TRUE, sep = '\t', row.names = NULL, strip.white = TRUE)

df_201_L_W1 <- myDF[, c(1, 2, 3, 4, 5, 6)]
write.table(df_201_L_W1, 'test.txt', sep = '\t', row.names = FALSE, col.names = TRUE, quote = FALSE)

Here are some selected lines from the output file test.txt:

Gene_Name   Accession   Identified.Proteins X201.L.W1_Inj_1 X201.L.W1_Inj_2 X201.L.W1_Inj_3
pyrD    P0A7E2  PYRD_ECO57 Dihydroorotate dehydrogenase (quinone) OS=Escherichia coli O157:H7 GN=pyrD PE=3 SV=1 NA  NA  NA
ECs3310 Q8XBI1  Q8XBI1_ECO57 Putative uncharacterized protein ECs3310 OS=Escherichia coli O157:H7 GN=ECs3310 PE=4 SV=1  NA  NA  NA
bcp P0AE54 BCP_ECO57 Putative peroxiredoxin bcp OS=Escherichia coli O157:H7 GN=bcp PE=3 SV=1 21.141 20.656 21.848 19.244 22.566 24.825 21.479 39.426 31.104 21.106 15.923 18.584 21.353 20.523 22.85 40.793 39.367 43.937 18.917 16.638 19.231 25.408 28.161 26.875 55.172 57.421 10.651 15.704 16.638 16.5 22.632 21.546 22.463
rpsH P0A7W9 RS8_ECO57 30S ribosomal protein S8 OS=Escherichia coli O157:H7 GN=rpsH PE=3 SV=2 25.085 22.069 22.847

The second-to-last line contains far more than 6 columns, and I am at a loss as to why. As always, any help is much appreciated.

2 Answers:

Answer 0: (score: 3)

Protein names often contain single quotes, for example:

5'-methylthioadenosine phosphorylase  
ATP synthase B' chain
ppGpp 3'-pyrophosphohydrolase

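By default, read.table treats both ' and " as quote characters, so an unmatched apostrophe in a protein name can swallow the tabs and line breaks that follow it into a single field. So try adding quote="" to the read.table options.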
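
A minimal sketch of the effect, using a made-up four-line file rather than the asker's corpus.txt (the accessions and the sample columns S1 to S3 are invented for illustration). It compares the per-line field counts that count.fields reports with the default quote set against the counts with quoting disabled, then reads the file with quote="".

```r
# Hypothetical miniature of the input: tab-separated, with apostrophes
# inside the protein names (all values here are invented).
rows <- c(
  "Gene_Name\tAccession\tIdentified_Proteins\tS1\tS2\tS3",
  "mtnN\tA1\t5'-methylthioadenosine phosphorylase\t1.1\t2.2\t3.3",
  "rpsH\tA2\t30S ribosomal protein S8\t4.4\t5.5\t6.6",
  "atpF\tA3\tATP synthase B' chain\t7.7\t8.8\t9.9"
)
path <- tempfile(fileext = ".txt")
writeLines(rows, path)

# Field counts per line with the default quote characters (" and ')
# versus with quoting switched off entirely.
default_counts <- count.fields(path, sep = "\t")
plain_counts   <- count.fields(path, sep = "\t", quote = "")
print(default_counts)
print(plain_counts)

# Read with quoting disabled, so apostrophes are ordinary characters.
fixed <- read.table(path, header = TRUE, sep = "\t", quote = "",
                    stringsAsFactors = FALSE)

# Subsetting now yields exactly the requested columns on every row.
first_three <- fixed[, c(1, 2, 3)]
print(first_three)
```

Applied to the question's call, this would presumably become read.table('corpus.txt', header=TRUE, sep='\t', row.names=NULL, strip.white=TRUE, quote="").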

Answer 1: (score: 0)

In addition to Chris S's answer, using the readr package also produced the expected result.

library(readr)
myDF <- read_delim('corpus.txt', delim='\t')

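[Edit] As Ben pointed out, read.delim should work out of the box, since its default quote set contains only the double quote character.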
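
As a rough illustration of the read.delim route, here is a sketch on an invented three-line file (the gene names, accessions, and columns S1 and S2 are hypothetical stand-ins for the real data). It relies only on read.delim's documented defaults: sep = "\t", quote = "\"", and header = TRUE.

```r
# Hypothetical sample: apostrophes in the names and one empty numeric field.
rows <- c(
  "Gene_Name\tAccession\tIdentified_Proteins\tS1\tS2",
  "spoT\tA1\tppGpp 3'-pyrophosphohydrolase\t\t1.5",
  "atpF\tA2\tATP synthase B' chain\t2.5\t3.5"
)
path <- tempfile(fileext = ".txt")
writeLines(rows, path)

# read.delim defaults to tab separation and treats only " as a quote,
# so the apostrophes are read as ordinary characters.
df <- read.delim(path, stringsAsFactors = FALSE)

# The empty numeric field is read as NA rather than shifting columns.
print(df)
print(dim(df))
```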