nrow()在R中提供比原始行更多的行

时间:2013-06-13 19:44:57

标签: r read.table

我有一个文件,第一行有20个字段作为标题。其余行具有不等数量的字段,一些行具有比标题更多的列。当我尝试使用read.delim()读取它时,它会无错误地读取数据,但总行数超过原始数量。

以下是该文件的几行:

Chromosome   Position    SNPid   Reference   Alternate   QUAL    Homozygosity    Tool    Depth   MappingQuality  EFFECT  IMPACT  FUNCTIONAL_CLASS    CODON_CHANGE    AMINO_ACID_CHANGE   GENE_NAME   GENE_BIOTYPE    GENE_CODING     TRANSCRIPT_ID   EXON_ID    
chr1    403111  .   G   A   24  het SAM 20  55  INTERGENIC  MODIFIER    _   _   _   _   _   _   _   _   _
chr1    602567  rs21953190  A   G   3265.77 hom GATKSAM 91  58.46   SYNONYMOUS_CODING   LOW SILENT  gaT/gaC D1034   ADNP2   protein_coding  CODING  ENSCAFT00000000008  5   _
chr1    604894  rs21953191  A   G   2869.77 hom GATKSAM 77  59.70   NON_SYNONYMOUS_CODING   MODERATE    MISSENSE    Ttt/Ctt F259L   ADNP2   protein_coding  CODING  ENSCAFT00000000008  5   _
chr1    758630  .   T   TC  1531.73 hom GATKSAM 38  46.20   INTRON  MODIFIER    _   _   _   PQLC1   protein_coding  CODING  ENSCAFT00000000011  2   _
chr1    800715  .   C   CT  514.73  hom GATKSAM 13  60.00   INTRON  MODIFIER    _   _   _   PQLC1   protein_coding  CODING  ENSCAFT00000000011  6   ,SPLICE_SITE_ACCEPTOR   HIGH    _   _   _   PQLC1   protein_coding  CODING  ENSCAFT00000000011  7   ,SPLICE_SITE_DONOR  HIGH    _   _   _   PQLC1   protein_coding  CODING  ENSCAFT00000000011  6   _
chr1    1104035 rs21966859  G   A   3803.77 hom GATKSAM 97  57.97   INTRON  MODIFIER    _   _   _   NFATC1  protein_coding  CODING  ENSCAFT00000000013  2   ,INTRON MODIFIER    _   _   _   NFATC1  protein_coding  CODING  ENSCAFT00000036234  2   _
chr1    1120994 .   CGCG    C   604.73  hom GATKSAM 21  56.55   INTERGENIC  MODIFIER    _   _   _   _   _   _   _   _   ,UPSTREAM   MODIFIER    _   _   _   NFATC1  protein_coding  CODING  ENSCAFT00000000013  _   ,UPSTREAM   MODIFIER    _   _   _   NFATC1  protein_coding  CODING  ENSCAFT00000036234  _   _
chr1    1136916 rs21935602  G   A   3899.77 hom GATKSAM 101 59.17   DOWNSTREAM  MODIFIER    _   _   _   ATP9B   protein_coding  CODING  ENSCAFT00000000014  _   ,DOWNSTREAM MODIFIER    _   _   _   ATP9B   protein_coding  CODING  ENSCAFT00000042968  _   ,UTR_3_PRIME    MODIFIER    _   _   _   ATP9B   protein_coding  CODING  ENSCAFT00000046825  29  _

文件中有9行。但是当在R中读取并计算行数时,它显示为12.

read.delim("test.txt",header=T,sep='\t')->data
nrow(data)

有人可以帮忙,正确阅读数据吗?

以下是dput(data)

的输出
> dput(data)
structure(list(Chromosome = structure(c(3L, 3L, 3L, 3L, 3L, 1L, 
3L, 2L, 3L, 2L, 3L, 2L), .Label = c("HIGH", "MODIFIER", "chr1"
), class = "factor"), Position = structure(c(4L, 5L, 6L, 7L, 
8L, 9L, 1L, 9L, 2L, 9L, 3L, 9L), .Label = c("1104035", "1120994", 
"1136916", "403111", "602567", "604894", "758630", "800715", 
"_"), class = "factor"), SNPid = structure(c(1L, 4L, 5L, 1L, 
1L, 2L, 6L, 2L, 1L, 2L, 3L, 2L), .Label = c(".", "_", "rs21935602", 
"rs21953190", "rs21953191", "rs21966859"), class = "factor"), 
Reference = structure(c(4L, 1L, 1L, 5L, 2L, 6L, 4L, 6L, 3L, 
6L, 4L, 6L), .Label = c("A", "C", "CGCG", "G", "T", "_"), class = "factor"), 
Alternate = structure(c(1L, 5L, 5L, 8L, 4L, 7L, 1L, 6L, 3L, 
6L, 1L, 2L), .Label = c("A", "ATP9B", "C", "CT", "G", "NFATC1", 
"PQLC1", "TC"), class = "factor"), QUAL = structure(c(2L, 
4L, 3L, 1L, 7L, 9L, 5L, 9L, 8L, 9L, 6L, 9L), .Label = c("1531.73", 
"24", "2869.77", "3265.77", "3803.77", "3899.77", "514.73", 
"604.73", "protein_coding"), class = "factor"), Homozygosity = structure(c(2L, 
3L, 3L, 3L, 3L, 1L, 3L, 1L, 3L, 1L, 3L, 1L), .Label = c("CODING", 
"het", "hom"), class = "factor"), Tool = structure(c(6L, 
5L, 5L, 5L, 5L, 1L, 5L, 3L, 5L, 2L, 5L, 4L), .Label = c("ENSCAFT00000000011", 
"ENSCAFT00000000013", "ENSCAFT00000036234", "ENSCAFT00000042968", 
"GATKSAM", "SAM"), class = "factor"), Depth = structure(c(4L, 
9L, 8L, 6L, 2L, 7L, 10L, 3L, 5L, 11L, 1L, 11L), .Label = c("101", 
"13", "2", "20", "21", "38", "7", "77", "91", "97", "_"), class = "factor"), 
MappingQuality = structure(c(5L, 8L, 10L, 4L, 11L, 1L, 7L, 
12L, 6L, 2L, 9L, 3L), .Label = c(",SPLICE_SITE_DONOR", ",UPSTREAM", 
",UTR_3_PRIME", "46.20", "55", "56.55", "57.97", "58.46", 
"59.17", "59.70", "60.00", "_"), class = "factor"), EFFECT = structure(c(4L, 
8L, 7L, 5L, 5L, 3L, 5L, 1L, 4L, 6L, 2L, 6L), .Label = c("", 
"DOWNSTREAM", "HIGH", "INTERGENIC", "INTRON", "MODIFIER", 
"NON_SYNONYMOUS_CODING", "SYNONYMOUS_CODING"), class = "factor"), 
IMPACT = structure(c(4L, 2L, 3L, 4L, 4L, 5L, 4L, 1L, 4L, 
5L, 4L, 5L), .Label = c("", "LOW", "MODERATE", "MODIFIER", 
"_"), class = "factor"), FUNCTIONAL_CLASS = structure(c(4L, 
3L, 2L, 4L, 4L, 4L, 4L, 1L, 4L, 4L, 4L, 4L), .Label = c("", 
"MISSENSE", "SILENT", "_"), class = "factor"), CODON_CHANGE = structure(c(3L, 
4L, 2L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 3L), .Label = c("", 
"Ttt/Ctt", "_", "gaT/gaC"), class = "factor"), AMINO_ACID_CHANGE = structure(c(7L, 
3L, 4L, 7L, 7L, 6L, 7L, 1L, 7L, 5L, 7L, 2L), .Label = c("", 
"ATP9B", "D1034", "F259L", "NFATC1", "PQLC1", "_"), class = "factor"), 
GENE_NAME = structure(c(6L, 2L, 2L, 5L, 5L, 7L, 4L, 1L, 6L, 
7L, 3L, 7L), .Label = c("", "ADNP2", "ATP9B", "NFATC1", "PQLC1", 
"_", "protein_coding"), class = "factor"), GENE_BIOTYPE = structure(c(3L, 
4L, 4L, 4L, 4L, 2L, 4L, 1L, 3L, 2L, 4L, 2L), .Label = c("", 
"CODING", "_", "protein_coding"), class = "factor"), GENE_CODING = structure(c(6L, 
2L, 2L, 2L, 2L, 3L, 2L, 1L, 6L, 4L, 2L, 5L), .Label = c("", 
"CODING", "ENSCAFT00000000011", "ENSCAFT00000036234", "ENSCAFT00000046825", 
"_"), class = "factor"), TRANSCRIPT_ID = structure(c(8L, 
4L, 4L, 5L, 5L, 3L, 6L, 1L, 8L, 8L, 7L, 2L), .Label = c("", 
"29", "6", "ENSCAFT00000000008", "ENSCAFT00000000011", "ENSCAFT00000000013", 
"ENSCAFT00000000014", "_"), class = "factor"), EXON_ID = structure(c(5L, 
3L, 3L, 2L, 4L, 5L, 2L, 1L, 5L, 5L, 5L, 5L), .Label = c("", 
"2", "5", "6", "_"), class = "factor"), X = structure(c(6L, 
6L, 6L, 6L, 4L, 1L, 3L, 1L, 5L, 1L, 2L, 1L), .Label = c("", 
",DOWNSTREAM", ",INTRON", ",SPLICE_SITE_ACCEPTOR", ",UPSTREAM", 
"_"), class = "factor")), .Names = c("Chromosome", "Position", 
"SNPid", "Reference", "Alternate", "QUAL", "Homozygosity", "Tool", 
"Depth", "MappingQuality", "EFFECT", "IMPACT", "FUNCTIONAL_CLASS", 
"CODON_CHANGE", "AMINO_ACID_CHANGE", "GENE_NAME", "GENE_BIOTYPE", 
"GENE_CODING", "TRANSCRIPT_ID", "EXON_ID", "X"), class = "data.frame", row.names = c(NA, 
-12L))

2 个答案:

答案 0 :(得分:2)

R认为每行有21个而不是20个字段(每行可能有尾随标签?),而你的第6-9行还有其他字段:

 count.fields("test.txt",sep="\t")
## [1] 21 21 21 21 21 41 31 41 41

这会混淆read.delim,它会试图猜测前5行发生了什么(它不应该,但就是这样)。您可能认为可以使用fill=TRUE来解决此问题,但您不能。

我尝试使用colClassesfill=TRUE来指定字段类型(我使用colClasses=rep("character",41),但您可能猜得更好),但它似乎不起作用,可能是因为你的标题只有21列。

fread包中的data.table函数可以做得更好,但只有当你告诉它不要尝试从#5之后的行中猜测格式时,它才会丢弃列中的数据超过21岁。

library(data.table)
nrow(fread("test.txt",autostart=5))  ## 9

嗯,即使这样也没有按预期工作(即使我设置了header=TRUE,它也没有正确地获取标题,可能是因为第21列没有标题字段...底线是您可能需要弄清楚那些额外的字段是什么,并用它们做更明确的事情(例如添加标题字段......)

基本上,R期望您的数据非常干净。将此示例发送给data.table包的维护者可能是值得的,他们试图使fread尽可能健壮......这将是一个挑战。

答案 1 :(得分:2)

查看数据,您可以看到它被许多融合线高度“变异”。在很多情况下,这些都是以逗号的形式表示的。我认为这些数据的格式与您预期的不同。您在dput数据中的第一个元素是染色体值= c(“HIGH”,“MODIFIER”,“chr1”)的因子。这不是一个明智的结果,指出你对原始数据的组织缺乏了解。您应该将原始文本文件发布到可以通过Internet访问的位置,以便可以检查原始布局。特别是您认为是分隔符的选项卡要么不存在,要么没有被SO接口捕获。

在指向数据样本之后,您应该通过编辑将其放入问题正文中,请尝试删除逗号后面的注释:

 datL <- readLines("~/Downloads/test.txt")
 datLred <- gsub("[,].+$", "", datL)
 read.delim(text=datLred)

> str(read.delim(text=datLred) )
'data.frame':   8 obs. of  21 variables:
 $ Chromosome       : Factor w/ 1 level "chr1": 1 1 1 1 1 1 1 1
 $ Position         : int  403111 602567 604894 758630 800715 1104035 1120994 1136916
 $ SNPid            : Factor w/ 5 levels ".","rs21935602",..: 1 3 4 1 1 5 1 2
 $ Reference        : Factor w/ 5 levels "A","C","CGCG",..: 4 1 1 5 2 4 3 4
 $ Alternate        : Factor w/ 5 levels "A","C","CT","G",..: 1 4 4 5 3 1 2 1
snipped remain columns