这是我第一次使用read.table遇到此问题:对于列数非常大的行条目,read.table将列条目循环到下一行。
我有一个带有变量和不等长度行的.txt文件。作为参考,这是我正在阅读的.txt文件:http://www.broadinstitute.org/gsea/msigdb/download_file.jsp?filePath=/resources/msigdb/4.0/c5.bp.v4.0.symbols.gmt
这是我的代码:
tabsep <- gsub("\\\\t", "\t", "\\t")
MSigDB.collection = read.table(fileName, header = FALSE, fill = TRUE, as.is = TRUE, sep = tabsep)
部分输出:第一列
V1 V2 V3 V4 V5 V6
1 TRNA_PROCESSING http://www.broadinstitute.org/gsea/msigdb/cards/TRNA_PROCESSING ADAT1 TRNT1 FARS2
2 REGULATION_OF_BIOLOGICAL_QUALITY http://www.broadinstitute.org/gsea/msigdb/cards/REGULATION_OF_BIOLOGICAL_QUALITY DLC1 ALS2 SLC9A7
3 DNA_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/DNA_METABOLIC_PROCESS XRCC5 XRCC4 RAD51C
4 AMINO_SUGAR_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/AMINO_SUGAR_METABOLIC_PROCESS UAP1 CHIA GNPDA1
5 BIOPOLYMER_CATABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/BIOPOLYMER_CATABOLIC_PROCESS BTRC HNRNPD USE1
6 RNA_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/RNA_METABOLIC_PROCESS HNRNPF HNRNPD SYNCRIP
7 INTS6 LSM5 LSM4 LSM3 LSM1
8 CRK
9 GLUCAN_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/GLUCAN_METABOLIC_PROCESS GCK PYGM GSK3B
10 PROTEIN_POLYUBIQUITINATION http://www.broadinstitute.org/gsea/msigdb/cards/PROTEIN_POLYUBIQUITINATION ERCC8 HUWE1 DZIP3
...
部分输出:最后一列
V403 V404 V405 V406 V407 V408 V409 V410 V411 V412 V413 V414 V415 V416 V417 V418 V419 V420 V421
1
2 CALCA CALCB FAM107A CDK11A RASGRP4 CDK11B SYN3 GP1BA TNN ENO1 PTPRC MTL5 ISOC2 RHAG VWF GPI HPX SLC5A7 F2R
3
4
5
6 IRF2 IRF3 SLC2A4RG LSM6 XRCC6 INTS1 HOXD13 RP9 INTS2 ZNF638 INTS3 ZNF254 CITED1 CITED2 INTS9 INTS8 INTS5 INTS4 INTS7
7 POU1F1 TCF7L2 TNFRSF1A NPAS2 HAND1 HAND2 NUDT21 APEX1 ENO1 ERF DTX1 SOX30 CBY1 DIS3 SP1 SP2 SP3 SP4 NFIC
8
9
10
例如,第6行的列条目循环以填充第7行和第8行。对于具有非常大的列数的行条目,我似乎只有这个问题。对于其他.txt文件也会发生这种情况,但它会在不同的列号处中断。我检查了中断发生的所有行条目,条目中没有异常字符(它们都是标准的大写基因符号)。
我尝试了read.table和read.delim,结果相同。如果我首先将.txt文件转换为.csv并使用相同的代码,我没有这个问题(请参阅下面的等效输出)。但我不想先将每个文件转换为.csv,而且我只是想了解发生了什么。
如果我转换为.csv文件,请更正输出:
MSigDB.collection = read.table(fileName, header = FALSE, fill = TRUE, as.is = TRUE, sep = ",")
V1 V2 V3 V4 V5 V6
1 TRNA_PROCESSING http://www.broadinstitute.org/gsea/msigdb/cards/TRNA_PROCESSING ADAT1 TRNT1 FARS2 METTL1
2 REGULATION_OF_BIOLOGICAL_QUALITY http://www.broadinstitute.org/gsea/msigdb/cards/REGULATION_OF_BIOLOGICAL_QUALITY DLC1 ALS2 SLC9A7 PTGS2
3 DNA_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/DNA_METABOLIC_PROCESS XRCC5 XRCC4 RAD51C XRCC3
4 AMINO_SUGAR_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/AMINO_SUGAR_METABOLIC_PROCESS UAP1 CHIA GNPDA1 GNE
5 BIOPOLYMER_CATABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/BIOPOLYMER_CATABOLIC_PROCESS BTRC HNRNPD USE1 RNASEH1
6 RNA_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/RNA_METABOLIC_PROCESS HNRNPF HNRNPD SYNCRIP MED24
7 GLUCAN_METABOLIC_PROCESS http://www.broadinstitute.org/gsea/msigdb/cards/GLUCAN_METABOLIC_PROCESS GCK PYGM GSK3B EPM2A
8 PROTEIN_POLYUBIQUITINATION http://www.broadinstitute.org/gsea/msigdb/cards/PROTEIN_POLYUBIQUITINATION ERCC8 HUWE1 DZIP3 DDB2
9 PROTEIN_OLIGOMERIZATION http://www.broadinstitute.org/gsea/msigdb/cards/PROTEIN_OLIGOMERIZATION SYT1 AASS TP63 HPRT1
答案 0 :(得分:4)
详细说明我的评论......
从帮助页面到read.table
:
数据列的数量是通过查看前五行输入(或整个文件,如果它少于五行)确定的,或者是
col.names
的长度(如果已指定并且更长) 。如果fill
或blank.lines.skip
为真,则可能会出错,因此请在必要时指定col.names
(如“示例”中所示)。
要使用未知数据集解决此问题,请使用count.fields
确定文件中分隔符的数量,并使用该分隔符为col.names
创建read.table
以使用:
x <- max(count.fields("~/Downloads/c5.bp.v4.0.symbols.gmt", "\t"))
Names <- paste("V", sequence(x), sep = "")
y <- read.table("~/Downloads/c5.bp.v4.0.symbols.gmt", col.names=Names, sep = "\t", fill = TRUE)
检查前几行。我将把实际的全面检查留给你。
y[1:6, 1:10]
# V1
# 1 TRNA_PROCESSING
# 2 REGULATION_OF_BIOLOGICAL_QUALITY
# 3 DNA_METABOLIC_PROCESS
# 4 AMINO_SUGAR_METABOLIC_PROCESS
# 5 BIOPOLYMER_CATABOLIC_PROCESS
# 6 RNA_METABOLIC_PROCESS
# V2 V3 V4
# 1 http://www.broadinstitute.org/gsea/msigdb/cards/TRNA_PROCESSING ADAT1 TRNT1
# 2 http://www.broadinstitute.org/gsea/msigdb/cards/REGULATION_OF_BIOLOGICAL_QUALITY DLC1 ALS2
# 3 http://www.broadinstitute.org/gsea/msigdb/cards/DNA_METABOLIC_PROCESS XRCC5 XRCC4
# 4 http://www.broadinstitute.org/gsea/msigdb/cards/AMINO_SUGAR_METABOLIC_PROCESS UAP1 CHIA
# 5 http://www.broadinstitute.org/gsea/msigdb/cards/BIOPOLYMER_CATABOLIC_PROCESS BTRC HNRNPD
# 6 http://www.broadinstitute.org/gsea/msigdb/cards/RNA_METABOLIC_PROCESS HNRNPF HNRNPD
# V5 V6 V7 V8 V9 V10
# 1 FARS2 METTL1 SARS AARS THG1L SSB
# 2 SLC9A7 PTGS2 PTGS1 MPV17 SGMS1 AGTR1
# 3 RAD51C XRCC3 XRCC2 XRCC6 ISG20 PRIM1
# 4 GNPDA1 GNE CSGALNACT1 CHST2 CHST4 CHST5
# 5 USE1 RNASEH1 RNF217 ISG20 CDKN2A CPA2
# 6 SYNCRIP MED24 RORB MED23 REST MED21
nrow(y)
# [1] 825
对于那些不想下载其他文件进行试用的人来说,这是一个最小的例子。
创建一个6行CSV文件,其中最后一行包含的字段多于前5行,并尝试在其上使用read.table
:
cat("1,2,3,4", "1,2,3,4", "1,2,3,4", "1,2,3,4",
"1,2,3,4", "1,2,3,4,5", file = "test1.txt",
sep = "\n")
read.table("test1.txt", header = FALSE, sep = ",", fill = TRUE)
# V1 V2 V3 V4
# 1 1 2 3 4
# 2 1 2 3 4
# 3 1 2 3 4
# 4 1 2 3 4
# 5 1 2 3 4
# 6 1 2 3 4
# 7 5 NA NA NA
注意与最长行是否在文件的前五行中的区别:
cat("1,2,3,4", "1,2,3,4,5", "1,2,3,4", "1,2,3,4",
"1,2,3,4", "1,2,3,4", file = "test2.txt",
sep = "\n")
read.table("test2.txt", header = FALSE, sep = ",", fill = TRUE)
# V1 V2 V3 V4 V5
# 1 1 2 3 4 NA
# 2 1 2 3 4 5
# 3 1 2 3 4 NA
# 4 1 2 3 4 NA
# 5 1 2 3 4 NA
# 6 1 2 3 4 NA
要解决此问题,我们使用count.fields
返回每行中检测到的字段数的向量。我们从中获取max
并将其传递给col.names
的{{1}}参数。
read.table