tidyverse: does readr's read_delim choke on tabs and semicolons?

Time: 2018-06-17 05:06:12

Tags: r bioinformatics delimiter data-import readr

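I have a tab-delimited file of genetic variants in which the last field, Otherinfo, is a long semicolon-separated string. Somehow this causes readr to report parsing failures, as shown below. Is this expected behavior? How can I work around it?

Many thanks.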

> head myanno_AllChr_ExAC38.hg38_multianno.txt

Chr Start       End         Ref Alt ExAC_ALL    ExAC_AFR    ExAC_AMR    ExAC_EAS    ExAC_FIN    ExAC_NFE    ExAC_OTH    ExAC_SAS    Otherinfo
1   15847952    15847952    G   C   .   .   .   .   .   .   .   .   .   241.9   76196   1   15847952    .   G   C   241.9   PASS    AC=2;AF=0;AN=18332;BaseQRankSum=0.731;ClippingRankSum=-0.731;DP=76196;ExcessHet=3.1;FS=0;InbreedingCoeff=-0.0456;MLEAC=2;MLEAF=0;MQ=38.93;MQRankSum=0.515;NEGATIVE_TRAIN_SITE;QD=10.52;ReadPosRankSum=0.89;SOR=0.481;VQSLOD=-1.406;culprit=MQ
1   15847963    15847963    A   C   .   .   .   .   .   .   .   .   .   1607.1  126156  1   15847963    .   A   C   1607.1  PASS    AC=2;AF=0;AN=22004;BaseQRankSum=0.851;ClippingRankSum=-0.419;DP=126156;ExcessHet=3.4904;FS=0;InbreedingCoeff=0.0299;MLEAC=2;MLEAF=0;MQ=59.29;MQRankSum=0.18;QD=1.55;ReadPosRankSum=0.067;SOR=0.651;VQSLOD=0.995;culprit=QD
1   15847964    15847966    GCC -   .   .   .   .   .   .   .   .   .   1607.1  126156  1   15847963    .   AGCC    A   1607.1  PASS    AC=63;AF=0.003;AN=22004;BaseQRankSum=0.851;ClippingRankSum=-0.419;DP=126156;ExcessHet=3.4904;FS=0;InbreedingCoeff=0.0299;MLEAC=55;MLEAF=0.002;MQ=59.29;MQRankSum=0.18;QD=1.55;ReadPosRankSum=0.067;SOR=0.651;VQSLOD=0.995;culprit=QD
1   15847978    15847978    C   T   .   .   .   .   .   .   .   .   .   648.41  234344  1   15847978    .   C   T   648.41  PASS    AC=9;AF=0;AN=25894;BaseQRankSum=-0.572;ClippingRankSum=-0.404;DP=234344;ExcessHet=3.348;FS=2.639;InbreedingCoeff=-0.0098;MLEAC=6;MLEAF=0;MQ=58.71;MQRankSum=-0.456;NEGATIVE_TRAIN_SITE;QD=4.13;ReadPosRankSum=-0.456;SOR=0.452;VQSLOD=-1.238;culprit=QD
1   15847979    15847979    G   T   .   .   .   .   .   .   .   .   .   315.48  243578  1   15847979    .   G   T   315.48  PASS    AC=1;AF=0;AN=26062;BaseQRankSum=0.301;ClippingRankSum=0.356;DP=243578;ExcessHet=3.1213;FS=0;InbreedingCoeff=-0.0072;MLEAC=1;MLEAF=0;MQ=58.83;MQRankSum=-1.505;QD=12.62;ReadPosRankSum=0.684;SOR=0.495;VQSLOD=-0.1437;culprit=MQRankSum

Running the following command:

variant.freqs <- read_tsv("AlleleFrequencies_Populations/ExAC_annotation_allChr/myanno_AllChr_ExAC38.hg38_multianno.txt")

returns:

Parsed with column specification:
cols(
  Chr = col_integer(),
  Start = col_integer(),
  End = col_integer(),
  Ref = col_character(),
  Alt = col_character(),
  ExAC_ALL = col_character(),
  ExAC_AFR = col_character(),
  ExAC_AMR = col_character(),
  ExAC_EAS = col_character(),
  ExAC_FIN = col_character(),
  ExAC_NFE = col_character(),
  ExAC_OTH = col_character(),
  ExAC_SAS = col_character(),
  Otherinfo = col_character()
)

followed by this error:

number of columns of result is not a multiple of vector length (arg 1)
152306 parsing failures.
row # A tibble: 5 x 5 col     row col   expected   actual     file                                                                                           expected   <int> <chr> <chr>      <chr>      <chr>                                                                                          actual 1     1 NA    14 columns 24 columns 'AlleleFrequencies_Populations/ExAC_annotation_allChr/myanno_AllChr_ExAC38.hg38_multianno.txt' 
file 2     2 NA    14 columns 24 columns 'AlleleFrequencies_Populations/ExAC_annotation_allChr/myanno_AllChr_ExAC38.hg38_multianno.txt' 
row 3     3 NA    14 columns 24 columns 'AlleleFrequencies_Populations/ExAC_annotation_allChr/myanno_AllChr_ExAC38.hg38_multianno.txt' 
col 4     4 NA    14 columns 24 columns 'AlleleFrequencies_Populations/ExAC_annotation_allChr/myanno_AllChr_ExAC38.hg38_multianno.txt' 
expected 5     5 NA    14 columns 24 columns 'AlleleFrequencies_Populations/ExAC_annotation_allChr/myanno_AllChr_ExAC38.hg38_multianno.txt'


View(variant.freqs)


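2 answers:

Answer 0: (score: 4)

In the sample data, the first line has 14 tab-separated fields and the second line has 24. You do not have enough headers for the number of columns, so the problem is the header, not the semicolons.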

> fl = "foo.txt"
> lengths(strsplit(readLines(fl, 2), "\t"))
[1] 14 24
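The check above looks only at the first two lines. To see whether every data line has the same number of fields, base R's count.fields can tally the whole file. This is a minimal sketch; the temporary file and its placeholder contents are assumptions standing in for the real annotation file.

```r
# Hypothetical stand-in for the real file: 14 header names, 24 fields per data line.
fl <- tempfile(fileext = ".txt")
writeLines(c(
  paste(paste0("h", 1:14), collapse = "\t"),
  paste(paste0("a", 1:24), collapse = "\t"),
  paste(paste0("b", 1:24), collapse = "\t")
), fl)

# quote = "" and comment.char = "" make quotes and '#' ordinary characters,
# so only tabs split fields.
n_fields <- count.fields(fl, sep = "\t", quote = "", comment.char = "")

# One entry per distinct field count; more than two entries would mean ragged data rows.
field_counts <- table(n_fields)
print(field_counts)
```

If the table shows only the header count and a single data count, the data rows are consistent and only the header needs fixing.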

In more detail:

> res = strsplit(readLines(fl, 2), "\t")
> res[[1]][14]      # first line, final header
[1] "Otherinfo"
> res[[2]][14]      # second line, entry in position 14
[1] "."
> res[[2]][15]      # second line, entry in position 15
[1] "241.9"
> res[[2]][24]      # second line, entry in position 24
[1] "AC=2;AF=0;AN=18332;BaseQRankSum=0.731;ClippingRankSum=-0.731;DP=76196;ExcessHet=3.1;FS=0;InbreedingCoeff=-0.0456;MLEAC=2;MLEAF=0;MQ=38.93;MQRankSum=0.515;NEGATIVE_TRAIN_SITE;QD=10.52;ReadPosRankSum=0.89;SOR=0.481;VQSLOD=-1.406;culprit=MQ"
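One way to act on this diagnosis is to skip the short header and supply a full set of column names. The sketch below assumes the real file has the same layout as the sample (14 header names, 24 data fields). The temporary file, the shortened final field, and the Otherinfo1 to Otherinfo11 names are illustrative choices, not part of the original file.

```r
library(readr)

# Hypothetical two-line file mimicking the layout in the question.
fl <- tempfile(fileext = ".txt")
header <- c("Chr", "Start", "End", "Ref", "Alt",
            paste0("ExAC_", c("ALL", "AFR", "AMR", "EAS", "FIN", "NFE", "OTH", "SAS")),
            "Otherinfo")
row1 <- c("1", "15847952", "15847952", "G", "C", rep(".", 9),
          "241.9", "76196", "1", "15847952", ".", "G", "C", "241.9", "PASS",
          "AC=2;AF=0;AN=18332")
writeLines(c(paste(header, collapse = "\t"), paste(row1, collapse = "\t")), fl)

# Count fields in the header line and the first data line.
n_fields <- lengths(strsplit(readLines(fl, n = 2), "\t"))
n_header <- n_fields[1]
n_data <- n_fields[2]

# Keep the first 13 names and replace the single "Otherinfo" with numbered names
# (made up here) so that every data field has a column.
old_names <- strsplit(readLines(fl, n = 1), "\t")[[1]]
new_names <- c(old_names[-n_header],
               paste0("Otherinfo", seq_len(n_data - n_header + 1)))

# Skip the original header and read everything as character to avoid type guessing.
variant.freqs <- read_tsv(fl, skip = 1, col_names = new_names,
                          col_types = cols(.default = col_character()))
```

Reading every column as character is a conservative choice for the sketch; specific columns can be converted afterwards once the layout is confirmed.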

Answer 1: (score: 0)

This is a workaround using read.table, but it works for the moment:

test <- read.table(textConnection(gsub(",", "\t", readLines("C:/Users/.../Desktop/test.txt"))))

library(tidyr)
test_final <- test %>% 
              separate(V24, paste0("V24_",1:19), ";")
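Splitting the semicolon field by position can misalign values, because the sample rows do not all carry the same keys: the first data row includes the flag NEGATIVE_TRAIN_SITE and the second does not, so QD lands in a different position. An alternative is to parse each string into key/value pairs. The base R sketch below is one possible approach; parse_info is a hypothetical helper name, and recording valueless flags as "TRUE" is an assumption of this sketch.

```r
# Hypothetical helper: turn "K1=V1;FLAG;K2=V2" into a named character vector.
parse_info <- function(x) {
  parts <- strsplit(x, ";", fixed = TRUE)[[1]]
  has_value <- grepl("=", parts, fixed = TRUE)
  # The key is everything before the first "="; a bare flag is its own key.
  keys <- sub("=.*$", "", parts)
  # The value is everything after the first "="; bare flags are recorded as "TRUE".
  values <- ifelse(has_value, sub("^[^=]*=", "", parts), "TRUE")
  setNames(values, keys)
}

info <- parse_info("AC=2;AF=0;NEGATIVE_TRAIN_SITE;QD=10.52;culprit=MQ")
qd <- as.numeric(info[["QD"]])
print(info)
```

Looking values up by key (for example info[["QD"]]) gives the same result whether or not an optional flag is present in a given row.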