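I have a tab-delimited genetic variant file whose last field, Otherinfo,
is a long semicolon-separated string. Somehow this causes readr
to fail with the errors shown below. Is this expected behaviour? How can I work around it?
Many thanks.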
> head myanno_AllChr_ExAC38.hg38_multianno.txt
Chr Start End Ref Alt ExAC_ALL ExAC_AFR ExAC_AMR ExAC_EAS ExAC_FIN ExAC_NFE ExAC_OTH ExAC_SAS Otherinfo
1 15847952 15847952 G C . . . . . . . . . 241.9 76196 1 15847952 . G C 241.9 PASS AC=2;AF=0;AN=18332;BaseQRankSum=0.731;ClippingRankSum=-0.731;DP=76196;ExcessHet=3.1;FS=0;InbreedingCoeff=-0.0456;MLEAC=2;MLEAF=0;MQ=38.93;MQRankSum=0.515;NEGATIVE_TRAIN_SITE;QD=10.52;ReadPosRankSum=0.89;SOR=0.481;VQSLOD=-1.406;culprit=MQ
1 15847963 15847963 A C . . . . . . . . . 1607.1 126156 1 15847963 . A C 1607.1 PASS AC=2;AF=0;AN=22004;BaseQRankSum=0.851;ClippingRankSum=-0.419;DP=126156;ExcessHet=3.4904;FS=0;InbreedingCoeff=0.0299;MLEAC=2;MLEAF=0;MQ=59.29;MQRankSum=0.18;QD=1.55;ReadPosRankSum=0.067;SOR=0.651;VQSLOD=0.995;culprit=QD
1 15847964 15847966 GCC - . . . . . . . . . 1607.1 126156 1 15847963 . AGCC A 1607.1 PASS AC=63;AF=0.003;AN=22004;BaseQRankSum=0.851;ClippingRankSum=-0.419;DP=126156;ExcessHet=3.4904;FS=0;InbreedingCoeff=0.0299;MLEAC=55;MLEAF=0.002;MQ=59.29;MQRankSum=0.18;QD=1.55;ReadPosRankSum=0.067;SOR=0.651;VQSLOD=0.995;culprit=QD
1 15847978 15847978 C T . . . . . . . . . 648.41 234344 1 15847978 . C T 648.41 PASS AC=9;AF=0;AN=25894;BaseQRankSum=-0.572;ClippingRankSum=-0.404;DP=234344;ExcessHet=3.348;FS=2.639;InbreedingCoeff=-0.0098;MLEAC=6;MLEAF=0;MQ=58.71;MQRankSum=-0.456;NEGATIVE_TRAIN_SITE;QD=4.13;ReadPosRankSum=-0.456;SOR=0.452;VQSLOD=-1.238;culprit=QD
1 15847979 15847979 G T . . . . . . . . . 315.48 243578 1 15847979 . G T 315.48 PASS AC=1;AF=0;AN=26062;BaseQRankSum=0.301;ClippingRankSum=0.356;DP=243578;ExcessHet=3.1213;FS=0;InbreedingCoeff=-0.0072;MLEAC=1;MLEAF=0;MQ=58.83;MQRankSum=-1.505;QD=12.62;ReadPosRankSum=0.684;SOR=0.495;VQSLOD=-0.1437;culprit=MQRankSum
Running the following command:
variant.freqs <- read_tsv("AlleleFrequencies_Populations/ExAC_annotation_allChr/myanno_AllChr_ExAC38.hg38_multianno.txt")
returns:
Parsed with column specification:
cols(
  Chr = col_integer(),
  Start = col_integer(),
  End = col_integer(),
  Ref = col_character(),
  Alt = col_character(),
  ExAC_ALL = col_character(),
  ExAC_AFR = col_character(),
  ExAC_AMR = col_character(),
  ExAC_EAS = col_character(),
  ExAC_FIN = col_character(),
  ExAC_NFE = col_character(),
  ExAC_OTH = col_character(),
  ExAC_SAS = col_character(),
  Otherinfo = col_character()
)
Here is the error:
number of columns of result is not a multiple of vector length (arg 1)
152306 parsing failures.
# A tibble: 5 x 5
    row col   expected   actual     file
  <int> <chr> <chr>      <chr>      <chr>
1     1 NA    14 columns 24 columns 'AlleleFrequencies_Populations/ExAC_annotation_allChr/myanno_AllChr_ExAC38.hg38_multianno.txt'
2     2 NA    14 columns 24 columns 'AlleleFrequencies_Populations/ExAC_annotation_allChr/myanno_AllChr_ExAC38.hg38_multianno.txt'
3     3 NA    14 columns 24 columns 'AlleleFrequencies_Populations/ExAC_annotation_allChr/myanno_AllChr_ExAC38.hg38_multianno.txt'
4     4 NA    14 columns 24 columns 'AlleleFrequencies_Populations/ExAC_annotation_allChr/myanno_AllChr_ExAC38.hg38_multianno.txt'
5     5 NA    14 columns 24 columns 'AlleleFrequencies_Populations/ExAC_annotation_allChr/myanno_AllChr_ExAC38.hg38_multianno.txt'
View(variant.freqs)
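Answer 0 (score: 4)
From the sample data, the first line has 14 tab-separated fields and the second line has 24: there are not enough headers for the number of columns.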
> fl = "foo.txt"
> lengths(strsplit(readLines(fl, 2), "\t"))
[1] 14 24
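To check whether the mismatch affects the whole file rather than just the first two lines, one option is base R's count.fields(), which reports the number of fields on every line. The following is only a sketch: the temporary file and the shortened INFO string are made-up stand-ins for the real annotation file.

```r
# Hypothetical miniature of the file: a 14-name header over 24-field rows.
header <- paste(c("Chr", "Start", "End", "Ref", "Alt",
                  paste0("ExAC_", c("ALL", "AFR", "AMR", "EAS",
                                    "FIN", "NFE", "OTH", "SAS")),
                  "Otherinfo"), collapse = "\t")
row <- paste(c("1", "15847952", "15847952", "G", "C", rep(".", 8),
               ".", "241.9", "76196", "1", "15847952", ".", "G", "C",
               "241.9", "PASS", "AC=2;AF=0;AN=18332"), collapse = "\t")

fl <- tempfile(fileext = ".txt")
writeLines(c(header, row, row), fl)

# Count fields on every line. quote = "" and comment.char = "" stop quote
# characters and '#' from being treated specially.
n_fields <- count.fields(fl, sep = "\t", quote = "", comment.char = "")

# Tabulate the counts: one line with 14 fields, two lines with 24 here.
field_table <- table(n_fields)
print(field_table)
```

If the table shows a single count for all data lines (24 in the question), the problem is purely a short header, and supplying the missing column names is enough.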
In more detail:
> res = strsplit(readLines(fl, 2), "\t")
> res[[1]][14] # first line, final header
[1] "Otherinfo"
> res[[2]][14] # second line, entry in position 14
[1] "."
> res[[2]][15] # second line, entry in position 15
[1] "241.9"
> res[[2]][24] # second line, entry in position 24
[1] "AC=2;AF=0;AN=18332;BaseQRankSum=0.731;ClippingRankSum=-0.731;DP=76196;ExcessHet=3.1;FS=0;InbreedingCoeff=-0.0456;MLEAC=2;MLEAF=0;MQ=38.93;MQRankSum=0.515;NEGATIVE_TRAIN_SITE;QD=10.52;ReadPosRankSum=0.89;SOR=0.481;VQSLOD=-1.406;culprit=MQ"
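One way to act on this diagnosis is to skip the short header and hand read_tsv a full set of 24 column names. This is a sketch rather than a definitive fix: the names Otherinfo1 to Otherinfo11 are invented here for illustration, the temporary file is a one-row stand-in for the real data, and reading everything as character is a conservative assumption (for example, Chr may contain values such as X or Y that would not parse as integers).

```r
library(readr)

# Hypothetical one-row stand-in for the annotation file.
header <- paste(c("Chr", "Start", "End", "Ref", "Alt",
                  paste0("ExAC_", c("ALL", "AFR", "AMR", "EAS",
                                    "FIN", "NFE", "OTH", "SAS")),
                  "Otherinfo"), collapse = "\t")
row <- paste(c("1", "15847952", "15847952", "G", "C", rep(".", 8),
               ".", "241.9", "76196", "1", "15847952", ".", "G", "C",
               "241.9", "PASS", "AC=2;AF=0;AN=18332"), collapse = "\t")
fl <- tempfile(fileext = ".txt")
writeLines(c(header, row), fl)

# Compare the header width with the width of the first data line.
first_two    <- strsplit(readLines(fl, n = 2), "\t")
header_names <- first_two[[1]]
n_extra      <- length(first_two[[2]]) - length(header_names)

# Replace the single "Otherinfo" name with numbered names (made up here)
# so that every data column has a header.
all_names <- c(header_names[-length(header_names)],
               paste0("Otherinfo", seq_len(n_extra + 1)))

# Skip the original header line and use the extended names instead.
# "." is treated as missing; all columns are read as character.
variant.freqs <- read_tsv(fl, skip = 1, col_names = all_names,
                          col_types = cols(.default = col_character()),
                          na = ".")
```

Columns can be converted to numeric afterwards once it is clear which ones are safe to convert.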
Answer 1 (score: 0)
This is a workaround using read.table, but it works for the moment:
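library(tidyr)  # provides separate() and re-exports %>%
# Replace commas with tabs; read.table then splits on whitespace by default
test <- read.table(textConnection(gsub(",", "\t", readLines("C:/Users/.../Desktop/test.txt"))))
# Split the 24th column (the INFO string) on ";" into up to 19 columns
test_final <- test %>%
  separate(V24, paste0("V24_", 1:19), ";")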
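Note that splitting by position assumes every row has the same keys in the same order. In the sample data, some rows carry the flag NEGATIVE_TRAIN_SITE and others do not, so positional columns can end up misaligned. A possible alternative is to parse the string by key. The sketch below uses base R only; parse_info is a hypothetical helper, the INFO strings are shortened from the question, and valueless flags are stored as "TRUE" by assumption.

```r
# Hypothetical helper: turn "K1=V1;FLAG;K2=V2" into a named character vector.
parse_info <- function(info) {
  parts     <- strsplit(info, ";", fixed = TRUE)[[1]]
  has_value <- grepl("=", parts, fixed = TRUE)
  keys      <- sub("=.*$", "", parts)
  # Flags without "=" (e.g. NEGATIVE_TRAIN_SITE) are recorded as "TRUE".
  values    <- ifelse(has_value, sub("^[^=]*=", "", parts), "TRUE")
  setNames(values, keys)
}

# Shortened INFO strings modelled on the question's data.
info <- c("AC=2;AF=0;NEGATIVE_TRAIN_SITE;QD=10.52;culprit=MQ",
          "AC=2;AF=0;QD=1.55;culprit=QD")

parsed   <- lapply(info, parse_info)
all_keys <- unique(unlist(lapply(parsed, names)))

# One row per variant, one column per key; keys absent from a row become NA.
info_table <- as.data.frame(
  do.call(rbind, lapply(parsed, function(p) unname(p[all_keys]))),
  stringsAsFactors = FALSE
)
names(info_table) <- all_keys
print(info_table)
```

All values stay as character here; columns such as QD can be converted with as.numeric() afterwards.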