This question is related to the following question:
How to parse tab-delimited data (of different formats) into a data.table/data.frame?
I have a malformed text file whose tab-delimited lines look like this:
A 1092 - 1093 + 1X
B 1093 HRDCPMRFYT
A 1093 + 1094 - 1X
B 1094 BSZSDFJRVF
A 1094 + 1095 + 1X
B 1095 SSTFCLEPVV
...
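However, the text file also contains a few long lines that are technically tab-delimited but consist of a single long string, for example the 'Z' and 'Y' lines here: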
Z FX:E:4.2
Y 23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M
A 1092 - 1093 + 1X
B 1093 HRDCPMRFYT
A 1093 + 1094 - 1X
B 1094 BSZSDFJRVF
A 1094 + 1095 + 1X
B 1095 SSTFCLEPVV
...
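The Y 23434M,23434M,... portion of this text file can be several GB long.
These lines are extremely rare and are marked only by a leading Z or Y. So far I have been opening the file in a text editor and deleting these lines by hand.
That is not a sound approach algorithmically, though. Is there a way to parse this file so that it (1) uses only the A and B lines, or (2) explicitly skips the Z and Y lines?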
Edit: To clarify, Z is not a long string; only 'Y' is a long string here. Z is a string of the form X XX:X:0.0, where X is a character and 0 is an integer.
Answer 0 (score: 3)
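You can make a system call to a pattern-based tool such as sed to fix the file. If you want to delete every line that starts with Z or Y, you just pass the regular expression followed by /d: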
system("sed -i '/^[ZY]/d' test.tab")
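The command above deletes, in place, every line in your file that starts with Z or Y. You can then run the same code I posted for your previous question: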
library(data.table)
fread("sed '$!N;s/\\n/ /' test.tab")
# V1 V2 V3 V4 V5 V6 V7 V8
# 1: A 1092 - 1093 + 1X B 1093 HRDCPMRFYT
# 2: A 1093 + 1094 - 1X B 1094 BSZSDFJRVF
# 3: A 1094 + 1095 + 1X B 1095 SSTFCLEPVV
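If you would rather not modify the file on disk, the filtering and the pair-joining can probably be chained in one command passed to fread. The following is a minimal sketch rather than the answer's original method: it assumes grep and sed are available on the system, that A and B lines strictly alternate, and it uses a small temporary file (demo_file, a hypothetical name) in place of the real data.

```r
library(data.table)

# Small stand-in for the real file, with the same layout as the example data.
demo_lines <- c(
  "Z\tFX:E:4.2",
  "Y\t23434M,23434M,23434M",
  "A\t1092\t-\t1093\t+\t1X",
  "B\t1093\tHRDCPMRFYT",
  "A\t1093\t+\t1094\t-\t1X",
  "B\t1094\tBSZSDFJRVF"
)
demo_file <- tempfile(fileext = ".tab")  # hypothetical path, for illustration only
writeLines(demo_lines, demo_file)

# grep keeps only the lines starting with A or B (option 1 in the question);
# sed then joins each A line with the B line that follows it.
# The file itself is never rewritten.
cmd <- paste("grep '^[AB]'", shQuote(demo_file), "| sed '$!N;s/\\n/ /'")
dt <- fread(cmd = cmd, sep = "\t", header = FALSE)
```

Replacing the grep pattern with grep -v '^[ZY]' would give option (2) from the question instead, that is, explicitly dropping the Z and Y lines.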
Data
text <- "Z FX:E:4.2
Y 23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M,23434M
A 1092 - 1093 + 1X
B 1093 HRDCPMRFYT
A 1093 + 1094 - 1X
B 1094 BSZSDFJRVF
A 1094 + 1095 + 1X
B 1095 SSTFCLEPVV"
# Saving it as tab separated file on disk
write(gsub(" +", "\t", text), file = "test.tab")
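On systems without sed or grep, the same filtering can be sketched in plain R by reading the file through a connection in chunks and keeping only the A and B lines. This is an illustrative sketch, not part of the original answer: keep_ab_lines and demo_file are hypothetical names, and it assumes each line fits in a single R string, which may not hold for a Y line that is several GB long.

```r
library(data.table)

# Small stand-in for the real file, with the same layout as the example data.
demo_file <- tempfile(fileext = ".tab")  # hypothetical path, for illustration only
writeLines(c(
  "Z\tFX:E:4.2",
  "Y\t23434M,23434M,23434M",
  "A\t1092\t-\t1093\t+\t1X",
  "B\t1093\tHRDCPMRFYT",
  "A\t1093\t+\t1094\t-\t1X",
  "B\t1094\tBSZSDFJRVF"
), demo_file)

# Hypothetical helper: read the file chunk by chunk and keep only the lines
# whose first field is A or B.
keep_ab_lines <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  kept <- list()
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break
    kept[[length(kept) + 1L]] <- chunk[grepl("^[AB]\t", chunk)]
  }
  unlist(kept)
}

ab <- keep_ab_lines(demo_file)

# A and B lines have different numbers of fields, so parse them separately.
a_rows <- fread(text = ab[startsWith(ab, "A")], sep = "\t", header = FALSE)
b_rows <- fread(text = ab[startsWith(ab, "B")], sep = "\t", header = FALSE)
```

If the A and B lines strictly alternate, row i of a_rows corresponds to row i of b_rows, so the two tables can be combined column-wise after renaming the columns.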