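Cleaning a tab-delimited file with unescaped newlines

Time: 2013-10-30 06:04:06

Tags: python regex r data-cleansing

I have a tab-delimited file in which one column occasionally contains unescaped newlines (enclosed in quotes):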

   JOB  REF Comment V2  Other
1   3   45  This was a small job    NULL    sdnsdf
2   4   456 This was a large job and I have to go onto a new line, 
    but I didn't properly escape so it's on the next row whoops!    NULL    NULL        
3   7   354 NULL    NULL    NULL

# dat <- readLines("the-Dirty-Tab-Delimited-File.txt")
dat <- c("\tJOB\tREF\tComment\tV2\tOther", "1\t3\t45\tThis was a small job\tNULL\tsdnsdf", 
"2\t4\t456\tThis was a large job and I have\t\t", "\t\"to go onto a new line, but I didn't properly escape so it's on the next row whoops!\"\tNULL\tNULL\t\t", 
"3\t7\t354\tNULL\tNULL\tNULL")

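I know this may not be possible, but these bad newlines occur in only one field (the 10th column). I'm interested in a solution in R (preferred) or Python.

My idea was to write a regular expression that looks for a newline after 10, and only 10, tabs. I started by using readLines and trying to remove every newline that follows a space plus a word: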

dat <- gsub("( [a-zA-Z]*)\t\n", "\\1", dat)

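But it seems hard to undo the line structure that readLines imposes. What should I do?

Edit: Sometimes two newlines occur in a row (i.e. the user put a blank line between paragraphs in the comment field). Here is an example (the desired result is a single row):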

140338  28855   WA  2   NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    1   NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    1000    NULL    NULL    NULL    NULL    NULL    NULL    YNNNNNNN    (Some text with two newlines)

The remainder of the text beneath two newlines  NULL    NULL    NULL    3534a   NULL    email   NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL

2 answers:

Answer 0 (score: 1)

Here is my answer in Python.

import re

# This pattern should match correct data lines and should not
# match "continuation" lines (lines added by the unquoted newline).
# This pattern means: start of line, then a number, then white space,
# then another number, then more white space, then another number.

# This program won't work right if this pattern isn't correct.
pat = re.compile(r"^\d+\s+\d+\s+\d+")

def collect_lines(iterable):
    itr = iter(iterable)  # get an iterator

    # First, loop until we find a valid line.
    # This will skip the first line with the "header" info.
    cur = None
    for line in itr:
        if pat.match(line):
            # found a valid line; hold it as cur
            cur = line
            break
    if cur is None:
        return  # no valid line at all, so there is nothing to yield
    for line in itr:
        # Look at the line after cur.  Is it a valid line?
        if pat.match(line):
            # Line after cur is valid!
            yield cur  # output cur
            cur = line  # hold new line as new cur
        else:
            # Line after cur is not valid; append to cur but do not output yet.
            cur = cur.rstrip('\r\n') + line
    yield cur

data = """\
   JOB  REF Comment V2  Other
@@@1   3   45  This was a small job    NULL    sdnsdf
@@@2   4   456 This was a large job and I have to go onto a new line, 
@@@    but I didn't properly escape so it's on the next row whoops!    NULL    NULL        
@@@3   7   354 NULL    NULL    NULL
"""

lines = data.split('@@@')
for line in collect_lines(lines):
    print(">>>{}<<<".format(line))

For your real program:

with open("filename", "rt") as f:
    for line in collect_lines(f):
        # do something with each line
        pass

Edit: I rewrote this and added more comments. I also think I fixed the problem you were seeing.

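When I joined a line onto cur, I was not first stripping the newline from the end of cur. So the joined line was still a split line, and writing it back to a file didn't really fix anything. Try it now.

I also reworked the test data so that the test lines have newlines on them. My original test split the input on newlines, so the split lines didn't contain any newlines. Now each line ends with a newline.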
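
To make the point about stripping the newline concrete, here is a minimal, self-contained sketch. It re-declares a condensed version of the generator above under the made-up name `join_continuations`, and the sample rows are invented for illustration. Every input line still carries its trailing newline, and the sample includes the blank-line case from the question's edit.

```python
import re

# Same assumption as above: a real record starts with three numeric fields.
pat = re.compile(r"^\d+\s+\d+\s+\d+")

def join_continuations(lines):
    """Yield records, gluing non-matching lines onto the previous record."""
    cur = None
    for line in lines:
        if pat.match(line):
            if cur is not None:
                yield cur
            cur = line
        elif cur is not None:
            # Strip the newline from cur first, otherwise the record stays split.
            cur = cur.rstrip("\r\n") + line
    if cur is not None:
        yield cur

# Hypothetical sample; every line ends in "\n", as lines read from a file do.
sample = [
    "\tJOB\tREF\tComment\n",
    "1\t3\t45\tsmall job\n",
    "2\t4\t456\tfirst paragraph\n",
    "\n",
    "second paragraph\tNULL\n",
    "3\t7\t354\tNULL\n",
]

records = list(join_continuations(sample))
for rec in records:
    print(repr(rec))
```

Note that the pieces are concatenated with no separator, so you may want to insert a space when joining if the comment text should stay readable.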

Answer 1 (score: 1)

No regular expression is needed.

with open("filename", "r") as data:
    datadict={}
    for count,linedata in enumerate(data):
        datadict[count]=linedata.rstrip('\r\n').split('\t') #strip the newline so it is not kept inside the last field

extra_line_numbers=[]
for count,x in enumerate(datadict):
    if count==0: #get rid of the first line
        continue
    if not datadict[count][1].isdigit(): #if item #2 isn't a number
        datadict[count-1][3]=datadict[count-1][3]+datadict[count][1]
        datadict[count-1][4:6]=(datadict[count][2],datadict[count][3])
        extra_line_numbers.append(count)

for x in extra_line_numbers:
    del(datadict[x])

with open("newfile",'w') as data:
    data.writelines(['\t'.join(x)+'\n' for x in datadict.values()])
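
A variation on the same no-regex idea, offered only as a sketch: instead of testing whether a field is numeric, keep joining physical lines until the record has as many tab-separated fields as the header. This assumes the header row is well formed, that continuation lines add no extra tabs, and that stray newlines never fall inside the last field. The name `merge_by_field_count` and the sample data are hypothetical.

```python
def merge_by_field_count(lines):
    """Join physical lines until each record has as many fields as the header."""
    lines = iter(lines)
    first = next(lines, None)
    if first is None:
        return  # empty input
    header = first.rstrip("\r\n")
    expected = header.count("\t") + 1  # assumption: the header is well formed
    yield header

    buf = ""
    for line in lines:
        # Blank lines contribute nothing; partial lines accumulate in buf.
        buf += line.rstrip("\r\n")
        if buf.count("\t") + 1 >= expected:
            yield buf
            buf = ""
    if buf:
        yield buf  # incomplete trailing record, passed through unchanged

# Hypothetical sample with a record split across three physical lines.
sample = [
    "ID\tJOB\tREF\tComment\tV2\tOther\n",
    "1\t3\t45\tsmall job\tNULL\tsdnsdf\n",
    "2\t4\t456\tlarge job, \n",
    "\n",
    "continued here\tNULL\tNULL\n",
    "3\t7\t354\tNULL\tNULL\tNULL\n",
]

merged = list(merge_by_field_count(sample))
for row in merged:
    print(repr(row))
```

The yielded rows carry no trailing newline, so add one per row when writing them back out.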