Question

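I am new to coding and trying to figure out how to fix a broken csv file so that I can work with it properly.

The file was exported from a case management system and contains fields such as user, case, hours, note and date.

The problem is that the notes occasionally contain newlines, and when exporting the csv the tool does not wrap the note in quotation marks to mark it as a single string within the field.

See the example below: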

user;case;hours;note;date;
tnn;123;4;solved problem;2017-11-27;
tnn;124;2;random comment;2017-11-27;
tnn;125;3;I am writing a comment
that contains new lines
without quotation marks;2017-11-28;
HJL;129;8;trying to concatenate lines to re form the broken csv;2017-11-29;

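I would like to concatenate lines 3, 4 and 5 to show the following: tnn;125;3;I am writing a comment that contains new lines without quotation marks;2017-11-28;

Since each line starts with a username (always 3 letters), I thought I could iterate over the lines, find the ones that do not start with a username, and concatenate them with the previous line. It is not working as expected, though.

This is what I have so far: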

import re

with open('Rapp.txt', 'r') as f:

 for line in f:
  previous = line #keep current line in variable to join next line
  if not re.match(r'^[A-Za-z]{3}', line): #regex to match 3 letters
   print(previous.join(line))

The script produces no output and just finishes silently. Any ideas?

Answer 1

I think I would go about it a slightly different way:

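import re

with open('Rapp.txt', 'r') as f:
    # Keep the header as is: it does not end with a date, so the test
    # below would otherwise glue it onto the first record.
    all_the_data = next(f, "")
    for line in f:
        # A complete record ends with a date followed by ';'.
        if not re.search(r"\d{4}-\d{1,2}-\d{1,2};\n", line):
            # Incomplete record: swap the newline for a space so the
            # next line continues on the same row.
            line = line.replace("\n", " ")
        all_the_data = "".join([all_the_data, line])
print(all_the_data)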

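There are several ways to do this, each with pros and cons, but I think this one is simple.

Loop over the file as you already do and, if the line does not end with a date followed by ';', take off its trailing newline before adding it to all_the_data. That way you do not have to look back 'up' the file. Again, there are lots of ways to do this. If you would rather use the "starts with 3 letters and a ';'" logic and look back, this works: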

import re

all_the_data = ""

with open('Rapp.txt', 'r') as f:
    for line in f:
        # A line that does not start with 3 letters and ';' continues
        # the previous record.
        if not re.search(r"^[A-Za-z]{3};", line):
            # Replace the previous line's trailing newline with a space.
            all_the_data = re.sub(r"\n$", " ", all_the_data)
        all_the_data = "".join([all_the_data, line])

print("results:")
print(all_the_data)

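That is pretty much what was asked for. The logic is: if the current line does not start correctly, take the previous line's trailing newline out of all_the_data.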
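The look-back logic above can also be wrapped in a small function that works on any iterable of lines, which makes it easy to try on the sample data without touching a file. This is only a sketch: the name `merge_broken_lines` is made up for illustration, and it assumes that every real record starts with a 3-letter username followed by `;`, as in the question.

```python
import re

# Assumption: a genuine record starts with exactly three letters and a ';'.
RECORD_START = re.compile(r"^[A-Za-z]{3};")

def merge_broken_lines(lines):
    """Return a list of records, gluing continuation lines onto the previous one."""
    records = []
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        # The first line (the header) is always kept as its own record.
        if i == 0 or RECORD_START.match(line):
            records.append(line)
        else:
            # Continuation of the previous note: join with a single space.
            records[-1] = records[-1] + " " + line
    return records

sample = [
    "user;case;hours;note;date;\n",
    "tnn;123;4;solved problem;2017-11-27;\n",
    "tnn;125;3;I am writing a comment\n",
    "that contains new lines\n",
    "without quotation marks;2017-11-28;\n",
    "HJL;129;8;trying to concatenate lines to re form the broken csv;2017-11-29;\n",
]

for record in merge_broken_lines(sample):
    print(record)
```

A continuation line that itself happens to start with three letters and a `;` would be treated as a new record, so this heuristic is only as reliable as that assumption.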

If you need help experimenting with the regex itself, this site is great: http://regex101.com

Answer 2

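The regex in your code matches every line (string) in the txt file: the pattern only asks for three leading letters, so continuation lines such as "that contains new lines" are valid matches too. The if condition therefore never becomes true, and nothing is printed.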
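To see this concretely, the short check below runs the question's pattern against a continuation line from the sample data, next to a stricter pattern that also requires the `;` after the three letters. It also shows what `str.join` does, since `previous.join(line)` uses `previous` as a separator between the characters of `line` rather than appending one string to the other. The variable names are only for illustration.

```python
import re

continuation = "that contains new lines\n"

# The question's pattern: any three leading letters are enough to match.
loose = re.match(r'^[A-Za-z]{3}', continuation)
# Stricter pattern: three letters immediately followed by ';'.
strict = re.match(r'^[A-Za-z]{3};', continuation)

print(loose is not None)   # does the continuation line match the loose pattern?
print(strict is not None)  # does it match the strict one?

# str.join inserts the separator between the items of its argument,
# so here '-' ends up between every character of 'abc'.
joined = '-'.join('abc')
print(joined)
```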

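with open('./Rapp.txt', 'r') as f:
    join_words = []

    for line in f:
        line = line.strip()
        # A new record has its first ';' within the first four characters
        # (a 3-letter username followed by ';').
        if len(line) > 3 and ";" in line[0:4] and len(join_words) > 0:
            # Flush the previous record, joining its pieces with spaces
            # so that no extra ';' fields are introduced.
            print(' '.join(join_words))
            join_words = []
        join_words.append(line)

    # Print the final record.
    print(' '.join(join_words))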

I tried not to use regex here to keep things as clear as possible. However, regex would be the better option.

Answer 3

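A simple way is to use a generator that acts as a filter over the original file. The filter concatenates a line to the previous one if its 4th character is not a semicolon (;). The code could be: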

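def preprocess(fd):
    previous = next(fd)  # the header line
    for line in fd:
        # Slicing avoids an IndexError on lines shorter than 4 characters.
        if line[3:4] == ';':
            yield previous
            previous = line
        else:
            # Continuation line: glue it onto the pending record.
            previous = previous.strip() + " " + line
    yield previous  # don't forget last line!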

Then you can use:

import csv

with open('test.txt') as fd:
    # The export uses ';' as its field separator.
    rd = csv.DictReader(preprocess(fd), delimiter=';')
    for row in rd:
        ...

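The trick here is that the csv module only requires an object that returns one line each time the next function is applied to it, so a generator is appropriate.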
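As a quick illustration of that point, the sketch below feeds `csv.DictReader` from a generator instead of a file. The generator `repaired_lines` is a compact stand-in written for this example (not the `preprocess` function above), the input is an in-memory `io.StringIO` holding part of the question's sample data, and `delimiter=';'` is passed because the export uses semicolons.

```python
import csv
import io

raw = io.StringIO(
    "user;case;hours;note;date;\n"
    "tnn;125;3;I am writing a comment\n"
    "that contains new lines\n"
    "without quotation marks;2017-11-28;\n"
    "HJL;129;8;trying to concatenate lines to re form the broken csv;2017-11-29;\n"
)

def repaired_lines(fd):
    """Yield complete records; a line whose 4th character is not ';' is a continuation."""
    previous = next(fd, None)
    if previous is None:
        return  # empty input: nothing to yield
    for line in fd:
        if line[3:4] == ';':
            yield previous
            previous = line
        else:
            previous = previous.strip() + " " + line
    yield previous

# DictReader pulls lines from the generator one at a time.
rows = list(csv.DictReader(repaired_lines(raw), delimiter=';'))
for row in rows:
    print(row['user'], row['case'], row['note'])
```

Because every exported line ends with a trailing `;`, each row also carries one extra empty field, which can simply be ignored.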

But this is only a workaround; the correct fix is for the previous step to produce a valid CSV file directly.
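For comparison, this is roughly what a well-formed export would look like. The sketch below uses Python's `csv.writer` purely as an illustration (the case management system's actual export tool is unknown): a field containing a newline is wrapped in quotes, and `csv.reader` then reads it back as a single field.

```python
import csv
import io

note = "I am writing a comment\nthat contains new lines\nwithout quotation marks"

# newline='' leaves line endings untouched, as the csv module recommends.
buffer = io.StringIO(newline='')
writer = csv.writer(buffer, delimiter=';')  # default quoting is QUOTE_MINIMAL
writer.writerow(["user", "case", "hours", "note", "date"])
writer.writerow(["tnn", "125", "3", note, "2017-11-28"])

exported = buffer.getvalue()
print(exported)

# Reading it back: the quoted note survives as one field, newlines included.
rows = list(csv.reader(io.StringIO(exported, newline=''), delimiter=';'))
```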

Concatenate lines with the previous line based on the number of letters in the first column
