Python for loop iteration to merge multiple lines into one line

Time: 2014-01-15 18:49:53

Tags: python parsing loops csv merge

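I have a CSV file that I am trying to parse, but one of the cells contains a block of data full of nulls and line breaks. I need to collect each row into an array and merge all the content of this particular cell into its corresponding row. I recently posted a similar question, and the answer partly solved my problem, but I am having trouble building a loop that iterates over every line that does not meet a certain start condition. My code merges only the first line that does not meet the condition, and breaks after that.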

I have:

file ="myfile.csv"
condition = "DAT"

data = open(file).read().split("\n")
for i, line in enumerate(data):
    if not line.startswith(condition):
        data[i-1] = data[i-1]+line
        data.pop(i)
print data

For a CSV that looks like this:

Case  | Info
-------------------
DAT1    single line  
DAT2    "Berns, 17, died Friday of complications from Hutchinson-Gilford progeria   syndrome, commonly known as progeria. He was diagnosed with progeria when he was 22 months old. His physician parents founded the nonprofit Progeria Research Foundation after his diagnosis.

Berns became the subject of an HBO documentary, ""Life According to Sam."" The exposure has brought greater recognition to the condition, which causes musculoskeletal degeneration, cardiovascular problems and other symptoms associated with aging.

Kraft met the young sports fan and attended the HBO premiere of the documentary in New    York in October. Kraft made a $500,000 matching pledge to the foundation.

The Boston Globe reported that Berns was invited to a Patriots practice that month, and gave the players an impromptu motivational speech.

DAT3    single line
DAT4    YWYWQIDOWCOOXXOXOOOOOOOOOOO 

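It joins the full sentence to the previous line. But when it hits a double space or a blank line, it fails and registers the text as a new line. For example, if I print: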

data[0]

the output is:

DAT1    single line

If I print:

data[1]

the output is:

DAT2    "Berns, 17, died Friday of complications from Hutchinson-Gilford progeria syndrome, commonly known as progeria. He was diagnosed with progeria when he was 22 months old. His physician parents founded the nonprofit Progeria Research Foundation after his diagnosis.

But if I print:

data[2]

the output is:

Berns became the subject of an HBO documentary, ""Life According to Sam."" The exposure has brought greater recognition to the condition, which causes musculoskeletal degeneration, cardiovascular problems and other symptoms associated with aging.

instead of:

DAT3    single line

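How can I merge the full text of the "Info" column so that it always stays with its corresponding DAT row, instead of popping up as a new row, regardless of blank lines or newline characters?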

2 answers:

Answer 0 (score: 0)

Modifying data while iterating over it is a bad idea. Build a new list instead:

new_data = []
for line in data:
    if not new_data or line.startswith(condition):
        new_data.append(line)
    else:
        new_data[-1] += line
print new_data
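To see why the loop in the question breaks, here is a minimal Python 3 sketch. The sample list and both function names are made up for illustration and are not from the original post. Popping index i while enumerate keeps advancing means the element that slides into slot i is never examined, so consecutive continuation lines are skipped.

```python
def merge_in_place(lines, prefix):
    """The question's approach: mutate the list while enumerating it."""
    lines = list(lines)  # work on a copy so the caller's list is untouched
    for i, line in enumerate(lines):
        if not line.startswith(prefix):
            lines[i - 1] = lines[i - 1] + line
            lines.pop(i)  # the next element shifts into slot i and is skipped
    return lines


def merge_into_new_list(lines, prefix):
    """The approach from this answer: append to a separate list."""
    merged = []
    for line in lines:
        if not merged or line.startswith(prefix):
            merged.append(line)   # start a new record
        else:
            merged[-1] += line    # continuation of the current record
    return merged


# Hypothetical sample: "c", "" and "d" all belong to the DAT2 record.
sample = ["DAT1 a", "DAT2 b", "c", "", "d", "DAT3 e"]
print(merge_in_place(sample, "DAT"))
print(merge_into_new_list(sample, "DAT"))
```

In this sample the in-place version leaves "d" stranded as its own element, while the new-list version folds every continuation line into the DAT2 record.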

Answer 1 (score: 0)

You can split the text directly into data with a regular expression.

Python

import re

f = open("myfile.csv")
text = f.read()
data = re.findall(r"\n(DAT\d+.*)", text)

Correct me if this doesn't help.

Update

I believe this solves the problem with the newlines:

import re

f = open("myfile.csv")
text = f.read()
lines = re.split(r"\n(?=DAT\d+)", text)
lines.pop(0)
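As a rough illustration of the difference between the two snippets in this answer, the Python 3 sketch below runs both patterns against a small in-memory string. The string is a hypothetical stand-in for the contents of myfile.csv, and the names first_lines and records are invented for this example.

```python
import re

# Hypothetical in-memory stand-in for the contents of myfile.csv.
text = (
    "Case  | Info\n"
    "-------------------\n"
    "DAT1    single line\n"
    'DAT2    "first paragraph\n'
    "\n"
    'second paragraph"\n'
    "DAT3    single line\n"
)

# findall: "." does not match a newline, so only the first physical line
# of each record is captured.
first_lines = re.findall(r"\n(DAT\d+.*)", text)

# split: cut only at newlines that are followed by a DAT marker, so blank
# lines and continuation lines stay inside their record.
records = re.split(r"\n(?=DAT\d+)", text)
records.pop(0)  # drop the header chunk that precedes the first DAT row

print(first_lines)
print(records)
```

One caveat under these assumptions: a continuation line that itself begins with "DAT" followed by a digit would be treated as the start of a new record, and the last record keeps the file's trailing newline unless it is stripped.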