Extracting lines from a multi-line string with several variable-length sections

Time: 2016-07-25 21:38:18

Tags: python arrays string

I am working with a pandas DataFrame in which each row holds a large block of plain text. The text blocks have the following format:

Year 1
... (variable # of lines)
7. Stuff
... (variable # of lines, can be 0)
TOTAL Stuff
(single line, numeric)
... (variable # of lines)
Services 
(single line)
... (variable # of lines)
Year 2
... (same format as prev)
<repeats for n years>
TOTAL
... (same format as years)
Justification
... (variable number of lines)
<repeat m times>

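I am trying to extract the plain text under the "7. Stuff" and "Justification" headings, along with the numeric value of "TOTAL Stuff". My current code splits the text on newlines into a list and iterates over it, but that feels inefficient. My current implementation also only works when there is a single Years -> TOTAL -> Justification cycle (not m of them).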

Here is my parse_text function. Any help making it more "pythonic", or simply more efficient, would be much appreciated.

import pandas as pd

def is_number(s):
    # Helper not shown in the original post; assumed to accept values like "1,234".
    try:
        float(s.replace(',', ''))
        return True
    except ValueError:
        return False

def parse_budget_text(row):
    stuff_value = 0
    stuff_txt = ''
    justification_txt = ''
    text = row['text_raw']
    # split into stripped lines and walk them with one shared iterator
    line_iter = iter([line.strip() for line in text.split("\n")])
    cumulative_flag = False
    justification_flag = False
    for line in line_iter:
        # find each yearly section (headings spelled as in the format shown above)
        if line.startswith("Year"):
            while not line.startswith("7. Stuff"):
                line = next(line_iter)
            line = next(line_iter)
            # collect text lines up to "Services", skipping totals and amounts
            while not line.startswith("Services"):
                if line and ("TOTAL Stuff" not in line) and (not is_number(line)) and (line[0] != "$"):
                    stuff_txt += line + '; '
                line = next(line_iter)
        # find total summary
        elif line.startswith("TOTAL"):
            cumulative_flag = True
            while not line.startswith("TOTAL Stuff"):
                line = next(line_iter)
            stuff_value += int(next(line_iter).replace(',', ''))
        # find Justification line
        elif line.startswith("Justification") and cumulative_flag:
            justification_flag = True
        # extract justification
        elif justification_flag:
            justification_txt += line
    return pd.Series({'raw_text': text, 'Stuff_val': stuff_value, 'Stuff_txt': stuff_txt, 'Justification_txt': justification_txt})
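
For the repeated-cycle case (m cycles rather than one), one possible direction is a single-pass state machine that never calls next() inside the loop, so a heading can always switch the current section. The following is only a rough sketch under assumptions taken from the format description above: headings are spelled exactly "Year N", "7. Stuff", "TOTAL Stuff", "Services", "TOTAL" and "Justification"; the amount after "TOTAL Stuff" is a whole number; and a justification runs until the next "Year N" line. The names extract_sections, parse_amount, YEAR_RE and AMOUNT_RE are hypothetical and not part of the original code.

```python
import re

# Hypothetical patterns; adjust them if the real headings differ.
YEAR_RE = re.compile(r"^Year\s+\d+", re.IGNORECASE)
AMOUNT_RE = re.compile(r"^\$?[\d,]+(\.\d+)?$")


def parse_amount(line):
    """Turn a line such as '1,234' or '$1,234' into an int."""
    return int(line.replace("$", "").replace(",", ""))


def extract_sections(text):
    stuff_lines = []      # text found under "7. Stuff" in year sections
    justifications = []   # one list of lines per Justification section
    stuff_total = 0       # sum of "TOTAL Stuff" amounts in TOTAL sections
    section = None        # "year", "total" or "justification"
    in_stuff = False      # True between "7. Stuff" and "TOTAL Stuff"/"Services"
    expect_amount = False # True on the line right after "TOTAL Stuff"

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if expect_amount:
            expect_amount = False
            # Only the summary section contributes to the total.
            if section == "total":
                stuff_total += parse_amount(line)
            continue
        if YEAR_RE.match(line):
            section, in_stuff = "year", False
        elif line == "TOTAL":
            section, in_stuff = "total", False
        elif line.startswith("Justification"):
            section, in_stuff = "justification", False
            justifications.append([])
        elif section == "justification":
            justifications[-1].append(line)
        elif line.startswith("7. Stuff"):
            in_stuff = True
        elif line.startswith("TOTAL Stuff"):
            in_stuff = False
            expect_amount = True
        elif line.startswith("Services"):
            in_stuff = False
        elif in_stuff and section == "year" and not AMOUNT_RE.match(line):
            stuff_lines.append(line)

    return {
        "Stuff_val": stuff_total,
        "Stuff_txt": "; ".join(stuff_lines),
        "Justification_txt": [" ".join(j) for j in justifications],
    }
```

On a DataFrame this could be applied with something like df['text_raw'].apply(extract_sections), which gives one dict per row. Because every line is classified exactly once, a missing heading leaves a field empty instead of raising StopIteration.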

0 answers:

No answers