I am working with a pandas DataFrame in which each row contains a large block of plain text. The text block has the following format:
Year 1
... (variable # of lines)
7. Stuff
... (variable # of lines, can be 0)
TOTAL Stuff
(single line, numeric)
... (variable # of lines)
Services
(single line)
... (variable # of lines)
Year 2
... (same format as prev)
<repeats for n years>
TOTAL
... (same format as years)
Justification
... (variable number of lines)
<repeat m times>
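I am trying to extract the plain text under the "7. Stuff" and "Justification" headings, as well as the numeric value of "TOTAL Stuff". My current code splits the text on newlines into a list and iterates over it, but I feel this is inefficient. My current implementation also only works when there is a single Years -> TOTAL -> Justification cycle (not m of them).
Here is my parse_text function. Any help making it more "pythonic", or just more efficient in general, would be greatly appreciated.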
import pandas as pd

def is_number(s):
    # Assumed helper (not shown in the original post): True for strings like "1,234"
    try:
        float(s.replace(',', ''))
        return True
    except ValueError:
        return False

def parse_budget_text(row):
    stuff_value = 0
    stuff_txt = ''
    justification_txt = ''
    # ensure text is not hidden within a list
    text = row['text_raw']
    # parse and sum equipment lines
    line_iter = iter([line.strip() for line in text.split("\n")])
    cumulative_flag = False
    justification_flag = False
    for line in line_iter:
        # find each yearly section (the headings appear as "Year 1" above, so ignore case)
        if line.upper().startswith("YEAR"):
            while not line.startswith("7. Stuff"):
                line = next(line_iter)
            line = next(line_iter)
            while not line.startswith("Services"):
                # keep only non-empty descriptive lines, not totals or amounts
                if line and ("TOTAL Stuff" not in line) and (not is_number(line)) and (not line.startswith("$")):
                    stuff_txt += line + '; '
                line = next(line_iter)
        # find total summary
        elif line.startswith("TOTAL"):
            cumulative_flag = True
            while not line.startswith("TOTAL Stuff"):
                line = next(line_iter)
            stuff_value += int(next(line_iter).replace(',', ''))
        # find Justification line
        elif line.startswith("Justification") and cumulative_flag:
            justification_flag = True
        # extract justification
        elif justification_flag:
            justification_txt += line
    return pd.Series({'raw_text': text, 'Stuff_val': stuff_value, 'Stuff_txt': stuff_txt, 'Justification_txt': justification_txt})
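
For the case of m repeated cycles, one direction that might work is a small state machine over the lines instead of nested `next()` calls. The following is only a sketch under several assumptions, not a tested solution for the real data: the names `parse_cycles`, `YEAR_RE`, and `SAMPLE` are hypothetical, year headings are assumed to look exactly like "Year 1", the TOTAL heading is assumed to be a line containing only "TOTAL", the "7. Stuff" text is taken to be the lines between "7. Stuff" and "TOTAL Stuff" in each year, and a new "Year N" heading after a Justification section is assumed to start the next cycle.

```python
import re

# Assumed heading shape: a line that is exactly "Year <number>" (any case).
YEAR_RE = re.compile(r"^year\s+\d+$", re.IGNORECASE)

def parse_cycles(text):
    """Return one dict per Years -> TOTAL -> Justification cycle."""
    cycles = []
    cycle = None
    state = None  # "year", "stuff", "total", "total_value" or "justification"
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if YEAR_RE.match(line):
            # A year heading that follows a Justification opens a new cycle.
            if cycle is None or state == "justification":
                cycle = {"stuff_txt": [], "stuff_val": None, "justification": []}
                cycles.append(cycle)
            state = "year"
        elif cycle is None:
            continue  # ignore anything before the first year heading
        elif state == "justification":
            cycle["justification"].append(line)
        elif state == "total_value":
            # Raises ValueError if the line after "TOTAL Stuff" is not numeric.
            cycle["stuff_val"] = float(line.replace("$", "").replace(",", ""))
            state = "total"
        elif line == "TOTAL":
            state = "total"
        elif line.startswith("Justification") and state == "total":
            state = "justification"
        elif line.startswith("TOTAL Stuff"):
            # In the TOTAL section the next line is the amount;
            # in a year section this only ends the "7. Stuff" text.
            state = "total_value" if state == "total" else "year"
        elif line.startswith("7. Stuff") and state == "year":
            state = "stuff"
        elif state == "stuff":
            cycle["stuff_txt"].append(line)
    return [
        {
            "stuff_txt": "; ".join(c["stuff_txt"]),
            "stuff_val": c["stuff_val"],
            "justification": " ".join(c["justification"]),
        }
        for c in cycles
    ]

# Made-up sample text following the format described in the question.
SAMPLE = """\
Year 1
Personnel
7. Stuff
Laptop
Cables
TOTAL Stuff
1,500
Services
None
Year 2
7. Stuff
TOTAL Stuff
0
Services
None
TOTAL
7. Stuff
TOTAL Stuff
1,500
Services
None
Justification
Laptops are needed for field work.
Year 1
7. Stuff
Printer
TOTAL Stuff
300
Services
None
TOTAL
7. Stuff
TOTAL Stuff
300
Justification
A printer is needed.
"""

if __name__ == "__main__":
    for parsed in parse_cycles(SAMPLE):
        print(parsed)
```

On the DataFrame, something like `df['text_raw'].apply(parse_cycles)` would then give a list of per-cycle dicts for each row. Each line is visited once, so nothing is re-scanned, and a missing heading leaves a field empty instead of raising StopIteration.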