Python: split text by keyword into Excel rows

Time: 2018-11-08 12:58:52

Tags: python regex text extract sentiment-analysis

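I am new to programming. I have found many helpful threads, but none that is quite what I need.
I have a text file that looks like the sample shown further below.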

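As output, I would like the body of each article saved on a new row (one cell per article body) in a single file (I have about 5000 articles to process). The output would be 5000 rows and 1 column. From what I have found, "re" seems to be the best solution. The recurring keywords are BODY: and perhaps DOCUMENTS. How can I extract only the text between these keywords, for each article, into a new row in Excel?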

  1 of 5000 DOCUMENTS


                    Copyright 2010 The Deal, L.L.C.
                          All Rights Reserved
                          Daily Deal/The Deal

                        January 12, 2010 Tuesday

HEADLINE: Cadbury slams Kraft bid

BODY:

  On cue .....

......

body of article here

......

DEAL SIZE

$ 10-50 Billion

                            2 of 5000 DOCUMENTS


                    Copyright 2015 The Deal, L.L.C.
                          All Rights Reserved
                           The Deal Pipeline

                      September 17, 2015 Thursday

HEADLINE: Perrigo rejects formal offer from Mylan

BODY: 
(and here again the body of this article)

DEAL SIZE

Or something like this?

import re
inputtext = 'F:\text.txt'
re.split(r'\n(?=BODY:)', inputtext)

I am a bit lost about where to look. Thanks in advance!

Edit: thanks to ewwink, this is where I am now:

section = []
for line in open_file_object:
    if line.startswith('BODY:'):
        # new section
        if section:
            process_section(section)
        section = [line]
    else:
        section.append(line)
if section:
    process_section(section)

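1 Answer:

Answer 0: (score: 1)

Open the file with open(r'F:\text.txt', mode), where mode is 'r' for reading or 'w' for writing (the raw string keeps \t in the path from being read as a tab character). Extract the content with re.findall, and finally escape \n, double quotes ("), and any other special characters before writing.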

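import re

articlesBody = None
# Read the whole file so the pattern can span multiple lines.
with open('text.txt', 'r') as txt:
  inputtext = txt.read()
  # Capture everything after "BODY:" up to the next "N of 5000" header,
  # or up to the end of the file for the last article.
  articlesBody = re.findall(r'BODY:(.+?)(?=\d+\sof\s5000|\Z)', inputtext, re.S)

#print(articlesBody)

with open('result.csv', 'w') as csv:
  for item in articlesBody:
    # Escape newlines and double quotes so each body stays in one quoted cell.
    item = item.strip().replace('\n', '\\n').replace('"', '""')
    # Write one article per row.
    csv.write('"%s"\n' % item)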
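As an alternative to escaping by hand, the standard-library csv module can take care of the quoting. The following is only a sketch under assumptions: the names extract_bodies and write_rows are made up for illustration, the header format "N of M DOCUMENTS" is taken from the sample in the question, and UTF-8 is assumed as the file encoding.

```python
import csv
import re

# Assumption: each article starts with a header line like "12 of 5000 DOCUMENTS".
# The lookahead stops a body at the next header, or at the end of the text.
BODY_RE = re.compile(r'BODY:(.+?)(?=\n\s*\d+ of \d+ DOCUMENTS|\Z)', re.S)


def extract_bodies(text):
    """Return the text after each 'BODY:' marker, stripped of surrounding whitespace."""
    return [body.strip() for body in BODY_RE.findall(text)]


def write_rows(bodies, path):
    """Write one article body per row (a single column) to a CSV file."""
    # newline='' lets the csv module control line endings itself.
    with open(path, 'w', newline='', encoding='utf-8') as handle:
        writer = csv.writer(handle)
        for body in bodies:
            writer.writerow([body])


# Example usage (file names are placeholders):
# with open('text.txt', encoding='utf-8') as txt:
#     write_rows(extract_bodies(txt.read()), 'result.csv')
```

Because csv.writer quotes fields that contain newlines or double quotes, the bodies keep their line breaks and Excel should still open each one as a single cell.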

One more note: try it on a small amount of content first.
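For a quick check on a small sample, the line-by-line loop from the question's edit could be completed roughly as follows. This is a sketch, not part of the original posts: split_sections, HEADER_RE, and the shortened sample text are illustrative assumptions.

```python
import re

# Assumption: article headers look like "12 of 5000 DOCUMENTS".
HEADER_RE = re.compile(r'^\s*\d+ of \d+ DOCUMENTS\s*$')


def split_sections(lines):
    """Collect the lines of each article body, from 'BODY:' up to the next header."""
    sections = []
    current = None  # None means we are currently outside a body
    for line in lines:
        if line.startswith('BODY:'):
            # Start a new body; keep any text on the same line as the marker.
            current = [line[len('BODY:'):]]
            sections.append(current)
        elif HEADER_RE.match(line):
            # The next article header ends the current body.
            current = None
        elif current is not None:
            current.append(line)
    return [''.join(section).strip() for section in sections]


# A shortened sample in the same layout as the file in the question.
sample = """1 of 2 DOCUMENTS

HEADLINE: Cadbury slams Kraft bid

BODY:

  On cue ...

                 2 of 2 DOCUMENTS

HEADLINE: Perrigo rejects formal offer from Mylan

BODY: second body
"""

for body in split_sections(sample.splitlines(keepends=True)):
    print(repr(body))
```

Reading line by line avoids loading the whole file at once; passing an open file object instead of sample.splitlines(keepends=True) should work the same way, since iterating a text file also yields lines with their line endings.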