Question

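I'm new to programming and have already found many useful threads, but none that covers exactly what I need.
I have a text file that looks like the sample shown below.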

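As output, I would like the body of each article saved on its own row (one cell per article body) in a single file (I have about 5000 articles to process). The output would be 5000 rows and 1 column. From what I have found, "re" seems to be the best solution. The recurring keywords are BODY: and perhaps DOCUMENTS. How can I extract only the text between these keywords, for each article, into a new row in Excel?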

  1 of 5000 DOCUMENTS


                    Copyright 2010 The Deal, L.L.C.
                          All Rights Reserved
                          Daily Deal/The Deal

                        January 12, 2010 Tuesday

HEADLINE: Cadbury slams Kraft bid

BODY:

  On cue .....

......

body of article here

......

DEAL SIZE

$ 10-50 Billion

                            2 of 5000 DOCUMENTS


                    Copyright 2015 The Deal, L.L.C.
                          All Rights Reserved
                           The Deal Pipeline

                      September 17, 2015 Thursday

HEADLINE: Perrigo rejects formal offer from Mylan

BODY: 
(and here again the body of this article)

DEAL SIZE

Or something like this?

import re
inputtext = 'F:\text.txt'
re.split(r'\n(?=BODY:)', inputtext)

I'm a bit lost about where to look. Thanks in advance!

Edit: thanks to ewwink, this is where I currently am:

section = []
for line in open_file_object:
    if line.startswith('BODY:'):
        # new section
        if section:
            process_section(section)
        section = [line]
    else:
        section.append(line)
if section:
    process_section(section)

Answer 1

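Open the file using with open(r'F:\text.txt', mode), where mode is 'r' for reading or 'w' for writing (the r prefix keeps \t from being read as a tab character). Extract the content with re.findall, and finally escape \n, double quotes " and other special characters before writing.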

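import re

# Read the whole file into memory
with open('text.txt', 'r') as txt:
    inputtext = txt.read()

# Capture the text after "BODY:" up to the next "N of 5000 DOCUMENTS"
# header, or to the end of the file for the last article
articlesBody = re.findall(r'BODY:(.+?)(?=\d+\sof\s5000\sDOCUMENTS|\Z)', inputtext, re.S)

#print(articlesBody)

with open('result.csv', 'w') as csv:
    for item in articlesBody:
        # Escape newlines and double quotes so each body stays in one cell
        item = item.strip().replace('\n', '\\n').replace('"', '""')
        # One quoted cell per line, so each article gets its own row
        csv.write('"%s"\n' % item)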
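
As an alternative to escaping by hand, the standard-library csv module can handle the quoting. The following is only a sketch: the helper name write_rows is hypothetical and not part of the answer above, and it assumes the bodies have already been extracted into a list of strings.

```python
import csv


def write_rows(bodies, path):
    """Write one article body per row, in a single column (hypothetical helper)."""
    # newline='' lets csv.writer control the line endings itself
    with open(path, 'w', newline='', encoding='utf-8') as out:
        writer = csv.writer(out)
        for body in bodies:
            # csv.writer quotes any field containing commas, quotes, or newlines
            writer.writerow([body.strip()])
```

It could be called as write_rows(articlesBody, 'result.csv'). Newlines inside a body are then kept as real line breaks within the quoted cell rather than being turned into a literal \n.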

One more note: try it on a small amount of content first.
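
A small check of that kind might look like the sketch below. It runs the extraction on a trimmed copy of the sample from the question instead of the full file. The pattern is an assumption based on that sample: it expects every article to be preceded by an "N of 5000 DOCUMENTS" header and treats the end of the text as the end of the last article.

```python
import re

# Trimmed copy of the sample from the question
sample = """1 of 5000 DOCUMENTS

HEADLINE: Cadbury slams Kraft bid

BODY:

  On cue .....

DEAL SIZE

$ 10-50 Billion

                            2 of 5000 DOCUMENTS

HEADLINE: Perrigo rejects formal offer from Mylan

BODY: 
(body of the second article)

DEAL SIZE
"""

# Assumed pattern: body ends at the next document header or at end of text
pattern = r'BODY:(.+?)(?=\d+\sof\s5000\sDOCUMENTS|\Z)'
bodies = [b.strip() for b in re.findall(pattern, sample, re.S)]

print(len(bodies))
for body in bodies:
    print(repr(body))
```

Printing the repr of each body makes it easy to see that trailing fields such as DEAL SIZE are still part of the captured text; if they are not wanted, the pattern would need another stop keyword.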

Python: split text by keyword into Excel rows
