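I'm new to programming and have already found a lot of helpful threads, but none that is quite what I need.
I have a text file in the format of the sample further down (the entries headed "N of 5000 DOCUMENTS").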
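As output, I would like the body of each article on its own row (one cell per article body), saved to a file. I have about 5000 articles to process, so the output would be 5000 rows and 1 column. From what I have found, the re module looks like the best solution. The recurring keywords are BODY: and perhaps DOCUMENTS. How can I extract only the text between these keywords for each article and put it into a new row in Excel?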
1 of 5000 DOCUMENTS
Copyright 2010 The Deal, L.L.C.
All Rights Reserved
Daily Deal/The Deal
January 12, 2010 Tuesday
HEADLINE: Cadbury slams Kraft bid
BODY:
On cue .....
......
body of article here
......
DEAL SIZE
$ 10-50 Billion
2 of 5000 DOCUMENTS
Copyright 2015 The Deal, L.L.C.
All Rights Reserved
The Deal Pipeline
September 17, 2015 Thursday
HEADLINE: Perrigo rejects formal offer from Mylan
BODY:
(and here again the body of this article)
DEAL SIZE
Is it something along these lines?
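import re
# Raw string: in a normal string literal, '\t' would be read as a tab
with open(r'F:\text.txt') as f:
    inputtext = f.read()  # split the file's contents, not the path string
sections = re.split(r'\n(?=BODY:)', inputtext)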
I'm a bit lost as to where to look. Thanks in advance!
Edit: thanks to ewwink, this is where I am at the moment:
section = []
for line in open_file_object:
    if line.startswith('BODY:'):
        # new section
        if section:
            process_section(section)
        section = [line]
    else:
        section.append(line)
if section:
    process_section(section)
Answer 0 (score: 1)
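Open the file using with open(r'F:\text.txt', mode), where mode is 'r' for reading or 'w' for writing (the r prefix makes it a raw string, so \t is not read as a tab). Extract the content with re.findall. Finally, before writing, you need to escape \n, double quotes (") and other special characters.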
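import re

with open('text.txt', 'r') as txt:
    inputtext = txt.read()

# Capture the text after "BODY:" up to the next "N of 5000" header,
# or up to the end of the file for the last article
articlesBody = re.findall(r'BODY:(.+?)(?:\d+\sof\s5000|\Z)', inputtext, re.S)
# print(articlesBody)

with open('result.csv', 'w') as out:
    for item in articlesBody:
        # Escape newlines and double the quotes so each body stays in one cell
        item = item.replace('\n', '\\n').replace('"', '""')
        out.write('"%s"\n' % item)  # one article body per row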
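The manual escaping of newlines and double quotes can also be left to Python's standard csv module, which quotes such fields itself. The following is only a minimal sketch under assumptions: it expects every article to start with a header of the form "N of M DOCUMENTS", as in the question's sample, and the names BODY_RE, extract_bodies and write_bodies are made up for illustration.

```python
import csv
import re

# Text after "BODY:" up to the next "N of M DOCUMENTS" header,
# or up to the end of the text for the last article (assumed layout).
BODY_RE = re.compile(r'BODY:(.+?)(?=\d+\sof\s\d+\sDOCUMENTS|\Z)', re.S)


def extract_bodies(text):
    """Return the stripped body of every article found in text."""
    return [body.strip() for body in BODY_RE.findall(text)]


def write_bodies(bodies, path):
    """Write one article body per row, in a single column."""
    # newline='' lets the csv module control line endings itself.
    with open(path, 'w', newline='', encoding='utf-8') as out:
        writer = csv.writer(out)
        for body in bodies:
            writer.writerow([body])
```

A call would look roughly like write_bodies(extract_bodies(inputtext), 'result.csv'). Note that anything between the body and the next header, such as the DEAL SIZE lines in the sample, still ends up in the extracted text.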
One more note: try it on a small amount of content first.
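One way to try this on a small sample is to feed a few lines through an in-memory file object before touching the real file. The sketch below does that with a line-by-line variant of the loop from the question's edit. It is an illustration under assumptions, not a definitive implementation: iter_bodies, HEADER_RE and the two-article sample are hypothetical, and a line of the form "N of M DOCUMENTS" is assumed to mark the start of the next article.

```python
import io
import re

# A header line such as "2 of 5000 DOCUMENTS" starts the next article (assumed).
HEADER_RE = re.compile(r'^\d+ of \d+ DOCUMENTS\s*$')


def iter_bodies(lines):
    """Yield each article body found in an iterable of lines."""
    section = None  # None while we are outside a body
    for line in lines:
        if line.startswith('BODY:'):
            if section is not None:
                yield ''.join(section).strip()
            # Keep any text that follows "BODY:" on the same line.
            section = [line[len('BODY:'):]]
        elif HEADER_RE.match(line):
            if section is not None:
                yield ''.join(section).strip()
            section = None
        elif section is not None:
            section.append(line)
    # The last article has no header after it.
    if section is not None:
        yield ''.join(section).strip()


# Small made-up sample in the same layout as the question's file.
sample = io.StringIO(
    "1 of 2 DOCUMENTS\n"
    "HEADLINE: Cadbury slams Kraft bid\n"
    "BODY:\n"
    "On cue .....\n"
    "2 of 2 DOCUMENTS\n"
    "HEADLINE: Perrigo rejects formal offer from Mylan\n"
    "BODY:\n"
    "(second body)\n"
)
bodies = list(iter_bodies(sample))
print(bodies)
```

Because iter_bodies only needs an iterable of lines, the same function can later be given the real file object, which avoids reading all 5000 articles into memory at once.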