This is my input file:
THIS IS A TITLE
1. THIS IS A SUBTITLE
This is body text.
This is body text.
This is body text.
This is body text.
THIS IS A TITLE
This is body text.
THIS IS A TITLE
1. THIS IS A SUBTITLE
2. THIS IS A SUBTITLE
This is body text.
This is body text.
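I want to build a simple list of the titles, without the subtitles or the body text. How can I do that? So far I have thought about looping over the file and keeping each line for which isupper() is true, but that also picks up the subtitles. isalpha() rejects every title that contains a space, so that does not work either. What can I do? I would prefer a loop to a regular expression.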
Answer 0 (score: 1)
Without regular expressions, you can do this:
# Read the file in as a single string, with all the newlines intact.
with open('file.txt', 'r') as f:
    file_str = f.read()
# Split into paragraphs
paragraphs = file_str.split('\n\n')
titles = []
for p in paragraphs:
    # Split a paragraph into lines, and get the first line of the paragraph
    # (which is the title).
    titles.append(p.split('\n')[0])
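If you put the sample input from the question into file.txt (with a blank line between paragraphs, which the split on '\n\n' relies on), the variable titles ends up as: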
['THIS IS A TITLE', 'THIS IS A TITLE', 'THIS IS A TITLE']
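The paragraph-splitting approach can also be tried on an in-memory string. This is a minimal sketch, not the answerer's exact code: the function name first_lines_of_paragraphs and the sample text are made up for illustration, and it assumes that paragraphs are separated by blank lines and that each paragraph starts with its title.

```python
def first_lines_of_paragraphs(text):
    """Return the first line of each blank-line-separated paragraph."""
    titles = []
    for paragraph in text.split('\n\n'):
        # Drop stray newlines left over at the edges of a chunk.
        paragraph = paragraph.strip('\n')
        if paragraph:  # skip empty chunks produced by runs of blank lines
            titles.append(paragraph.split('\n')[0])
    return titles


# Hypothetical sample with a blank line between the two paragraphs.
sample = (
    "THIS IS A TITLE\n"
    "1. THIS IS A SUBTITLE\n"
    "This is body text.\n"
    "\n"
    "THIS IS A TITLE\n"
    "This is body text.\n"
)
print(first_lines_of_paragraphs(sample))
```

If the file has no blank lines between paragraphs, this returns only the very first line, so a per-line check is the safer choice in that case.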
Answer 1 (score: 1)
Once you have read the file, this is a one-liner:
Input (if the file was read as a single string, s):
output = [t for t in [i for i in s.split('\n') if all(j.isupper() for j in i.split())] if t!='']
Input (if the file was read as a list of separate lines, lines, with the trailing newlines already stripped):
output = [t for t in [i for i in lines if all(j.isupper() for j in i.split())] if t!='']
Output:
['THIS IS A TITLE', 'THIS IS A TITLE', 'THIS IS A TITLE']
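Since the question asks for a loop, the same per-word isupper() check can be unrolled into an explicit loop. This is a sketch under the assumption that a title is any non-blank line whose whitespace-separated words are all uppercase. The function name extract_titles and the sample text are illustrative only.

```python
def extract_titles(lines):
    """Collect lines in which every whitespace-separated word is uppercase."""
    titles = []
    for line in lines:
        words = line.split()
        # Skip blank lines, then require every word to be uppercase.
        # "1.".isupper() is False (it has no cased characters), so the
        # numbered subtitles are dropped along with the body text.
        if words and all(word.isupper() for word in words):
            titles.append(line.strip())
    return titles


sample = """THIS IS A TITLE
1. THIS IS A SUBTITLE
This is body text.
THIS IS A TITLE
This is body text.
"""
print(extract_titles(sample.splitlines()))
```

One consequence of this check is that a title containing a purely numeric word, such as "CHAPTER 2", would also be rejected, because "2".isupper() is False.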
Answer 2 (score: 0)
You can read the file line by line into a list and then use a regular expression:
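import re
# Read the lines, dropping trailing newlines and skipping blank lines.
with open('filename.txt') as f:
    data = [i.rstrip('\n') for i in f if i.strip()]
# Keep only lines made up of uppercase letters and whitespace.
new_data = [i for i in data if re.fullmatch(r'[A-Z\s]+', i)]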
Output:
['THIS IS A TITLE', 'THIS IS A TITLE', 'THIS IS A TITLE']
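The regular-expression check can be tried without a file by running it over a list of lines. This is a sketch with a slightly stricter, hypothetical pattern than the one in the answer: it requires the line to start with an uppercase letter, so whitespace-only lines cannot match. The names TITLE_RE and regex_titles are made up for illustration.

```python
import re

# Hypothetical pattern: one uppercase letter, then uppercase letters or spaces.
TITLE_RE = re.compile(r'[A-Z][A-Z ]*')


def regex_titles(lines):
    """Return the stripped lines that consist only of uppercase words."""
    stripped = (line.strip() for line in lines)
    return [line for line in stripped if TITLE_RE.fullmatch(line)]


lines = [
    "THIS IS A TITLE\n",
    "1. THIS IS A SUBTITLE\n",
    "This is body text.\n",
    "   \n",
]
print(regex_titles(lines))
```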