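I need to write a function, get_words_from_file(filename), that returns a list of lowercase words. The function should only process the lines between the start and end marker lines. The words should appear in the same order as they do in the file. Here is a sample text file, baboosh.txt: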
*** START OF TEST CASE ***
......list of sentences here.....
*** END OF TEST CASE ***
This is after the end and should be ignored too.
Here is what I came up with:
import re
from string import punctuation

def stripped_lines(lines):
    for line in lines:
        stripped_line = line.rstrip('\n')
        yield stripped_line

def lines_from_file(fname):
    with open(fname, 'rt') as flines:
        for line in stripped_lines(flines):
            yield line

def is_marker_line(line, start='***', end='***'):
    '''
    Marker lines start and end with the given strings, which may not
    overlap. (A line containing just '***' is not a valid marker line.)
    '''
    min_len = len(start) + len(end)
    if len(line) < min_len:
        return False
    return line.startswith(start) and line.endswith(end)

def advance_past_next_marker(lines):
    '''
    '''
    for line in lines:
        if is_marker_line(line):
            break

def lines_before_next_marker(lines):
    valid_lines = []
    for line in lines:
        if is_marker_line(line):
            break
        line.replace('"', '')
        valid_lines.append(line)
    for content_line in valid_lines:
        yield content_line

def lines_between_markers(lines):
    '''
    Yields the lines between the first two marker lines.
    '''
    it = iter(lines)
    advance_past_next_marker(it)
    for line in lines_before_next_marker(it):
        yield line

def words(lines):
    text = '\n'.join(lines).lower().split()
    return text

def get_words_from_file(fname):
    return words(lines_between_markers(lines_from_file(fname)))
#This is the test code that must be executed
filename = "baboosh.txt"
words = get_words_from_file(filename)
print(filename, "loaded ok.")
print("{} valid words found.".format(len(words)))
print("Valid word list:")
for word in words:
    print(word)
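My output

I get the correct list of words, but when they are printed they still carry punctuation such as colons, semicolons, and periods. I don't know how else to get rid of these.
What should I do?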
Answer 0 (score: 1)
Use re.split instead of str.split. If you set up a compiled regular expression like this:
splitter = re.compile('[ ;:".]')
then you can split a line with:
splitter.split(line)
This returns the words without that punctuation.
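As a rough illustration of the approach, the sketch below splits one line with the same character class and drops the empty strings that the split leaves behind wherever two separators sit next to each other (for example a colon followed by a space). The helper name words_from_line is hypothetical and not part of the question's code; the lowercasing is added only to match the stated requirement.

```python
import re

# Same character class as in the answer: a space, semicolon, colon,
# double quote, or period each acts as a separator.
splitter = re.compile('[ ;:".]')

def words_from_line(line):
    # Hypothetical helper: split on the separators, discard the empty
    # strings produced by adjacent separators, and lowercase the rest.
    return [word.lower() for word in splitter.split(line) if word]

print(words_from_line('Hello: this is "a test"; done.'))
```

Other punctuation, such as commas or question marks, would have to be added to the character class to be removed as well. Separately, note that str.replace returns a new string, so the bare line.replace('"', '') call in the question leaves the line unchanged.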