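I need to write a function get_specified_words(filename) that gets a list of lowercase words from a text file. All of the following conditions must be met:
- Include all lowercase character sequences, including those that contain a - or ' character and those that end with a ' character.
- Exclude words that end with a -.
Use this regular expression to extract the words from each relevant line of the file: valid_line_words = re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", line)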
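Make sure the line string is lowercase before applying the regular expression.
Use the optional encoding parameter when opening the file for reading. That is, your open call should look like open(filename, encoding='utf-8'). This is especially useful if your operating system does not set Python's default encoding to UTF-8.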
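The effect of the regular expression and of the lowercase requirement can be checked on a single line. This is only a small sketch: the pattern is copied from the assignment text above, while the sample line and the variable names are made up for illustration.

```python
import re

# Pattern copied from the assignment text.
WORD_PATTERN = r"[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+"

# Hypothetical sample line, not taken from the test file.
line = "Toby's doc-string n1 well- dogs' mark...."

# Without lowercasing, the capital "T" is skipped and the word is truncated.
without_lower = re.findall(WORD_PATTERN, line)

# Lowercasing first lets the pattern see the whole word.
with_lower = re.findall(WORD_PATTERN, line.lower())

print(without_lower)  # ["oby's", 'doc-string', 'n', 'well', "dogs'", 'mark']
print(with_lower)     # ["toby's", 'doc-string', 'n', 'well', "dogs'", 'mark']
```

Note that a trailing - is simply left out of the match ("well-" gives "well"), while a trailing ' is kept ("dogs'").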
The sample text file testing.txt contains the following:
That are after the start and should be dumped.
So should that
and that
and yes, that
*** START OF SYNTHETIC TEST CASE ***
Toby's code was rather "interesting", it had the following issues: short,
meaningless identifiers such as n1 and n; deep, complicated nesting;
a doc-string drought; very long, rambling and unfocused functions; not
enough spacing between functions; inconsistent spacing before and
after operators, just like this here. Boy was he going to get a low
style mark.... Let's hope he asks his friend Bob to help him bring his code
up to an acceptable level.
*** END OF SYNTHETIC TEST CASE ***
This is after the end and should be ignored too.
Have a nice day.
Here is my code:
import re

def stripped_lines(lines):
    for line in lines:
        stripped_line = line.rstrip('\n')
        yield stripped_line

def lines_from_file(fname):
    with open(fname, 'rt') as flines:
        for line in stripped_lines(flines):
            yield line

def is_marker_line(line, start='***', end='***'):
    min_len = len(start) + len(end)
    if len(line) < min_len:
        return False
    return line.startswith(start) and line.endswith(end)

def advance_past_next_marker(lines):
    for line in lines:
        if is_marker_line(line):
            break

def lines_before_next_marker(lines):
    valid_lines = []
    for line in lines:
        if is_marker_line(line):
            break
        valid_lines.append(re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", line))
    for content_line in valid_lines:
        yield content_line

def lines_between_markers(lines):
    it = iter(lines)
    advance_past_next_marker(it)
    for line in lines_before_next_marker(it):
        yield line

def words(lines):
    text = '\n'.join(lines).lower().split()
    return text

def get_valid_words(fname):
    return words(lines_between_markers(lines_from_file(fname)))

# This must be executed
filename = "valid.txt"
all_words = get_valid_words(filename)
print(filename, "loaded ok.")
print("{} valid words found.".format(len(all_words)))
print("word list:")
print("\n".join(all_words))
This is my output:
File "C:/Users/jj.py", line 45, in <module>
text = '\n'.join(lines).lower().split()
builtins.TypeError: sequence item 0: expected str instance, list found
This is the expected output:
valid.txt loaded ok.
73 valid words found.
word list:
toby's
code
was
rather
interesting
it
had
the
following
issues
short
meaningless
identifiers
such
as
n
and
n
deep
complicated
nesting
a
doc-string
drought
very
long
rambling
and
unfocused
functions
not
enough
spacing
between
functions
inconsistent
spacing
before
and
after
operators
just
like
this
here
boy
was
he
going
to
get
a
low
style
mark
let's
hope
he
asks
his
friend
bob
to
help
him
bring
his
code
up
to
an
acceptable
level
I need help getting my code to work. Any help is appreciated.
Answer 0 (score: 1)
lines_between_markers(lines_from_file(fname))
gives you a list of lists of valid words: one list of matches per line.
So you just need to flatten it:
def words(lines):
    words_list = [w for line in lines for w in line]
    return words_list
That does it.
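To see why the original words function raised the TypeError and why flattening fixes it, here is a small sketch. The two input lines and the variable names are made up for illustration; they only mimic what re.findall returns for each line.

```python
import re

PATTERN = r"[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+"

# Hypothetical input: two already-lowercased lines.
lines = ["toby's code was", "a doc-string drought"]

# re.findall returns one list per line, so this is a list of lists.
per_line_matches = [re.findall(PATTERN, line) for line in lines]

# str.join only accepts strings, so joining the nested lists fails
# with the same kind of TypeError as in the question.
try:
    '\n'.join(per_line_matches)
    join_failed = False
except TypeError:
    join_failed = True

# A nested comprehension flattens the per-line lists into one list of words.
flat_words = [w for matches in per_line_matches for w in matches]

print(join_failed)  # True
print(flat_words)   # ["toby's", 'code', 'was', 'a', 'doc-string', 'drought']
```

itertools.chain.from_iterable(per_line_matches) from the standard library would be an equivalent way to flatten, if you prefer that over the comprehension.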
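However, I think you should review the design of your program:
lines_between_markers should only yield the lines between the markers, but it does more than that. The regular expression should be applied to the result of this function, not inside it.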
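One possible way to separate the two responsibilities is sketched below. It is not the only reasonable design: lines_between_markers here is a rewritten version that yields raw lines only, words applies the regular expression afterwards, and the in-memory sample list is a made-up stand-in for the file's lines.

```python
import re

WORD_PATTERN = r"[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+"

def is_marker_line(line, start='***', end='***'):
    """Return True if the line starts and ends with the marker strings."""
    return (len(line) >= len(start) + len(end)
            and line.startswith(start) and line.endswith(end))

def lines_between_markers(lines):
    """Yield only the raw lines between the first two marker lines."""
    it = iter(lines)
    for line in it:          # skip everything up to the start marker
        if is_marker_line(line):
            break
    for line in it:          # yield lines until the end marker
        if is_marker_line(line):
            return
        yield line

def words(lines):
    """Lowercase each line and collect the regex matches in one flat list."""
    return [w for line in lines
            for w in re.findall(WORD_PATTERN, line.lower())]

# Hypothetical in-memory input standing in for the file's lines.
sample = [
    "Ignored preamble",
    "*** START ***",
    "Toby's code",
    "had a doc-string drought;",
    "*** END ***",
    "Ignored trailer",
]
result = words(lines_between_markers(sample))
print(result)  # ["toby's", 'code', 'had', 'a', 'doc-string', 'drought']
```

With this split, lines_between_markers can be tested on its own, and the word extraction no longer needs to know anything about markers.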
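Things you did not do:
Make sure the line string is lowercase before applying the regular expression.
Use the optional encoding parameter when opening the file for reading. That is, your open call should look like open(filename, encoding='utf-8').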
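Putting these two missing requirements together with the marker handling, a compact version of get_specified_words might look like the sketch below. It is one possible arrangement under stated assumptions rather than the required solution: the single-loop structure, the demo text, and the temporary file are all made up for illustration.

```python
import os
import re
import tempfile

WORD_PATTERN = r"[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+"

def is_marker_line(line):
    """Marker lines start and end with '***', as in the question's code."""
    return len(line) >= 6 and line.startswith('***') and line.endswith('***')

def get_specified_words(filename):
    """Return the lowercase words found between the two marker lines."""
    found = []
    inside = False
    # Explicit encoding, as the assignment asks for.
    with open(filename, encoding='utf-8') as infile:
        for raw_line in infile:
            line = raw_line.rstrip('\n')
            if is_marker_line(line):
                if inside:
                    break        # end marker reached
                inside = True    # start marker reached
                continue
            if inside:
                # Lowercase first, then apply the regular expression.
                found.extend(re.findall(WORD_PATTERN, line.lower()))
    return found

# Hypothetical demo file; the text and the temporary path are illustrative.
demo_text = (
    "Dropped line\n"
    "*** START OF DEMO ***\n"
    "Let's hope Bob \u2013 his friend \u2013 helps.\n"
    "*** END OF DEMO ***\n"
    "Also dropped\n"
)
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt',
                                 delete=False) as tmp:
    tmp.write(demo_text)
    demo_path = tmp.name

demo_words = get_specified_words(demo_path)
os.remove(demo_path)
print(demo_words)  # ["let's", 'hope', 'bob', 'his', 'friend', 'helps']
```

The demo line contains two non-ASCII dashes, which is the kind of content where reading with an explicit encoding='utf-8' avoids depending on the platform default.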