I'm working on a Project Euler problem (for fun). It comes with a 46 KB txt file containing a single line with a list of over 5,000 names, in this format:
"MARIA","SUSAN","ANGELA","JACK"...
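My plan is to write a method that extracts each name and appends it to a Python list. Is a regular expression the best weapon for this? I looked through the Python re documentation, but I'm having a hard time working out the right regex.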
Answer 0: (score: 3)
This looks like a format the csv module would handle well. Then you wouldn't need to write any regex at all.
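Answer 1: (score: 3)
If the file is formatted as you say, i.e. a single line of comma-separated, double-quoted names, then this should work: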
>>> import csv
>>> lines = csv.reader(open('words.txt', 'r'), delimiter=',')
>>> words = next(lines)
>>> words
['MARIA', 'SUSAN', 'ANGELA', 'JACK']
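If you're unsure whether csv copes with thousands of fields on one line, a quick in-memory check might look like the sketch below. The data is synthetic (the `NAME0`, `NAME1`, ... values are made up to mimic the question's format), and it assumes Python 3, where `io.StringIO` accepts a plain `str`.

```python
import csv
import io

# Synthetic data: 5000 made-up names on a single line, in the question's format.
line = ",".join('"NAME{}"'.format(i) for i in range(5000))

# csv.reader accepts any iterable of lines, including an in-memory text stream.
reader = csv.reader(io.StringIO(line))
names = next(reader)  # the first (and only) row

print(len(names))   # number of fields parsed from the single line
print(names[:3])    # first few names, quotes already stripped
```

The same pattern works on the real file by passing the open file object to `csv.reader` instead of the `StringIO`.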
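Answer 2: (score: 1)
A regex will do the job, but it would be inefficient. Using csv will work, but it may not handle 5000 cells in a single line well. At the very least it has to load the entire file and keep the whole list of names in memory (which is probably not a problem for you, since this is a very small amount of data). If you want an iterator for a relatively large file (much larger than 5000 names), a state machine will do the trick: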
def parse_chunks(chunks, quote='"', delim=',', escape='\\'):
    in_quote = False
    in_escaped = False
    buffer = ''
    for chunk in chunks:
        for byte in chunk:
            if in_escaped:
                # Done with the escape char, add it to the buffer
                buffer += byte
                in_escaped = False
            elif byte == escape:
                # The next character will be added literally and not parsed
                in_escaped = True
            elif in_quote:
                if byte == quote:
                    in_quote = False
                else:
                    buffer += byte
            elif byte == quote:
                in_quote = True
            elif byte in (' ', '\n', '\t', '\r'):
                # Ignore whitespace outside of quotes
                pass
            elif byte == delim:
                # Done with this block of text
                yield buffer
                buffer = ''
            else:
                buffer += byte
    # All input consumed: an open quote or a dangling escape is an error
    if in_quote:
        raise ValueError('Found unbalanced quote char %r' % quote)
    elif in_escaped:
        raise ValueError('Found unbalanced escape char %r' % escape)
    # Yield the last bit in the buffer
    yield buffer
data = r"""
"MARIA","SUSAN",
"ANG
ELA","JACK",,TED,"JOE\""
"""
print(list(parse_chunks(data)))
# ['MARIA', 'SUSAN', 'ANG\nELA', 'JACK', '', 'TED', 'JOE"']
# Iterating over a file object yields whole lines, so read fixed-size chunks
# explicitly if you know the file has only one long line or
# don't care about line parsing
buffer_size = 4096
with open('myfile.txt', 'r') as f:
    for name in parse_chunks(iter(lambda: f.read(buffer_size), '')):
        print(name)
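For comparison, the regex route mentioned at the start of this answer could look something like the sketch below. It assumes every name is wrapped in double quotes and that no name contains an escaped quote, so it is less general than the state machine; `extract_names` is just an illustrative helper name.

```python
import re

def extract_names(text):
    # Capture whatever sits between each pair of double quotes.
    # Assumes no escaped quotes inside a name and no unquoted fields.
    return re.findall(r'"([^"]*)"', text)

sample = '"MARIA","SUSAN","ANGELA","JACK"'
print(extract_names(sample))
# ['MARIA', 'SUSAN', 'ANGELA', 'JACK']
```

For a 46 KB file, reading the whole text and calling this once should be perfectly adequate.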
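Answer 3: (score: 1)
If you can do it more simply, do it more simply. There's no need for the csv module here. I don't think 5000 names or 46 KB is enough to worry about.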
names = []
with open("names.txt", "r") as f:
    # In case there is more than one line...
    for line in f:
        # Extend the list rather than overwrite it on each line
        names += [x.strip().replace('"', '') for x in line.split(",")]
print(names)
# should print ['name1', ... , ...]