I have a text file like this:
john123:
1
2
coconut_rum.zip
bob234513253:
0
jackdaniels.zip
nowater.zip
3
judy88009:
dontdrink.zip
9
tommi54321:
dontdrinkalso.zip
92
...
I have millions of entries like this.
I want to pick out the names whose number is exactly 5 digits long. I tried this:
matches = re.findall(r'\w*\d{5}:',filetext2)
But it gives me results with at least 5 digits.
['bob234513253:', 'judy88009:', 'tommi54321:']
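Q1: How do I find the names with exactly 5 digits?
Q2: I want to attach the zip files associated with these names to the 5-digit names. How can I do this with a regex?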
Answer 0: (score: 3)
This is because \w also matches digit characters:
>>> import re
>>> re.match('\w*', '12345')
<_sre.SRE_Match object at 0x021241E0>
>>> re.match('\w*', '12345').group()
'12345'
>>>
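You need to be more specific and tell Python that you only want letters before the digits. The leading \b matters with findall: without it, the search can start in the middle of a longer run of digits and return a fragment such as '13253:'.
matches = re.findall(r'\b[A-Za-z]+\d{5}:',filetext2)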
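To make the difference concrete, here is a minimal sketch that runs three patterns over the header lines from the question. The variable names (sample, loose, unanchored, anchored) are hypothetical and only serve the illustration.

```python
import re

# Hypothetical sample text; only the header lines matter here.
sample = "john123:\nbob234513253:\njudy88009:\ntommi54321:\n"

# Pattern from the question: \w also matches digits, so extra digits are absorbed.
loose = re.findall(r'\w*\d{5}:', sample)

# Letters only, but unanchored: the search can restart inside a long digit run.
unanchored = re.findall(r'[A-Za-z]*\d{5}:', sample)

# Letters only, starting at a word boundary: the five digits must directly
# follow the letters and be followed by the colon.
anchored = re.findall(r'\b[A-Za-z]+\d{5}:', sample)

print(loose)       # ['bob234513253:', 'judy88009:', 'tommi54321:']
print(unanchored)  # ['13253:', 'judy88009:', 'tommi54321:']
print(anchored)    # ['judy88009:', 'tommi54321:']
```

When the pattern is applied with re.match to one header line at a time, the boundary is not needed, because re.match only tries the start of the string.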
For the second question, you can use something like this:
import re
# Dictionary to hold the results
results = {}
# Break up the file text to get the names and their associated data.
# filetext2.split('\n\n') breaks it up into individual data blocks (one per person).
# Mapping str.splitlines over the blocks breaks each data block into single lines.
for name, *data in map(str.splitlines, filetext2.split('\n\n')):
    # See if the name matches our pattern.
    if re.match(r'[A-Za-z]*\d{5}:', name):
        # Add the name and the relevant data to the results dictionary.
        # [:-1] gets rid of the colon on the end of the name.
        # The list comprehension gets only the file names from the data.
        results[name[:-1]] = [x for x in data if x.endswith('.zip')]
Or, without all the comments:
import re
results = {}
for name, *data in map(str.splitlines, filetext2.split('\n\n')):
    if re.match(r'[A-Za-z]*\d{5}:', name):
        results[name[:-1]] = [x for x in data if x.endswith('.zip')]
Here is a demo:
>>> import re
>>> filetext2 = '''\
... john123:
... 1
... 2
... coconut_rum.zip
...
... bob234513253:
... 0
... jackdaniels.zip
... nowater.zip
... 3
...
... judy88009:
... dontdrink.zip
... 9
...
... tommi54321:
... dontdrinkalso.zip
... 92
... '''
>>> results = {}
>>> for name, *data in map(str.splitlines, filetext2.split('\n\n')):
...     if re.match(r'[A-Za-z]*\d{5}:', name):
...         results[name[:-1]] = [x for x in data if x.endswith('.zip')]
...
>>> results
{'tommi54321': ['dontdrinkalso.zip'], 'judy88009': ['dontdrink.zip']}
>>>
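Keep in mind, though, that reading the entire contents of the file at once is not very efficient. Instead, you should consider using a generator function that yields one data block at a time. You can also improve performance by precompiling the regex pattern.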
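A minimal sketch of both suggestions, assuming the blocks are separated by blank lines as in the demo. The names HEADER, iter_blocks and collect_zips are hypothetical, and Pattern.fullmatch is used here instead of re.match so that the whole header line has to fit the pattern.

```python
import re

# Precompile once; fullmatch() requires the whole header to fit the pattern.
HEADER = re.compile(r'[A-Za-z]+\d{5}:')

def iter_blocks(lines):
    """Yield one block (a list of non-empty, stripped lines) at a time."""
    block = []
    for line in lines:
        line = line.strip()
        if line:
            block.append(line)
        elif block:
            # A blank line closes the current block.
            yield block
            block = []
    if block:
        # Last block when the input does not end with a blank line.
        yield block

def collect_zips(lines):
    """Map each name with exactly 5 digits to the .zip files in its block."""
    results = {}
    for name, *data in iter_blocks(lines):
        if HEADER.fullmatch(name):
            results[name[:-1]] = [x for x in data if x.endswith('.zip')]
    return results

demo = """john123:
1
2
coconut_rum.zip

bob234513253:
0
jackdaniels.zip
nowater.zip
3

judy88009:
dontdrink.zip
9

tommi54321:
dontdrinkalso.zip
92
""".splitlines()

print(collect_zips(demo))
```

Because iter_blocks accepts any iterable of lines, an open file object can be passed in directly, so only one block is held in memory at a time.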
Answer 1: (score: 1)
import re
results = {}
with open('datazip') as f:
    records = f.read().split('\n\n')
for record in records:
    lines = record.split()
    if not lines:
        # skip empty records, e.g. from a trailing blank line
        continue
    header = lines[0]
    # note that you need a raw string
    if re.match(r"[^\d]\d{5}:", header[-7:]):
        # in general multiple hits are possible, so put them into a list
        results[header] = [l for l in lines[1:] if l[-3:] == "zip"]
print(results)
{'tommi54321:': ['dontdrinkalso.zip'], 'judy88009:': ['dontdrink.zip']}
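I tried to keep it very simple. If your input is very long, you should follow iCodez's suggestion and implement a generator that yields one record at a time. For the regex match, I tried a small optimization: only the last 7 characters of the header are searched.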
import re
def records(f):
    record = []
    for l in f:
        l = l.strip()
        if l:
            record.append(l)
        elif record:
            # a blank line ends the current record
            yield record
            record = []
    # yield the last record if the file does not end with a blank line
    if record:
        yield record
results = {}
with open('datazip') as f:
    for record in records(f):
        head = record[0]
        if re.match(r"[^\d]\d{5}:", head[-7:]):
            results[head] = [r for r in record[1:] if r[-3:] == "zip"]
print(results)
Answer 2: (score: 0)
You need to anchor the regex at the end of the word, which you can do with \b:
[a-zA-Z]+\d{5}\b
See the example at http://regex101.com/r/oC1yO6/1
The regex matches the names in these lines:
judy88009:
tommi54321:
The Python code would look like this:
>>> re.findall(r'[a-zA-Z]+\d{5}\b', x)
['judy88009', 'tommi54321']
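One caveat with a bare \b: it would also match a file name that happens to look like a name, since there is a word boundary before the dot. As a sketch of a stricter variant, the pattern below ties the match to a whole header line with re.MULTILINE and a lookahead for the trailing colon. The file name vodka12345.zip is a made-up example and does not appear in the question's data.

```python
import re

# Hypothetical input: 'vodka12345.zip' is invented to show the edge case.
x = (
    "john123:\ncoconut_rum.zip\n"
    "bob234513253:\nvodka12345.zip\n"
    "judy88009:\ndontdrink.zip\n"
    "tommi54321:\n92\n"
)

# Word-boundary version from the answer: also picks up the file name.
word_boundary = re.findall(r'[a-zA-Z]+\d{5}\b', x)

# Line-anchored version: ^ matches at the start of each line (re.MULTILINE),
# and (?=:$) requires a colon at the end of the line without capturing it.
line_anchored = re.findall(r'^[A-Za-z]+\d{5}(?=:$)', x, re.MULTILINE)

print(word_boundary)  # ['vodka12345', 'judy88009', 'tommi54321']
print(line_anchored)  # ['judy88009', 'tommi54321']
```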