I have a long string like this:
s = 'label("id1","A") label("id1","B") label("id2", "C") label("id2","A") label("id2","D") label("id3","A")'
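I want to use a regular expression to build lists of labels grouped by id.
To be clearer, starting from the string s in the example, I want to get a result list like this: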
[("id1", ["A","B"]),
("id2", ["C","A","D"]),
("id3", ["A"])]
Using a regular expression I managed to get the ids and the elements:
import re
regex = re.compile(r'label\((\S*),(\S*)\)')
results = re.findall(regex,s)
With this code, results looks like:
[('"id1"', '"A"'),
('"id1"', '"B"'),
('"id2"', '"A"'),
('"id2"', '"D"'),
('"id3"', '"A"')]
Is there a simple way to get the data already correctly grouped from the regular expression?
Answer 0 (score: 1)
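You can loop over the findall() results and collect them in a collections.defaultdict object. Do adjust the regular expression so that it excludes the quotes and tolerates some whitespace, though: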
from collections import defaultdict
import re
regex = re.compile(r'label\("([^"]*)",\s*"([^"]*)"\)')
results = defaultdict(list)
for id_, tag in regex.findall(s):
    results[id_].append(tag)
print(list(results.items()))
If all you want is unique values, you can replace list with set and append() with add().
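As a minimal sketch of that set-based variant: the input string below is a hypothetical one with a repeated ("id1", "A") pair, chosen only to show the de-duplication, and the name unique is an illustrative choice, not something from the answer above.

```python
from collections import defaultdict
import re

# Hypothetical input: the ("id1", "A") pair appears twice.
s = 'label("id1","A") label("id1","B") label("id1","A") label("id2", "C")'

regex = re.compile(r'label\("([^"]*)",\s*"([^"]*)"\)')

unique = defaultdict(set)
for id_, tag in regex.findall(s):
    unique[id_].add(tag)  # adding a tag that is already present changes nothing

# Sets are unordered, so sort the tags to get a stable display.
for id_ in sorted(unique):
    print(id_, sorted(unique[id_]))
```

Note that a set does not remember the order in which the tags appeared, so this variant only fits when that order does not matter.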
Demo of the list-based version:
>>> from collections import defaultdict
>>> import re
>>> s = 'label("id1","A") label("id1","B") label("id2", "C") label("id2","A") label("id2","D") label("id3","A")'
>>> regex = re.compile(r'label\("([^"]*)",\s*"([^"]*)"\)')
>>> results = defaultdict(list)
>>> for id_, tag in regex.findall(s):
...     results[id_].append(tag)
...
>>> results.items()
[('id2', ['C', 'A', 'D']), ('id3', ['A']), ('id1', ['A', 'B'])]
The dictionary order shown is arbitrary. If needed, you can also sort this result, e.g. with sorted(results.items()).
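A short sketch of the sorting step, assuming that ordering by id is what is wanted (the name ordered is illustrative). The tags inside each list keep their order of appearance in s; only the (id, tags) pairs are sorted.

```python
from collections import defaultdict
import re

s = 'label("id1","A") label("id1","B") label("id2", "C") label("id2","A") label("id2","D") label("id3","A")'
regex = re.compile(r'label\("([^"]*)",\s*"([^"]*)"\)')

results = defaultdict(list)
for id_, tag in regex.findall(s):
    results[id_].append(tag)

# The ids are unique dictionary keys, so the tuples are ordered by id alone.
ordered = sorted(results.items())
print(ordered)
# [('id1', ['A', 'B']), ('id2', ['C', 'A', 'D']), ('id3', ['A'])]
```

This matches the list-of-tuples shape asked for in the question.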
Answer 1 (score: 0)
Is post-processing the results acceptable?
If so,
import re
# edited your regex to get rid of the extra quotes, and to allow for the
# optional space that occurs in label("id2", "C")
regex = re.compile(r'label\(\"(\S*)\",\ ?\"(\S*)\"\)')
results = re.findall(regex, s)
resultDict = {}
# id_ rather than id, to avoid shadowing the built-in id()
for id_, val in results:
    if id_ in resultDict:
        resultDict[id_].append(val)
    else:
        resultDict[id_] = [val]
# if you really want a list of tuples rather than a dictionary:
resultList = list(resultDict.items())
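The if/else above can also be collapsed with dict.setdefault(), which returns the existing list for a key or stores and returns a new empty one. This is only a sketch of an alternative spelling, not part of the original answer; the names grouped and result_list are illustrative, and the regex is the same pattern with the unnecessary backslashes dropped.

```python
import re

s = 'label("id1","A") label("id1","B") label("id2", "C") label("id2","A") label("id2","D") label("id3","A")'
regex = re.compile(r'label\("(\S*)", ?"(\S*)"\)')

grouped = {}
for id_, val in regex.findall(s):
    # setdefault() inserts an empty list the first time an id is seen
    grouped.setdefault(id_, []).append(val)

# a list of (id, values) tuples rather than a dictionary
result_list = list(grouped.items())
print(result_list)
```

On Python 3.7 and later the tuples come out in the order the ids first appear in s; on older versions the order is not guaranteed.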