Hello, I have the following data:
hello this is a car
<hamburguer>this car is very good<\hamburguer>I want to fill this rules
this pencil is red and very good, the movie was very fine
<red>the color is blue and green<\red>
<blue>your favorite color is the yellow<\blue>you want this<red>my smartphone is very expensive<\red>
From this data I built the following list:
lines = ['hello this is a car','<hamburguer>this car is very good<\hamburguer>I want to fill this rules','this pencil is red and very good, the movie was very fine','<red>the color is blue and green<\red>','<blue>your favorite color is the yellow<\blue>you want this<red>my smartphone is very expensive<\red>']
From this list I want to build the following dictionary; this is my expected output:
dict_tags = {'<hamburguer>': ['this car is very good'], '<red>': ['the color is blue and green', 'my smartphone is very expensive'], '<blue>': ['your favorite color is the yellow']}
Since I did not know how to proceed, I tried the following:
import re
list_tags = []
for line in lines:
    # re.search returns at most one match per line
    pattern = re.search(r"(?<=>)(.*)(?=<)", line)
    if pattern:
        list_tags.append(pattern.group())
However, the problem is that I only got:
['this car is very good', 'the color is blue and green', 'your favorite color is the yellow<\x08lue>you want this<red>my smartphone is very expensive']
So I need help building the dictionary described above. Thanks in advance. I need the data between the tags, for example:
<red>the color is blue and green<\red>
I need to extract the tag:
<red>
and the information:
the color is blue and green
Answer 0 (score: 2)
Use the re.findall() function and a collections.defaultdict object:
import re, collections
s = '''hello this is a car
<hamburguer>this car is very good<\\hamburguer>I want to fill this rules
this pencil is red and very good, the movie was very fine
<red>the color is blue and green<\\red>
<blue>your favorite color is the yellow<\\blue>you want this<red>my smartphone is very expensive<\\red>'''
tags_dict = collections.defaultdict(list)
tags = re.findall(r'<([^>]+)>([^<>]+)(<\\\1>)', s) # find all tags
for tag_open, value, tag_close in tags:
    tags_dict[tag_open].append(value) # accumulate values for the same tag
print(dict(tags_dict))
Output:
{'hamburguer': ['this car is very good'], 'red': ['the color is blue and green', 'my smartphone is very expensive'], 'blue': ['your favorite color is the yellow']}
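As a side note on why the attempt in the question returned one long string for the last line: `(.*)` is greedy, so it runs from the first `>` to the last `<` on the line, and `re.search` reports only one match per line. The `\x08` in that output most likely comes from `'\b'` being the backspace escape in a non-raw string literal. A minimal sketch of both effects, assuming the sample line is written as a raw string (the variable names `line`, `greedy`, `pairs` and `non_raw` are illustrative, not from the question):

```python
import re

# Raw string, so "\b" and "\r" stay as a backslash followed by a letter.
line = r'<blue>your favorite color is the yellow<\blue>you want this<red>my smartphone is very expensive<\red>'

# Greedy lookaround pattern from the question: a single match that spans
# from the first '>' to the last '<' on the line.
greedy = re.search(r"(?<=>)(.*)(?=<)", line).group()

# Back-referenced pattern: one (tag, value) tuple per tag pair.
pairs = re.findall(r'<([^>]+)>([^<>]+)<\\\1>', line)

# In a non-raw literal, '\b' is the backspace character (0x08).
non_raw = '<\blue>'

print(greedy)
print(pairs)
print(repr(non_raw))
```

Writing the sample data with raw strings (or doubled backslashes, as in `s` above) keeps the closing tags intact so that the back-reference `\1` can match them.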
Answer 1 (score: 1)
Using only re.finditer.
Regex: <([^>]+)>([^>]+)<\\\1>
import re
# text is assumed to hold the data as a single string (for example, s from Answer 0)
lst = {}
for item in re.finditer(r'<([^>]+)>([^>]+)<\\\1>', text):
    lst.setdefault('<%s>' % item.group(1), []).append(item.group(2))
Output:
{'<red>': ['the color is blue and green', 'my smartphone is very expensive'], '<blue>': ['your favorite color is the yellow'], '<hamburguer>': ['this car is very good']}
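Since the question starts from a list of lines rather than one string, the same idea can be applied line by line. The sketch below is one possible way to do that, not the only one. `build_tag_dict` and `TAG_RE` are hypothetical names introduced here, and the sketch assumes that the list is written with raw strings (so the closing tags keep their backslash) and that tags are not nested.

```python
import re
from collections import defaultdict

# Opening tag, text without angle brackets, then the closing tag:
# '<', a literal backslash, the same tag name (back-reference \1), '>'.
TAG_RE = re.compile(r'<([^<>]+)>([^<>]+)<\\\1>')

def build_tag_dict(lines):
    """Map '<tag>' to the list of texts enclosed by that tag."""
    result = defaultdict(list)
    for line in lines:
        # finditer yields every tag pair on the line, not just the first.
        for match in TAG_RE.finditer(line):
            result['<%s>' % match.group(1)].append(match.group(2))
    return dict(result)

# Sample data from the question, written as raw strings (an assumption).
lines = [
    r'hello this is a car',
    r'<hamburguer>this car is very good<\hamburguer>I want to fill this rules',
    r'this pencil is red and very good, the movie was very fine',
    r'<red>the color is blue and green<\red>',
    r'<blue>your favorite color is the yellow<\blue>you want this<red>my smartphone is very expensive<\red>',
]

dict_tags = build_tag_dict(lines)
print(dict_tags)
```

Lines without any tag pair simply contribute nothing, so no separate check for "no match" is needed.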