Solution provided, thanks @ekhumoro! I have a Python dictionary whose values are lists of terms:
myDict = {
    'ID_1': [r'(dog|cat[a-z+]|horse)', r'(car[a-z]+|house|apple\w)', '(bird|tree|panda)'],
    'ID_2': ['(horse|building|computer)', r'(panda\w|lion)'],
    'ID_3': [r'(wagon|tiger|cat\w*)'],
    'ID_4': ['(dog)']
}
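I want to read each list item in the values as a separate regular expression. If a regex matches any text, the matched text should be returned as a key in a new dictionary, with the original keys (the IDs) as its value. So suppose these terms were read as regexes and used to search this string: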
"dog panda cat cats pandas car carts"
The general approach I had in mind is:
# Pseudocode, not runnable as written
for key, value in myDict.items():
    for item in value:
        if re.compile(item) matches something in the text:
            newDict[match] = [list of keys]
The expected output is:
newDict = {
    car: [ID_1],
    carts: [ID_1],
    dog: [ID_1, ID_4],
    panda: [ID_1, ID_2],
    pandas: [ID_1, ID_2],
    cat: [ID_1, ID_3],
    cats: [ID_1, ID_3]
}
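Matched words should be returned as keys in newDict only if they actually matched something in the body of the text. So 'carts' appears in the output because one of ID_1's regexes matches it, and ID_1 is therefore listed as its value.
Solution: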
import re
from collections import defaultdict
text = """
the eye of the tiger
a doggies in the manger
the cat in the hat
a kingdom for my horse
a bird in the hand
the cationic cataclysm
the pandamonious panda pandas
"""
myDict = {
    'ID_1': [r'(dog\w+|cat\w+|horse)', '(car|house|apples)',
             r'(bird|tree|panda\w+)'],
    'ID_2': ['(horse|building|computer)', r'(panda\w+|lion)'],
    'ID_3': ['(wagon|tiger|cat)'],
    'ID_4': ['(dog)'],
}
newDict = defaultdict(list)
for key, values in myDict.items():
    for pattern in values:
        for match in re.finditer(pattern, text):
            newDict[match.group(0)].append(key)
for item in newDict.items():
    print(item)
Answer 0 (score: 2)
Here is a simple script that appears to do what you asked for:
import re
from collections import defaultdict
text = """
the eye of the tiger
a dog in the manger
the cat in the hat
a kingdom for my horse
a bird in the hand
"""
myDict = {
    'ID_1': ['(dog|cat|horse)', '(car|house|apples)', '(bird|tree|panda)'],
    'ID_2': ['(horse|building|computer)', '(panda|lion)'],
    'ID_3': ['(wagon|tiger|cat)'],
    'ID_4': ['(dog)'],
}
newDict = defaultdict(list)
for key, values in myDict.items():
    for pattern in values:
        for match in re.finditer(pattern, text):
            newDict[match.group(0)].append(key)
for item in newDict.items():
    print(item)
Output:
('dog', ['ID_1', 'ID_4'])
('cat', ['ID_1', 'ID_3'])
('horse', ['ID_1', 'ID_2'])
('bird', ['ID_1'])
('tiger', ['ID_3'])
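The loop above appends an ID once per match, so a pattern that matches the same string several times would repeat its ID, and a bare pattern such as (cat) would also match inside longer words. The following is a minimal sketch of one way to handle both, not part of the original answer: the function name match_ids, the whole_words flag, and the patterns are illustrative assumptions. It wraps each pattern in \b word boundaries and skips IDs that are already recorded.

```python
import re
from collections import defaultdict


def match_ids(patterns_by_id, text, whole_words=True):
    """Map each matched string to the IDs whose patterns produced it.

    Hypothetical helper: the name and the whole_words flag are assumptions.
    """
    result = defaultdict(list)
    for key, patterns in patterns_by_id.items():
        for pattern in patterns:
            if whole_words:
                # Non-capturing group so that \b applies to every alternative.
                pattern = r"\b(?:" + pattern + r")\b"
            for match in re.finditer(pattern, text):
                word = match.group(0)
                if key not in result[word]:  # record each ID only once
                    result[word].append(key)
    return dict(result)


# Illustrative patterns, not the exact ones from the question.
patterns_by_id = {
    "ID_1": [r"(dog|cat\w*|horse)", r"(car\w*|house)", r"(bird|panda\w*)"],
    "ID_2": [r"(horse|building)", r"(panda\w*|lion)"],
    "ID_3": [r"(wagon|tiger|cat\w*)"],
    "ID_4": [r"(dog)"],
}

result = match_ids(patterns_by_id, "dog panda cat cats pandas car carts")
for word, ids in result.items():
    print(word, ids)
```

With these assumed patterns and the sample string from the question, each word maps to the IDs listed in the question's expected output. Passing whole_words=False falls back to plain substring matching, as in the script above.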
Answer 1 (score: 1)
One approach is to convert the regexes into plain lists of terms, for example with string manipulation:
In [11]: {id_: "|".join(ls).replace("(", "").replace(")", "").split("|") for id_, ls in myDict.items()}
Out[11]:
{'ID_1': ['dog',
  'cat',
  'horse',
  'car',
  'house',
  'apples',
  'bird',
  'tree',
  'panda'],
 'ID_2': ['horse', 'building', 'computer', 'panda', 'lion'],
 'ID_3': ['wagon', 'tiger', 'cat'],
 'ID_4': ['dog']}
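The string manipulation above only recovers meaningful terms when every pattern is a plain alternation of literal words. As a hedged sketch (the helper names literal_terms and is_literal are assumptions, not part of the answer), the same splitting can be wrapped in a function together with a check that flags terms still containing regex metacharacters:

```python
import re


def literal_terms(patterns):
    """Split '(a|b|c)'-style patterns into their alternatives (hypothetical helper)."""
    joined = "|".join(patterns).replace("(", "").replace(")", "")
    return joined.split("|")


def is_literal(term):
    """True if the term has no regex metacharacters.

    re.escape leaves a string unchanged only when nothing needs escaping.
    """
    return re.escape(term) == term


terms = literal_terms(["(dog|cat|horse)", "(car|house|apples)"])
print(terms)

# A pattern such as cat\w* is not a literal word, so splitting it
# would not give a usable vocabulary entry.
print([t for t in literal_terms([r"(wagon|tiger|cat\w*)"]) if not is_literal(t)])
```

For the patterns in the question, which use \w and character classes, the match-based approach is likely the safer choice.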
You can then turn this into a DataFrame:
In [12]: import pandas as pd
    ...: from collections import Counter
In [13]: pd.DataFrame({id_:Counter( "|".join(ls).replace("(", "").replace(")", "").split("|") ) for id_, ls in myDict.items()}).fillna(0).astype(int)
Out[13]:
          ID_1  ID_2  ID_3  ID_4
apples       1     0     0     0
bird         1     0     0     0
building     0     1     0     0
car          1     0     0     0
cat          1     0     1     0
computer     0     1     0     0
dog          1     0     0     1
horse        1     1     0     0
house        1     0     0     0
lion         0     1     0     0
panda        1     1     0     0
tiger        0     0     1     0
tree         1     0     0     0
wagon        0     0     1     0
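For readers without pandas, the same table can be approximated with the standard library alone. This is a rough sketch, assuming the plain term lists shown in Out[11]; the names terms_by_id and table are illustrative. Unlike Counter, it records 0/1 membership rather than counts, which is the same thing here because no term repeats within an ID.

```python
# Plain term lists, as produced by the string manipulation above.
terms_by_id = {
    "ID_1": ["dog", "cat", "horse", "car", "house", "apples", "bird", "tree", "panda"],
    "ID_2": ["horse", "building", "computer", "panda", "lion"],
    "ID_3": ["wagon", "tiger", "cat"],
    "ID_4": ["dog"],
}

ids = list(terms_by_id)
all_terms = sorted({term for terms in terms_by_id.values() for term in terms})

# One row per term, one 0/1 flag per ID (in the order of `ids`).
table = {
    term: [int(term in terms_by_id[id_]) for id_ in ids]
    for term in all_terms
}

print("term".ljust(10), *ids)
for term, row in table.items():
    print(term.ljust(10), *(str(flag).rjust(4) for flag in row))
```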