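I have a long list of car ad titles and another list of all car makes and models, and I am searching the titles to find matches in the makes/models list. So far I have this: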
    for make in carmakes:
        if make in title:
            return make
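But it doesn't work very well, because the titles are written by people and vary a lot. For example, if the title is "Nissan D-Max" and my makes/models list contains "dmax", the loop won't catch it, because it is not an exact match. What is the best way to check for a match "loosely" or "dynamically"?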
Answer 0 (score: 3)
I once faced a similar challenge; below is a simplified version of the solution:
    import re


    def re_compile(*args, flags: int = re.IGNORECASE, **kwargs):
        # Compile case-insensitively unless the caller passes other flags.
        return re.compile(*args, flags=flags, **kwargs)


    class Term(object):
        """A term recognised through matching rules and forbidding rules."""

        def __init__(self, contain_patterns, *contain_args):
            self.matching_rules = []
            self.forbid_rules = []
            # Accept either a single pattern or an iterable of patterns.
            if isinstance(contain_patterns, str):
                self.may_contain(contain_patterns, *contain_args)
            else:
                for cp in contain_patterns:
                    self.may_contain(cp, *contain_args)

        def __eq__(self, other):
            return isinstance(other, str) and self.is_alias(other)

        def is_alias(self, s: str):
            # s is an alias if no forbidding rule fires and at least one
            # matching rule does.
            return (
                all(not f_rule(s) for f_rule in self.forbid_rules) and
                any(m_rule(s) for m_rule in self.matching_rules)
            )

        # Decorators that register an arbitrary callable as a rule.
        def matching_rule(self, f):
            self.matching_rules.append(f)
            return f

        def forbid_rule(self, f):
            self.forbid_rules.append(f)
            return f

        def must_rule(self, f):
            self.forbid_rules.append(lambda s: not f(s))
            return f

        # Rules based on re.fullmatch (the whole string must match).
        def may_be(self, *re_fullmatch_args):
            self.matching_rules.append(re_compile(*re_fullmatch_args).fullmatch)

        def must_be(self, *re_fullmatch_args):
            fmatch = re_compile(*re_fullmatch_args).fullmatch
            self.forbid_rules.append(lambda s: not fmatch(s))

        def must_not_be(self, *re_fullmatch_args):
            self.forbid_rules.append(re_compile(*re_fullmatch_args).fullmatch)

        # Rules based on re.search (match anywhere in the string).
        def may_contain(self, *re_search_args):
            self.matching_rules.append(re_compile(*re_search_args).search)

        def must_not_contain(self, *re_search_args):
            self.forbid_rules.append(re_compile(*re_search_args).search)

        # Rules based on re.match (match at the start of the string).
        def may_starts_with(self, *re_match_args):
            self.matching_rules.append(re_compile(*re_match_args).match)

        def must_not_starts_with(self, *re_match_args):
            self.forbid_rules.append(re_compile(*re_match_args).match)
In your case, each car_model should be represented as a Term instance with its own regex rules (I don't know much about car brands, so I made up some names):
    if __name__ == '__main__':
        # The hyphen is escaped so that " -." is not read as a character range.
        dmax = Term((r'd[ \-._\'"]?max', r'Nissan DM'))
        dmax.may_contain(r'nissan\s+last\s+(year)?\s*model')
        dmax.must_not_contain(r'Skoda')
        dmax.must_not_contain(r'Volkswagen')

        # A custom rule: "double max" counts only when "nissan" also appears.
        @dmax.matching_rule
        def dmax_check(s):
            return re.search(r'double\s+max', s, re.IGNORECASE) and re.search(r'nissan', s, re.IGNORECASE)

        tg = Term(r'Tiguan')
        octav = Term(r'Octavia')

        titles = (
            'Dmax model',
            'd_Max nissan',
            'Nissan Double Max Pro',
            'nissan last model',
            'Skoda octavia',
            'skoda d-max',
            'Nissan Qashqai',
            'VW Polo double max'
        )
Applied to your example:
        for car_model in (dmax, tg, octav):
            print(car_model in titles)
Result:
    True
    False
    True
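The `car_model in titles` test works because membership in a tuple is checked with `==`, and a `str` that does not know how to compare itself with another type lets Python fall back to the other operand's `__eq__`. A minimal sketch of that mechanism, using a hypothetical `StartsWithA` class that is not part of the answer's code:

```python
class StartsWithA:
    """Hypothetical rule object: equal to any string that starts with 'a'."""

    def __eq__(self, other):
        return isinstance(other, str) and other.lower().startswith('a')


rule = StartsWithA()

# Membership compares the object with each element using ==.
in_first = rule in ('banana', 'Apple')
in_second = rule in ('banana', 'cherry')

# str.__eq__ returns NotImplemented here, so Python tries rule.__eq__ next.
reflected = ('Apple' == rule)

print(in_first, in_second, reflected)
```

One side effect worth knowing: a class that defines `__eq__` without `__hash__` is unhashable, so such objects cannot be used as dict keys or set members.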
In detail:
        print(' '*26, 'DMAX TIGUAN OCTAVIA')
        for title in titles:
            print(title.ljust(26), (dmax == title), (tg == title), (octav == title))
Result:
                               DMAX TIGUAN OCTAVIA
    Dmax model                 True False False
    d_Max nissan               True False False
    Nissan Double Max Pro      True False False
    nissan last model          True False False
    Skoda octavia              False False True
    skoda d-max                False False False
    Nissan Qashqai             False False False
    VW Polo double max         False False False
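To get back to the loop in the question, which returns the matching make for a single title, the same idea can be wrapped in a lookup function. The sketch below is only an illustration: `MiniTerm` is a condensed, hypothetical stand-in for the `Term` class above (search-based rules only) so that the snippet runs on its own, and `find_model` and the `models` dict are names invented here.

```python
import re


class MiniTerm:
    """Condensed, hypothetical stand-in for Term: search-based rules only."""

    def __init__(self, *patterns):
        self.matching_rules = [re.compile(p, re.IGNORECASE).search for p in patterns]
        self.forbid_rules = []

    def must_not_contain(self, pattern):
        self.forbid_rules.append(re.compile(pattern, re.IGNORECASE).search)

    def is_alias(self, s):
        # Accept s if no forbidding rule fires and at least one matching rule does.
        return (not any(f(s) for f in self.forbid_rules)
                and any(m(s) for m in self.matching_rules))


def find_model(title, models):
    """Return the name of the first model whose rules accept the title, else None."""
    for name, term in models.items():
        if term.is_alias(title):
            return name
    return None


dmax = MiniTerm(r"d[ \-._']?max")
dmax.must_not_contain(r'skoda')
models = {'dmax': dmax, 'octavia': MiniTerm(r'octavia')}

print(find_model('Nissan D-Max', models))    # dmax
print(find_model('Skoda octavia', models))   # octavia
print(find_model('Nissan Qashqai', models))  # None
```

With the full `Term` class, the dict values would simply be `Term` instances, and `term.is_alias(title)` could also be written as `term == title`.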