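For a Django application, I need to turn every occurrence of a pattern in a string into a link, provided the match corresponds to a resource in my database.
Right now, the process is:
- I use re.sub to process a very long string of text
- When re.sub finds a pattern match, it runs a function that looks up whether that pattern matches an entry in the database
- If there is a match, it wraps a link around the matched text
The problem is that this sometimes means hundreds of hits on the database. What I'd like to do instead is a single bulk query to the database.
So: can you do a bulk find and replace using regular expressions in Python?
For reference, here's the code (for the curious, the patterns I'm looking up are legal citations):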

    import re

    # Citation is the asker's Django model; its import path is not given in the post.

    def add_linked_citations(text):
        linked_text = re.sub(r'(?P<volume>[0-9]+[a-zA-Z]{0,3})\s+(?P<reporter>[A-Z][a-zA-Z0-9\.\s]{1,49}?)\s+(?P<page>[0-9]+[a-zA-Z]{0,3})', create_citation_link, text)
        return linked_text

    def create_citation_link(match_object):
        volume = None
        reporter = None
        page = None
        if match_object.group("volume") not in [None, '']:
            volume = match_object.group("volume")
        if match_object.group("reporter") not in [None, '']:
            reporter = match_object.group("reporter")
        if match_object.group("page") not in [None, '']:
            page = match_object.group("page")
        if volume and reporter and page:  # These should all be here...
            # !!! Here's where I keep hitting the database
            citations = Citation.objects.filter(volume=volume, reporter=reporter, page=page)
            if citations.exists():
                citation = citations[0]
                document = citation.document
                url = document.url()
                return '<a href="%s">%s %s %s</a>' % (url, volume, reporter, page)
        # No usable groups or no matching citation: leave the matched text unchanged
        return match_object.group(0)

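Answer 0 (score: 1)
Apologies if this is obviously wrong (it's worrying that nobody has suggested it in 4 hours!), but why not search for all the matches, do one bulk query for everything (easy once you have all the matches), and then call sub with a dictionary of the results (so the function pulls its data from the dict)?
You have to run the regexp twice, but it seems the database access is the expensive part.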
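A minimal sketch of that two-pass idea, so it can be tried without a database. The names `FAKE_DB` and `bulk_lookup` and the URL are hypothetical stand-ins, not part of the question; in the real application `bulk_lookup` would be the place for the single Django query.

```python
import re

CITATION_RE = re.compile(
    r'(?P<volume>[0-9]+[a-zA-Z]{0,3})\s+'
    r'(?P<reporter>[A-Z][a-zA-Z0-9\.\s]{1,49}?)\s+'
    r'(?P<page>[0-9]+[a-zA-Z]{0,3})'
)

# Hypothetical stand-in for the database: (volume, reporter, page) -> URL
FAKE_DB = {('410', 'U.S.', '113'): '/documents/1/'}


def bulk_lookup(triplets):
    """Pretend bulk query: return {triplet: url} for the triplets that exist."""
    return {t: FAKE_DB[t] for t in triplets if t in FAKE_DB}


def add_linked_citations(text):
    # Pass 1: collect every distinct (volume, reporter, page) triplet
    triplets = {m.group('volume', 'reporter', 'page')
                for m in CITATION_RE.finditer(text)}
    urls = bulk_lookup(triplets)  # one lookup for the whole text

    # Pass 2: substitute, reading URLs from the dict instead of the database
    def link(m):
        url = urls.get(m.group('volume', 'reporter', 'page'))
        if url is None:
            return m.group(0)  # unknown citation: leave the text as it was
        return '<a href="%s">%s</a>' % (url, m.group(0))

    return CITATION_RE.sub(link, text)
```

The replacement function never touches the database, so the cost of the second regexp pass is only the matching itself.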
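Answer 1 (score: 1)
You can do this in a single regexp pass by using finditer, which returns match objects.
The match object has:
- groupdict(): a dict of the named groups
- span(): the start and end positions of the match in the original text
- group(): the original matching text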
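A quick illustration of those three accessors. The pattern is a simplified version of the citation regex and the sample text is made up for the example.

```python
import re

# Simplified citation pattern, for illustration only
pattern = re.compile(
    r'(?P<volume>[0-9]+)\s+(?P<reporter>[A-Z][a-zA-Z\.]+)\s+(?P<page>[0-9]+)'
)
text = 'See 410 U.S. 113.'

m = next(pattern.finditer(text))  # first match object

groups = m.groupdict()   # named groups as a dict
start, end = m.span()    # offsets of the match within text
matched = m.group()      # the full matched string

print(groups)
print(start, end)
print(matched)
```

Because span() gives offsets into the original string, text[start:end] is the same string as group(), which is what makes splicing links into the text possible.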
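So I would suggest that you:
- Make a list of all the matches in your text using finditer
- Make a list of all the unique volume, reporter, page triplets in the matches
- Look up those triplets
- Correlate each match object with the result of its triplet lookup, if one was found
- Process the original text, splitting it at the match spans and interpolating the lookup results
I've implemented the database lookup by combining a list of Q(volume=foo1,reporter=bar2,page=baz3)|Q(volume=foo1,reporter=bar2,page=baz3)... into a single query. There may be more efficient approaches.
Here is an untested implementation: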

    import re
    from collections import namedtuple
    from functools import reduce

    from django.db.models import Q

    # Citation is the asker's Django model; its import path is not given in the post.

    Triplet = namedtuple('Triplet', ['volume', 'reporter', 'page'])

    def lookup_references(matches):
        match_to_triplet = {}
        triplet_to_url = {}
        for m in matches:
            group_dict = m.groupdict()
            if any(not x for x in group_dict.values()):  # Filter out matches we don't want to look up
                continue
            match_to_triplet[m] = Triplet(**group_dict)
        # Build query
        unique_triplets = set(match_to_triplet.values())
        if not unique_triplets:
            return {}  # Nothing to look up; reduce() would fail on an empty list
        # List of Q objects
        q_list = [Q(**trip._asdict()) for trip in unique_triplets]
        # Consolidated Q: Q(...) | Q(...) | ...
        single_q = reduce(Q.__or__, q_list)
        # One query for all citations. select_related assumes document is a
        # foreign key, as the question's citation.document suggests.
        for citation in Citation.objects.filter(single_q).select_related('document'):
            triplet = Triplet(citation.volume, citation.reporter, citation.page)
            triplet_to_url[triplet] = citation.document.url()
        # Now pair original match objects with URL where found
        lookups = {}
        for match, triplet in match_to_triplet.items():
            if triplet in triplet_to_url:
                lookups[match] = triplet_to_url[triplet]
        return lookups

    def interpolate_citation_matches(text, matches, lookups):
        result = []
        prev = 0  # End of the previous match
        for m in matches:
            m_start, m_end = m.span()
            if prev != m_start:
                result.append(text[prev:m_start])
            # Now check match
            if m in lookups:
                result.append('<a href="%s">%s</a>' % (lookups[m], m.group()))
            else:
                result.append(m.group())
            prev = m_end
        # Text after the last match (the whole text if there were no matches)
        if prev != len(text):
            result.append(text[prev:])
        return ''.join(result)

    def process_citations(text):
        citation_regex = r'(?P<volume>[0-9]+[a-zA-Z]{0,3})\s+(?P<reporter>[A-Z][a-zA-Z0-9\.\s]{1,49}?)\s+(?P<page>[0-9]+[a-zA-Z]{0,3})'
        matches = list(re.finditer(citation_regex, text))
        lookups = lookup_references(matches)
        new_text = interpolate_citation_matches(text, matches, lookups)
        return new_text
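
To exercise the span-splicing step without Django, here is a compact, self-contained sketch of the same single-pass flow. The `lookup` callable, `fake_lookup` and its URL are hypothetical stand-ins for the bulk query, not the answer's actual code.

```python
import re

CITATION_RE = re.compile(
    r'(?P<volume>[0-9]+[a-zA-Z]{0,3})\s+'
    r'(?P<reporter>[A-Z][a-zA-Z0-9\.\s]{1,49}?)\s+'
    r'(?P<page>[0-9]+[a-zA-Z]{0,3})'
)


def link_citations(text, lookup):
    """lookup takes a set of (volume, reporter, page) tuples and
    returns {triplet: url} for the ones it knows about."""
    matches = list(CITATION_RE.finditer(text))  # the only regexp pass
    urls = lookup({m.group('volume', 'reporter', 'page') for m in matches})

    parts, prev = [], 0
    for m in matches:
        start, end = m.span()
        parts.append(text[prev:start])  # plain text before this match
        url = urls.get(m.group('volume', 'reporter', 'page'))
        if url is None:
            parts.append(m.group())
        else:
            parts.append('<a href="%s">%s</a>' % (url, m.group()))
        prev = end
    parts.append(text[prev:])  # plain text after the last match
    return ''.join(parts)


def fake_lookup(triplets):
    # Hypothetical data standing in for the single database query
    known = {('347', 'U.S.', '483'): '/documents/2/'}
    return {t: known[t] for t in triplets if t in known}
```

Calling link_citations(text, fake_lookup) returns the text with only the known citation wrapped in a link; swapping fake_lookup for a function that runs the combined Q query gives the database-backed version.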