Given a string and a list of substrings that should be replaced with placeholders, e.g.
import re
from copy import copy
phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
The first goal is to replace every substring of original_text that appears in phrases with an indexed placeholder, e.g.:
text = copy(original_text)
backplacement = {}
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)
[OUT]:
Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen
Then there are some functions that manipulate text while the placeholders are in place, e.g.
cleaned_text = func('Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen')
print(cleaned_text)
[OUT]:
MWEPHRASE0 ik MWEPHRASE1 MWEPHRASE2
The last step is to do the replacement in the reverse direction and put the original phrases back, i.e.
' '.join([backplacement[tok] if tok in backplacement else tok for tok in cleaned_text.split()])
[OUT]:
"'s_morgen ik 's-Hertogenbosch depository_financial_institution"
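The problem is that when the list of substrings in phrases is huge, the first replacement and the final back-replacement take a very long time. Is there a way to do the replacement and back-replacement with a single regex?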
re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text)
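Using the regex substitution above is not particularly helpful, especially when the substrings in phrases do not match full words, e.g.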
phrases = ["org", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
backplacement = {}
text = copy(original_text)
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)
we get an awkward output:
Something, 's mMWEPHRASE0en, ik MWEPHRASE1 im das MWEPHRASE2 gehen
I've tried using '\b{}\b'.format(phrase), but that does not work for phrases that begin or end with punctuation, i.e.
phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
backplacement = {}
text = copy(original_text)
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"\b{}\b".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)
[OUT]:
Something, 's morgen, ik 's-Hertogenbosch im das MWEPHRASE2 gehen
Is there some way to denote the word boundaries of the phrases in the re.sub regex pattern?
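A quick check of why `\b` behaves this way, offered as a sketch rather than as part of the original question: `\b` only matches between a word character and a non-word character, and both the space and the apostrophe in front of `'s morgen` are non-word characters. Lookarounds such as `(?<!\w)` and `(?!\w)` only require that no word character is adjacent, so they also work for phrases that start with punctuation. The helper names below are hypothetical.

```python
import re

text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"

def find_with_word_boundary(phrase, text):
    # \b needs a word character on exactly one side of the position
    return re.search(r"\b" + re.escape(phrase) + r"\b", text)

def find_with_lookarounds(phrase, text):
    # only requires that the phrase is not glued to a word character
    return re.search(r"(?<!\w)" + re.escape(phrase) + r"(?!\w)", text)

print(find_with_word_boundary("'s morgen", text))                         # None
print(find_with_word_boundary("depository financial institution", text))  # a match object
print(find_with_lookarounds("'s morgen", text))                           # a match object
print(find_with_lookarounds("org", text))                                 # None, "org" sits inside "morgen"
```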
Answer 0 (score: 2)
Instead of using re.sub, you can split the string.
def do_something_with_str(string):
    # do something with string here.
    # for example let's wrap the string with "@" symbol if it's not empty
    return f"@{string}" if string else string

def get_replaced_list(string, words):
    result = [(string, True), ]
    # we take each word we want to replace
    for w in words:
        new_result = []
        # Getting each word in old result
        for r in result:
            # Now we split every string in results using our word.
            split_list = list((x, True) for x in r[0].split(w)) if r[1] else list([r, ])
            # If we replace successfully - add all the strings
            if len(split_list) > 1:
                # This one would be for [text, replaced, text, replaced...]
                sub_result = []
                ws = [(w, False), ] * (len(split_list) - 1)
                for x, replaced in zip(split_list, ws):
                    sub_result.append(x)
                    sub_result.append(replaced)
                sub_result.append(split_list[-1])
                # Add to new result
                new_result.extend(sub_result)
            # If not - just add it to results
            else:
                new_result.extend(split_list)
        result = new_result
    return result

if __name__ == '__main__':
    initial_string = 'acbbcbbcacbbcbbcacbbcbbca'
    words_to_replace = ('a', 'c')
    replaced_list = get_replaced_list(initial_string, words_to_replace)
    modified_list = [(do_something_with_str(x[0]), True) if x[1] else x for x in replaced_list]
    final_string = ''.join([x[0] for x in modified_list])
The variable values for the example above:
initial_string = 'acbbcbbcacbbcbbcacbbcbbca'
words_to_replace = ('a', 'c')
replaced_list = [('', True), ('a', False), ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False), ('', True)]
modified_list = [('', True), ('a', False), ('', True), ('c', False), ('@bb', True), ('c', False), ('@bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('@bb', True), ('c', False), ('@bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('@bb', True), ('c', False), ('@bb', True), ('c', False), ('', True), ('a', False), ('', True)]
final_string = 'ac@bbc@bbcac@bbc@bbcac@bbc@bbca'
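As you can see, the list contains tuples. Each tuple holds two values: a string, and a boolean that says whether the string is plain text or a replaced value (True for text).
Once you have the replaced list, you can modify it as in the example, checking whether each item is a text value (if x[1] == True).
Hope that helps!
P.S. String formatting like f"some string here {some_variable_here}" requires Python 3.6 or later.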
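The same [text, replaced, text, replaced, ...] segmentation can probably be obtained in one pass with re.split and a capturing group, which keeps the delimiters in the result. This is only a sketch under a few assumptions: `split_keep_words` is a hypothetical helper, `words` is non-empty, and the words do not overlap each other (with overlapping words the leftmost match wins, which may differ from the word-by-word loop above).

```python
import re

def split_keep_words(string, words):
    # Longest words first, so a short word cannot shadow a longer one in the alternation
    alternation = "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
    # A capturing group makes re.split return the delimiters as well:
    # even indexes are text, odd indexes are the matched words
    parts = re.split("(" + alternation + ")", string)
    return [(part, index % 2 == 0) for index, part in enumerate(parts)]

segments = split_keep_words('acbbcbbcacbbcbbcacbbcbbca', ('a', 'c'))
# Wrap non-empty text segments with "@", leave the matched words untouched
final_string = ''.join("@" + s if is_text and s else s for s, is_text in segments)
print(final_string)  # ac@bbc@bbcac@bbc@bbcac@bbc@bbca
```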
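Answer 1 (score: 2)
I think there are two keys to using regex for this task:
Use custom boundaries, capture them, and substitute them back along with the phrase.
Use a function to process the substitution matches in both directions.
Below is an implementation that uses this approach. I've tweaked your text slightly to repeat one of the phrases.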
import re
from copy import copy
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen 's morgen"
text = copy(original_text)
#
# The phrases of interest
#
phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
#
# Create the mapping dictionaries
#
phrase_to_mwe = {}
mwe_to_phrase = {}
#
# Build the mappings
#
for i, phrase in enumerate(phrases):
    mwephrase = "MWEPHRASE{}".format(i)
    mwe_to_phrase[mwephrase] = phrase.replace(' ', '_')
    phrase_to_mwe[phrase] = mwephrase
#
# Regex match handlers
#
def handle_forward(match):
    b1 = match.group(1)
    phrase = match.group(2)
    b2 = match.group(3)
    return b1 + phrase_to_mwe[phrase] + b2
def handle_backward(match):
    return mwe_to_phrase[match.group(1)]
#
# The forward regex will look like:
#
# (^|[ ])('s morgen|'s-Hertogenbosch|depository financial institution)([, ]|$)
#
# which captures three components:
#
# (1) Front boundary
# (2) Phrase
# (3) Back boundary
#
# Anchors allow matching at the beginning and end of the text. Additional boundary characters can be
# added as necessary, e.g. to allow semicolons after a phrase, we could update the back boundary to:
#
# ([,; ]|$)
#
regex_forward = re.compile(r'(^|[ ])(' + '|'.join(phrases) + r')([, ]|$)')
regex_backward = re.compile(r'(MWEPHRASE\d+)')
#
# Pretend we cleaned the text in the middle
#
cleaned = 'MWEPHRASE0 ik MWEPHRASE1 MWEPHRASE2 MWEPHRASE0'
#
# Do the translations
#
text1 = regex_forward .sub(handle_forward, text)
text2 = regex_backward.sub(handle_backward, cleaned)
print('original: {}'.format(original_text))
print('text1 : {}'.format(text1))
print('text2 : {}'.format(text2))
Running this produces:
original: Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen 's morgen
text1 : Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen MWEPHRASE0
text2 : 's_morgen ik 's-Hertogenbosch depository_financial_institution 's_morgen
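Two details may matter with a large, arbitrary phrase list, and the following sketch illustrates one possible way to handle them. First, phrases can contain regex metacharacters, so passing each one through re.escape is safer. Second, because the boundaries above are consumed by the match, two phrases separated by a single space cannot both match; non-consuming lookarounds avoid that. `make_forward` is a hypothetical helper, and the boundary definitions (whitespace or start in front, comma, whitespace, or end behind) are assumptions that mirror the answer's character classes.

```python
import re

def make_forward(phrases):
    phrase_to_mwe = {p: "MWEPHRASE{}".format(i) for i, p in enumerate(phrases)}
    # Escape metacharacters; try longer phrases first in case one phrase is a prefix of another
    alternation = "|".join(re.escape(p) for p in sorted(phrases, key=len, reverse=True))
    # (?<!\S)      : preceded by whitespace or the start of the text, nothing consumed
    # (?=[,\s]|$)  : followed by a comma, whitespace, or the end of the text, nothing consumed
    regex = re.compile(r"(?<!\S)(" + alternation + r")(?=[,\s]|$)")
    return lambda text: regex.sub(lambda m: phrase_to_mwe[m.group(1)], text)

phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
forward = make_forward(phrases)
print(forward("'s morgen 's-Hertogenbosch, depository financial institution"))
# MWEPHRASE0 MWEPHRASE1, MWEPHRASE2
```

Since the boundaries are not part of the match, the handler no longer needs to stitch them back around the placeholder.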
Answer 2 (score: 1)
Here is a strategy you could use:
phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
# need this module for the reduce function
import functools as fn
#convert phrases into a dictionary of numbered placeholders (tokens)
tokens = { kw:"MWEPHRASE%s"%i for i,kw in enumerate(phrases) }
#replace embedded phrases with their respective token
tokenized = fn.reduce(lambda s,kw: tokens[kw].join(s.split(kw)), phrases, original_text)
#Apply text cleaning logic on the tokenized text
#(this assumes the placeholders are left untouched,
#although it's ok to move them around)
cleaned_text = cleanUpfunction(tokenized)
#reverse the token dictionary (to map numbered placeholders back to original phrases)
unTokens = {v:k for k,v in tokens.items() }
#rebuild phrases with original text associated to each token (placeholder);
#iterate over the placeholders, longest first, so MWEPHRASE1 cannot clobber MWEPHRASE10
final_text = fn.reduce(lambda s,kw: unTokens[kw].join(s.split(kw)), sorted(unTokens, key=len, reverse=True), cleaned_text)
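For a version of this strategy that runs end to end, here is a minimal sketch. It uses str.replace, which does the same thing as the split/join idiom, and a hypothetical `clean_up` stub that stands in for the real cleaning function (it only keeps placeholders and the word "ik", purely for illustration). Placeholders are restored longest name first, an assumption made to guard against MWEPHRASE1 matching inside MWEPHRASE10 when there are more than ten phrases.

```python
import functools

phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"

tokens = {kw: "MWEPHRASE{}".format(i) for i, kw in enumerate(phrases)}

def clean_up(text):
    # Hypothetical stand-in for the real cleaning step
    words = (tok.strip(",") for tok in text.split())
    return " ".join(w for w in words if w.startswith("MWEPHRASE") or w == "ik")

# Forward pass: phrase -> placeholder
tokenized = functools.reduce(lambda s, kw: s.replace(kw, tokens[kw]), phrases, original_text)
cleaned = clean_up(tokenized)

# Backward pass: placeholder -> phrase, longest placeholder names first
restore_order = sorted(tokens.items(), key=lambda item: len(item[1]), reverse=True)
final_text = functools.reduce(lambda s, item: s.replace(item[1], item[0]), restore_order, cleaned)
print(final_text)  # 's morgen ik 's-Hertogenbosch depository financial institution
```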
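Answer 3 (score: 1)
What you are looking for is called "multi-string search" or "multi-pattern search". The most common solutions are the Aho-Corasick and Rabin-Karp algorithms. If you want to implement it yourself, go for Rabin-Karp, since it is easier to grasp. Otherwise, you'll find libraries that do it. Here is a solution with the library https://pypi.python.org/pypi/py_aho_corasick.
Let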
phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
And, for testing purposes:
def clean(text):
    """A simple stub"""
    assert text == 'Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen'
    return "MWEPHRASE0 ik MWEPHRASE1 MWEPHRASE2"
Now you have to define two automata, one for the outward trip and one for the return. An automaton is defined by a list of (key, value) pairs:
fore_automaton = py_aho_corasick.Automaton([(phrase,"MWEPHRASE{}".format(i)) for i, phrase in enumerate(phrases)])
back_automaton = py_aho_corasick.Automaton([("MWEPHRASE{}".format(i), phrase.replace(' ','_')) for i, phrase in enumerate(phrases)])
The automaton scans the text and returns a list of matches. A match is a triple (position, key, value). With a little work on the matches, you can replace the keys by the values:
def process(automaton, text):
    """Returns a new text, with keys of the automaton replaced by values"""
    matches = automaton.get_keywords_found(text.lower()) # text.lower() because automaton of py_aho_corasick uses lowercase for keys
    bk_value_eks = [(i,v,i+len(k)) for i,k,v in matches] # (begin of key, value, end of key)
    chunks = [bk_value_ek1[1]+text[bk_value_ek1[2]:bk_value_ek2[0]] for bk_value_ek1,bk_value_ek2 in zip([(-1,"",0)]+bk_value_eks, bk_value_eks+[(len(text),"",-1)]) if bk_value_ek1[2] <= bk_value_ek2[0]] # see below
    return "".join(chunks)
A brief explanation of chunks = [bk_value_ek1[1]+text[bk_value_ek1[2]:bk_value_ek2[0]] for bk_value_ek1,bk_value_ek2 in zip([(-1,"",0)]+bk_value_eks, bk_value_eks+[(len(text),"",-1)]) if bk_value_ek1[2] <= bk_value_ek2[0]].
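I zip the matches with themselves almost as usual: zip(arr, arr[1:]) yields (arr[0], arr[1]), (arr[1], arr[2]), ..., so that each match is considered together with its successor. Here I add two sentinels to handle the beginning and the end of the matches.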
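Each chunk consists of the replacement value (bk_value_ek1[1]) followed by the text between the end of that key and the beginning of the next key (text[bk_value_ek1[2]:bk_value_ek2[0]]). What happens when keys overlap? Take an example: text="abcdef", phrases={"bcd":"1", "cde":"2"}. You have two matches: (1, "bcd", "1") and (2, "cde", "2").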
Let's go: bk_value_eks = [(1, "1", 4), (2, "2", 5)]. Thus, without the if bk_value_ek1[2] <= bk_value_ek2[0], the text would be replaced by text[:1]+"1"+text[4:2]+"2"+text[5:], that is "a"+"1"+""+"2"+"f" = "a12f", instead of "a1ef" (which ignores the second match)...
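One way to make the "keep the first match, ignore any later match that overlaps it" rule explicit is to track the end of the last accepted match. The sketch below does not call py_aho_corasick; it only assumes that matches arrive as (position, key, value) triples as described above, and `replace_matches` is a hypothetical helper name.

```python
def replace_matches(text, matches):
    """Replace each matched key by its value, skipping matches that overlap an accepted one."""
    chunks = []
    cursor = 0  # end of the last accepted match
    for start, key, value in sorted(matches):
        if start < cursor:
            continue  # overlaps the previously accepted match, ignore it
        chunks.append(text[cursor:start])  # untouched text before this match
        chunks.append(value)               # replacement for the key
        cursor = start + len(key)
    chunks.append(text[cursor:])           # tail after the last match
    return "".join(chunks)

print(replace_matches("abcdef", [(1, "bcd", "1"), (2, "cde", "2")]))  # a1ef
```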
Now, look at the result:
print(process(back_automaton, clean(process(fore_automaton, original_text))))
# "'s_morgen ik 's-Hertogenbosch depository_financial_institution"
You don't have to define a new process function for the return trip; just pass it back_automaton and it does the job.
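For readers who would rather hand-roll the search with Rabin-Karp, as mentioned at the start of this answer, a multi-pattern version might look roughly like the sketch below. It is an illustration under stated assumptions, not a tuned implementation: patterns are grouped by length, each group gets one rolling-hash scan of the text, and every hash hit is verified against the actual substring. The function name and the `base` and `mod` defaults are arbitrary choices.

```python
def rabin_karp_multi(text, patterns, base=256, mod=(1 << 61) - 1):
    """Return a sorted list of (position, pattern) for every occurrence of every pattern."""
    def poly_hash(s):
        h = 0
        for ch in s:
            h = (h * base + ord(ch)) % mod
        return h

    # Group patterns by length: one rolling-hash scan per distinct length
    by_length = {}
    for p in patterns:
        if p:
            by_length.setdefault(len(p), set()).add(p)

    found = []
    for m, group in by_length.items():
        if m > len(text):
            continue
        targets = {}
        for p in group:
            targets.setdefault(poly_hash(p), set()).add(p)
        high = pow(base, m - 1, mod)  # weight of the leading character in the window
        h = poly_hash(text[:m])
        for i in range(len(text) - m + 1):
            if h in targets:
                window = text[i:i + m]
                if window in targets[h]:  # verify, since hashes can collide
                    found.append((i, window))
            if i + m < len(text):
                # Slide the window: drop text[i], append text[i + m]
                h = ((h - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return sorted(found)

print(rabin_karp_multi("abcdef", ["bcd", "cde", "zz"]))  # [(1, 'bcd'), (2, 'cde')]
```

The (position, pattern) pairs can then be turned into replacements in the same way as the automaton's matches.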