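I'm trying to match and mark character-based n-grams. The string
txt = "how does this work"
is matched against the n-grams in the list
ngrams = ["ow ", "his", "s w"]
and each match is marked with <>, but only when no bracket is already open. The output I'm looking for with this string is h<ow >does t<his w>ork
(note the double match in the second part, which should still get only one pair of brackets).
The for loop I tried doesn't produce the desired output: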
switch = False
for i in txt:
    if i in "".join(ngrams) and switch == False:
        txt = txt.replace(i, "<" + i)
        switch = True
    if i not in "".join(ngrams) and switch == True:
        txt = txt.replace(i, ">" + i)
        switch = False
print(txt)
Any help is greatly appreciated.
Answer 0 (score: 2)
This should work:
else
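Answer 1 (score: 2)
This solution uses the str.find method to find every copy of each ngram in the txt string, saving the indices of each copy into the indices set so that overlapping matches are easy to handle. The set holds every index of a match except its last character, because the closing bracket is appended right after the first character whose index is not in the set.
We then copy txt, char by char, to the result list, inserting angle brackets where needed. This strategy is more efficient than inserting the angle brackets with multiple .replace calls, because each .replace call has to rebuild the whole string.
I've extended your data slightly to show that my code handles multiple copies of an ngram.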
txt = "how does this work now chisolm"
ngrams = ["ow ", "his", "s w"]
print(txt)
print(ngrams)

# Search for all copies of each ngram in txt
# saving the indices where the ngrams occur
indices = set()
for s in ngrams:
    slen = len(s)
    lo = 0
    while True:
        i = txt.find(s, lo)
        if i == -1:
            break
        lo = i + slen
        print(s, i)
        indices.update(range(i, lo-1))
print(indices)

# Copy the txt to result, inserting angle brackets
# to show matches
switch = True
result = []
for i, u in enumerate(txt):
    if switch:
        if i in indices:
            result.append('<')
            switch = False
        result.append(u)
    else:
        result.append(u)
        if i not in indices:
            result.append('>')
            switch = True
print(''.join(result))
Output
how does this work now chisolm
['ow ', 'his', 's w']
ow 1
ow 20
his 10
his 24
s w 12
{1, 2, 10, 11, 12, 13, 20, 21, 24, 25}
h<ow >does t<his w>ork n<ow >c<his>olm
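The same result can also be reached by recording each match as a (start, end) span and merging overlapping spans before the brackets are inserted. The following is only a minimal sketch of that idea, not the code from this answer: the helper names find_spans and mark_ngrams are hypothetical, and unlike the loop above the search advances one character at a time so that self-overlapping matches are found too. Spans that merely touch are left as separate groups.

```python
def find_spans(txt, ngrams):
    """Return the sorted (start, end) spans of every ngram occurrence in txt."""
    spans = []
    for s in ngrams:
        if not s:
            continue  # an empty ngram would match everywhere
        i = txt.find(s)
        while i != -1:
            spans.append((i, i + len(s)))
            i = txt.find(s, i + 1)  # step by 1 to catch overlapping copies
    return sorted(spans)


def mark_ngrams(txt, ngrams):
    """Wrap each run of overlapping matches in a single <...> pair."""
    merged = []
    for start, end in find_spans(txt, ngrams):
        if merged and start < merged[-1][1]:
            # Overlaps the previous span: extend it instead of opening a new one
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    out = []
    pos = 0  # first index of txt that has not been copied yet
    for start, end in merged:
        out.append(txt[pos:start])
        out.append('<' + txt[start:end] + '>')
        pos = end
    out.append(txt[pos:])
    return ''.join(out)


print(mark_ngrams("how does this work now chisolm", ["ow ", "his", "s w"]))
```

Because the text is sliced once per merged span rather than copied character by character, the bracket logic needs no switch flag; whether that is clearer than the index-set version is a matter of taste.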
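If you want to merge adjacent groups, that's easy to do with the str.replace method. For it to work properly, though, we first need to preprocess the original data so that every run of whitespace becomes a single space. A simple way to do that is to split the data and re-join it.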
txt = "how does this\nwork now chisolm hisow"
ngrams = ["ow", "his", "work"]

# Convert all whitespace to single spaces
txt = ' '.join(txt.split())
print(txt)
print(ngrams)

# Search for all copies of each ngram in txt
# saving the indices where the ngrams occur
indices = set()
for s in ngrams:
    slen = len(s)
    lo = 0
    while True:
        i = txt.find(s, lo)
        if i == -1:
            break
        lo = i + slen
        print(s, i)
        indices.update(range(i, lo-1))
print(indices)

# Copy the txt to result, inserting angle brackets
# to show matches
switch = True
result = []
for i, u in enumerate(txt):
    if switch:
        if i in indices:
            result.append('<')
            switch = False
        result.append(u)
    else:
        result.append(u)
        if i not in indices:
            result.append('>')
            switch = True

# Convert the list to a single string
output = ''.join(result)

# Merge adjacent groups
output = output.replace('> <', ' ').replace('><', '')
print(output)
Output
how does this work now chisolm hisow
['ow', 'his', 'work']
ow 1
ow 20
ow 34
his 10
his 24
his 31
work 14
{32, 1, 34, 10, 11, 14, 15, 16, 20, 24, 25, 31}
h<ow> does t<his work> n<ow> c<his>olm <hisow>
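To see why the whitespace has to be normalized before the merge, it may help to look at the two steps in isolation. This is a small sketch with hypothetical helper names (normalize_whitespace and merge_adjacent are not part of the code above); it only repackages the split/join and replace calls already shown.

```python
def normalize_whitespace(txt):
    """Collapse every run of whitespace (spaces, tabs, newlines) to one space."""
    return ' '.join(txt.split())


def merge_adjacent(marked):
    """Fuse bracketed groups that touch or are separated by a single space."""
    return marked.replace('> <', ' ').replace('><', '')


print(normalize_whitespace("how does this\nwork  now"))
print(merge_adjacent("t<his> <work> now c<his><ow>"))
# Without normalization the newline keeps the two groups apart:
print(merge_adjacent("t<his>\n<work>"))
```

The last call leaves its input unchanged, since neither '> <' nor '><' occurs when the groups are separated by a newline.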
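Answer 2 (score: 1)
Here's a method that uses only one for loop. I timed it, and it's about as fast as the other answers to this question. I think it's a bit clearer, although that may be because I wrote it.
I iterate over the index of the first character of each candidate n-gram, and when there is a match I use a bunch of if-else clauses to decide whether a < or a > should be added at that point. I build the output string by appending pieces of the original txt to its end, so I'm never really inserting into the middle of a string.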
txt = "how does this work"
ngrams = set(["ow ", "his", "s w"])
n = 3  # every n-gram is assumed to have this length
prev = -n  # start index of the most recent match
output = ''
shift = 0
open = False
for i in range(len(txt) - n + 1):
    ngram = txt[i:i + n]
    if ngram in ngrams:
        if i - prev > n:
            # There is a gap between this match and the previous one
            if open:
                output += txt[prev:prev + n] + '>' + txt[prev + n:i] + '<'
            elif not open:
                if prev > 0:
                    output += txt[prev + n:i] + '<'
                else:
                    output += txt[:i] + '<'
                open = True
        else:
            # This match overlaps or touches the previous one
            output += txt[prev:i]
        prev = i
if open:
    output += txt[prev:prev + n] + '>' + txt[prev + n:]
print(output)
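The single-pass idea can also be wrapped in a function that tracks where the currently open group ends. The sketch below is a variation rather than the code above: the name mark_fixed_ngrams is hypothetical, it keeps the assumption that every n-gram has the same length n, and it returns the text unchanged when nothing matches.

```python
def mark_fixed_ngrams(txt, ngrams, n):
    """Mark runs of overlapping or touching length-n matches with <...>."""
    ngrams = set(ngrams)
    out = []
    pos = 0      # first index of txt not yet copied into out
    end = None   # exclusive end of the open group, or None if no group is open
    for i in range(len(txt) - n + 1):
        if txt[i:i + n] not in ngrams:
            continue
        if end is not None and i > end:
            # Gap before this match: close the open group first
            out.append(txt[pos:end] + '>')
            pos = end
            end = None
        if end is None:
            # Copy the unmatched text and open a new group
            out.append(txt[pos:i] + '<')
            pos = i
        end = i + n
    if end is not None:
        out.append(txt[pos:end] + '>')
        pos = end
    out.append(txt[pos:])
    return ''.join(out)


print(mark_fixed_ngrams("how does this work", ["ow ", "his", "s w"], 3))
```

Collecting the pieces in a list and joining once at the end avoids repeated string concatenation, although for short inputs like this one the difference is unlikely to matter.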