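I have a string made up of repeated characters. My task is to find the start index and end index of each unique character in that string. Below is my code.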
import re
x = "aaabbbbcc"
xs = set(x)
for item in xs:
    mo = re.search(item, x)
    flag = item
    m = mo.start()
    n = mo.end()
    print(flag, m, n)
Output:
a 0 1
b 3 4
c 7 8
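The end index of each character is incorrect here. I understand why this happens, but how can I pass the character to match dynamically to the regex search function? For example, if I hard-code the character in the search call, it gives the desired output: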
x = 'aabbbbccc'
xs = set(x)
mo = re.search("[b]+", x)
flag = "b"  # hard-coded here; `item` is not defined outside the loop
m = mo.start()
n = mo.end()
print(flag, m, n)
Output:
b 2 6
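The code above gives the correct result, but here I cannot pass the character to match dynamically. It would be really helpful if someone could let me know how to achieve this. Thanks in advance.
Answer 0 (score: 1)
String literal formatting to the rescue: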
import re
x = "aaabbbbcc"
xs = set(x)
for item in xs:
    # for patterns it is better to use raw strings, and format the letter into it
    mo = re.search(fr"{item}+", x)  # fr and rf both work: it is a raw formatted string literal
    flag = item
    m = mo.start()
    n = mo.end()
    print(flag, m, n)  # fix the upper limit by printing n-1
Output:
a 0 3 # you do see that the upper limit is off by 1?
b 3 7 # see above for fix
c 7 9
Your pattern does not need the [] around the letter, since it only has to match a single character.
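One caveat when formatting a character straight into the pattern: characters such as `.` or `*` have a special meaning in a regex. The sketch below is not part of the original answer; it assumes a hypothetical helper named `first_run` and escapes the character with `re.escape` before building the pattern.

```python
import re

def first_run(text, ch):
    """Return (start, end) of the first run of ch in text (end exclusive), or None."""
    # re.escape neutralises characters that are special in a regex, e.g. "." or "*"
    mo = re.search(re.escape(ch) + "+", text)
    return None if mo is None else (mo.start(), mo.end())

x = "aaa...bb"
for ch in sorted(set(x)):
    print(ch, first_run(x, ch))
```

Without the escaping, the pattern `.+` would match the whole string rather than only the run of dots.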
Without regex:
x = "aaabbbbcc"
last_ch = x[0]
start_idx = 0
# process the remainder
for idx, ch in enumerate(x[1:], 1):
    if last_ch != ch:
        # the run ended at the previous index
        print(last_ch, start_idx, idx - 1)
        last_ch = ch
        start_idx = idx
# report the final run, which the loop never prints
print(last_ch, start_idx, len(x) - 1)
Output:
a 0 2 # not off by 1
b 3 6
c 7 8
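The same run-by-run scan can also be written with `itertools.groupby` from the standard library, which groups consecutive equal items. This is an alternative technique rather than the answer's own code, and the generator name `runs` is made up for illustration. It reports inclusive end indices, like the loop above.

```python
from itertools import groupby

def runs(text):
    """Yield (char, start, end) for each run of equal characters, end inclusive."""
    start = 0
    for ch, group in groupby(text):
        length = sum(1 for _ in group)  # number of characters in this run
        yield ch, start, start + length - 1
        start += length

for ch, start, end in runs("aaabbbbcc"):
    print(ch, start, end)
```

Because it walks the string once, a character that reappears later (as in "aba") is reported as a separate run each time.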
Answer 1 (score: 1)
Judging by the output, I guess this might be another option:
import re
x = "aaabbbbcc"
xs = re.findall(r"((.)\2*)", x)
start = 0
output = ''
for item in xs:
    end = start + len(item[0])
    output += (f"{item[1]} {start} {end}\n")
    start = end
print(output)
a 0 3
b 3 7
c 7 9
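A possible variation on the same backreference idea, sketched here as an assumption rather than as this answer's method: `re.finditer` returns match objects, so the positions can be read from `start()` and `end()` instead of being accumulated by hand. The function name `run_spans` is hypothetical.

```python
import re

def run_spans(text):
    """Return (char, start, end) for each run, end exclusive as in Match.end()."""
    # (.) captures one character, \1* extends the match over its repeats
    return [(mo.group(1), mo.start(), mo.end())
            for mo in re.finditer(r"(.)\1*", text)]

for ch, start, end in run_spans("aaabbbbcc"):
    print(ch, start, end)
```

Note that `.` does not match a newline by default, so a string containing newlines would need the `re.DOTALL` flag.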
I think the `findall` approach is O(N), though you can benchmark it if you like:
import re
import time
timer_on = time.time()
for i in range(10000000):
    x = "aabbbbccc"
    xs = re.findall(r"((.)\2*)", x)
    start = 0
    output = ''
    for item in xs:
        end = start + len(item[0])
        output += (f"{item[1]} {start} {end}\n")
        start = end
timer_off = time.time()
timer_total = timer_off - timer_on
print(timer_total)
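The loop above times many repetitions on one short string, which does not show how the cost grows with input length. A rough sketch of one way to check the O(N) guess, assuming the standard `timeit` module and a hypothetical wrapper named `spans`; the input sizes are arbitrary and no timings are claimed here.

```python
import re
import timeit

def spans(x):
    """Same findall approach as above, returning tuples instead of a string."""
    result = []
    start = 0
    for whole, ch in re.findall(r"((.)\2*)", x):
        end = start + len(whole)
        result.append((ch, start, end))
        start = end
    return result

# Time inputs of growing length; roughly linear growth would support the O(N) guess.
for n in (100, 1_000, 10_000):
    text = "aabbbbccc" * n
    seconds = timeit.timeit(lambda: spans(text), number=5)
    print(len(text), f"{seconds:.4f}")
```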