Find the start and end index of each unique character in a string in Python

Time: 2019-09-21 15:03:57

Tags: regex python-3.x pyspark

I have a string of repeated characters. My task is to find the start index and end index of each unique character in that string. My code is below.

import re
x = "aaabbbbcc"
xs = set(x)
for item in xs:
     mo = re.search(item,x)
     flag = item
     m = mo.start()
     n = mo.end()
     print(flag,m,n)

Output:

a 0 1
b 3 4
c 7 8

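The end index of each character is incorrect here. I understand why this happens, but how can I pass the character to match dynamically to the regex search function? For instance, if I hard-code the character in the search function, it gives the desired output: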

x = 'aabbbbccc'
xs = set(x)
mo = re.search("[b]+",x)
flag = "b"  # hard-coded to match the pattern; `item` is not defined in this snippet
m = mo.start()
n = mo.end()
print(flag,m,n)

Output:

b 2 6

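The code above gives the correct result, but here I cannot pass the character to match dynamically. It would be really helpful if someone could tell me how to achieve this. Thanks in advance.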

2 answers:

Answer 0 (score: 1)

String literal formatting to the rescue:

import re

x = "aaabbbbcc"
xs = set(x)
for item in xs:
    # Prefer raw strings for patterns; the f prefix formats the letter into the pattern.
    mo = re.search(fr"{item}+",x)  # fr and rf both work: it is a raw formatted string literal
    flag = item
    m = mo.start()
    n = mo.end()
    print(flag,m,n)  # end() is exclusive; use n-1 for the index of the last matched character

Output:

a 0 3   # note that the upper limit is one past the last matching index
b 3 7   # use n-1 as noted above to fix it
c 7 9

Your pattern does not need [] around the letter, because it only has to match a single character.
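One caveat with formatting the character straight into the pattern: it misbehaves if the string contains regex metacharacters such as "." or "+". A minimal sketch of a safer variant, assuming a hypothetical helper named `first_run_span` (not part of the original answer), escapes the character with `re.escape` first:

```python
import re

def first_run_span(char, text):
    """Return (start, end) of the first run of `char` in `text`, or None.

    `end` is exclusive, as with Match.end(); the last matching index is end - 1.
    `first_run_span` is a hypothetical helper name used only for illustration.
    """
    # re.escape makes metacharacters such as "." or "+" match literally.
    match = re.search(f"(?:{re.escape(char)})+", text)
    return match.span() if match else None

print(first_run_span("b", "aaabbbbcc"))  # (3, 7)
print(first_run_span(".", "a...b"))      # (1, 4)
```

Without the escape, a pattern built from "." would match any character, and one built from "+" would fail to compile.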


Without regex (see note 1):

x = "aaabbbbcc"
last_ch = x[0]
start_idx = 0
# process the remainder
for idx,ch in enumerate(x[1:],1):
    if last_ch == ch:
        continue
    else:
        print(last_ch,start_idx, idx-1)
        last_ch = ch
        start_idx = idx
print(ch,start_idx,idx)

Output:

a 0 2   # not off by 1
b 3 6
c 7 8
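The same run-by-run scan can probably be written more compactly with `itertools.groupby`, which groups consecutive equal characters. The sketch below is an alternative to the loop above, not the answerer's code; `run_spans` is a hypothetical name. It also copes with empty and single-character strings, which the manual loop does not.

```python
from itertools import groupby

def run_spans(text):
    """Yield (char, start, last) for each run of identical characters.

    `last` is inclusive, matching the output of the manual loop.
    `run_spans` is a hypothetical helper name used only for illustration.
    """
    start = 0
    for char, group in groupby(text):
        length = sum(1 for _ in group)  # number of characters in this run
        yield char, start, start + length - 1
        start += length

for char, start, last in run_spans("aaabbbbcc"):
    print(char, start, last)
```

If a character occurs in several separate runs, as in "aabaa", each run is reported on its own.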

1 RegEx: And now you have 2 problems...

Answer 1 (score: 1)

Looking at the output, I guess this might be another option:

import re
x = "aaabbbbcc"
xs = re.findall(r"((.)\2*)", x)

start = 0
output = '' 
for item in xs:
    end = start + len(item[0])
    output += (f"{item[1]} {start} {end}\n")
    start = end

print(output)

Output:

a 0 3
b 3 7
c 7 9

I think it would be of order N, i.e. O(N) in the length of the string, though you can benchmark it if you like.

import re, time

timer_on = time.time()

for i in range(10000000):
    x = "aabbbbccc"
    xs = re.findall(r"((.)\2*)", x)

    start = 0
    output = '' 
    for item in xs:
        end = start + len(item[0])
        output += (f"{item[1]} {start} {end}\n")
        start = end

timer_off = time.time()

timer_total = timer_off - timer_on

print(timer_total)
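As a variation on the `findall` approach in this answer, `re.finditer` yields match objects whose `start()` and `end()` give the indices directly, so no running offset is needed. This is only a sketch: `run_spans_regex` is a hypothetical name, and the `re.DOTALL` flag is an added assumption so that runs of newlines are matched as well.

```python
import re

def run_spans_regex(text):
    """Return a list of (char, start, end) per run; `end` is exclusive.

    `run_spans_regex` is a hypothetical helper name used only for illustration.
    """
    # (.)\1* matches one character followed by any number of repeats of it.
    # re.DOTALL lets "." match newline characters as well.
    pattern = re.compile(r"(.)\1*", flags=re.DOTALL)
    return [(m.group(1), m.start(), m.end()) for m in pattern.finditer(text)]

for char, start, end in run_spans_regex("aaabbbbcc"):
    print(char, start, end)
```

Subtract 1 from `end` if an inclusive last index is wanted.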