Double every letter in a string

Time: 2013-06-03 23:45:44

Tags: python string

What is the most efficient way to double (or repeat n times) every letter of a string in Python?

"abcd" -> "aabbccdd"

"abcd" -> "aaaabbbbccccdddd"

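I have a long string that needs to be transformed this way. My current solution is a loop that performs n concatenations for every letter, and I suspect this could be done more efficiently.

6 answers:

Answer 0 (score: 10)

Use str.join: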

>>> strs = "abcd"
>>> "".join([x*2 for x in strs])
'aabbccdd'
>>> "".join([x*4 for x in strs])
'aaaabbbbccccdddd'

From the docs:

s = ""
for substring in list:
    s += substring

Use s = "".join(list) instead. The former is a very common and catastrophic mistake when building large strings.
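
The same join idiom extends from doubling to an arbitrary repeat count. A minimal sketch, where the function name repeat_each and its n parameter are illustrative and not part of the original answer:

```python
def repeat_each(s, n=2):
    """Return s with every character repeated n times."""
    # Build all the pieces first, then concatenate them once with str.join.
    return "".join([ch * n for ch in s])


print(repeat_each("abcd"))     # aabbccdd
print(repeat_each("abcd", 4))  # aaaabbbbccccdddd
```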

Answer 1 (score: 6)

Since you asked specifically about efficiency:

import re
from itertools import chain

# drewk's answer, optimized by using from_iterable instead of *
def double_chain(s):
    return ''.join(chain.from_iterable(zip(s, s)))

# Ashwini Chaudhary's answer
def double_mult(s):
    return ''.join([x*2 for x in s])

# Jon Clements' answer, but optimized to take the re.compile and *2 out of the loop.
r = re.compile('(.)')
def double_re(s):
    return r.sub(r'\1\1', s)

Now:

In [499]: %timeit double_chain('abcd')
1000000 loops, best of 3: 1.99 us per loop
In [500]: %timeit double_mult('abcd')
1000000 loops, best of 3: 1.25 us per loop
In [501]: %timeit double_re('abcd')
10000 loops, best of 3: 22.2 us per loop

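So the itertools approach is about 60% slower than the simplest approach, and the regex is slower still by an order of magnitude.

But tiny strings like this may not be representative of longer ones, so: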

In [504]: %timeit double_chain('abcd' * 10000)
100 loops, best of 3: 4.92 ms per loop
In [505]: %timeit double_mult('abcd' * 10000)
100 loops, best of 3: 5.57 ms per loop
In [506]: %timeit double_re('abcd' * 10000)
10 loops, best of 3: 91.5 ms per loop

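As expected, the itertools approach gets better (it now beats the simple approach), and the regex gets even worse as the string gets longer.

So there is no single "most efficient" way. If you are doubling billions of tiny strings, Ashwini's answer is the best. If you are doubling millions of large strings, or thousands of huge strings, drewk's is the best. And if you are doing neither... there is no reason to optimize this in the first place.

Also, the usual caveats: this test ran on 64-bit CPython 3.3.0 on my Mac with no load. There is no guarantee the same holds for the Python implementation, version, and platform in your application, with your real data. A quick test with 32-bit 2.6 showed similar results, but if it matters, you need to run a more realistic and relevant test yourself.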
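
For re-running the comparison on your own interpreter without IPython, the standard-library timeit module is enough. The following is only a sketch: the benchmark helper, the _ANY_CHAR name, and the loop counts are assumptions chosen for illustration, and the numbers it prints will differ from the ones quoted above.

```python
import re
import timeit
from itertools import chain

_ANY_CHAR = re.compile('(.)')


def double_chain(s):
    return ''.join(chain.from_iterable(zip(s, s)))


def double_mult(s):
    return ''.join([x * 2 for x in s])


def double_re(s):
    return _ANY_CHAR.sub(r'\1\1', s)


def benchmark(data, number=1000, repeat=3):
    """Return {function name: best seconds per call} for the given input."""
    results = {}
    for func in (double_chain, double_mult, double_re):
        timer = timeit.Timer(lambda: func(data))
        # Take the best of several runs, as %timeit does.
        best = min(timer.repeat(repeat=repeat, number=number))
        results[func.__name__] = best / number
    return results


if __name__ == "__main__":
    cases = (("short", "abcd", 10000), ("long", "abcd" * 10000, 10))
    for label, data, number in cases:
        for name, seconds in benchmark(data, number=number).items():
            print(f"{label:5} {name:12} {seconds * 1e6:12.2f} us per call")
```

Substituting strings that resemble your real input for the two sample cases is what makes the result meaningful.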

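Answer 2 (score: 2)

You can use join, izip and chain (this is Python 2; on Python 3, itertools.izip no longer exists and the built-in zip is already lazy):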

>>> st='abcd'
>>> from itertools import chain,izip
>>> ''.join(chain(*izip(st,st)))
'aabbccdd'

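Although this is less readable than the list comprehension, the advantage is that no intermediate list is built; izip and chain both produce iterators. Note that the * unpacking still has to exhaust izip in order to build chain's argument tuple; chain.from_iterable avoids that step.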
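
On Python 3 the same idea can be written with the built-in zip, and itertools.repeat generalizes it from doubling to n copies. A sketch under those assumptions; the function names double_zip and repeat_chain are illustrative only:

```python
from itertools import chain, repeat


def double_zip(s):
    # zip(s, s) lazily yields ('a', 'a'), ('b', 'b'), ...;
    # chain.from_iterable flattens those pairs without * unpacking.
    return "".join(chain.from_iterable(zip(s, s)))


def repeat_chain(s, n=2):
    # repeat(ch, n) yields ch exactly n times; one such iterator per character.
    return "".join(chain.from_iterable(repeat(ch, n) for ch in s))


print(double_zip("abcd"))       # aabbccdd
print(repeat_chain("abcd", 4))  # aaaabbbbccccdddd
```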

Answer 3 (score: 1)

I would go with str.join myself, so I'll offer an re alternative instead:

>>> s = "abcd"
>>> import re
>>> re.sub('(.)', r'\1' * 2, s)
'aabbccdd'
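
One caveat with the pattern above is that . does not match a newline by default, so newline characters would be left undoubled. A possible variant, assuming you want every character repeated including newlines; the names _ANY_CHAR and repeat_re are hypothetical:

```python
import re

# re.DOTALL makes "." match newline characters as well.
_ANY_CHAR = re.compile(r"(.)", re.DOTALL)


def repeat_re(s, n=2):
    # A function replacement sidesteps backslash escaping in the template.
    return _ANY_CHAR.sub(lambda m: m.group(1) * n, s)


print(repeat_re("abcd"))          # aabbccdd
print(repr(repeat_re("a\nb")))    # 'aa\n\nbb'
```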

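Answer 4 (score: 1)

Whenever the question is "what is the most efficient way to map each character of a string to something else", it turns out that str.translate is the best choice... for big enough strings: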

def double_translate(s):
    return s.translate({ord(x):2*x for x in set(s)})

Timings against the other answers:

In [5]: %timeit double_chain('abcd')
The slowest run took 11.03 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 992 ns per loop

In [6]: %timeit double_chain('mult')
The slowest run took 13.61 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 1 µs per loop

In [7]: %timeit double_mult('abcd')
The slowest run took 7.59 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 869 ns per loop

In [8]: %timeit double_re('abcd')
The slowest run took 8.63 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 9.4 µs per loop

In [9]: %timeit double_translate('abcd')
The slowest run took 5.80 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 1.78 µs per loop

In [10]: %%timeit t='abcd'*5000
    ...: double_chain(t)
    ...: 
1000 loops, best of 3: 1.66 ms per loop

In [11]: %%timeit t='abcd'*5000
    ...: double_mult(t)
    ...: 
100 loops, best of 3: 2.35 ms per loop

In [12]: %%timeit t='abcd'*5000
    ...: double_re(t)
    ...: 
10 loops, best of 3: 30 ms per loop

In [13]: %%timeit t='abcd'*5000
    ...: double_translate(t)
    ...: 
1000 loops, best of 3: 1.03 ms per loop

Note, however, that this solution has an added advantage: in some situations you can avoid rebuilding the table that is passed to translate, for example:

def double_translate_opt(s, table=None):
    if table is None:
        table = {ord(x):2*x for x in set(s)}
    return s.translate(table)

This avoids some overhead, making it faster:

In [19]: %%timeit t='abcd'; table={ord(x):2*x for x in t}
    ...: double_translate_opt(t, table)
    ...: 
The slowest run took 17.59 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 452 ns per loop

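As you can see, for small strings it is about twice as fast as the other answers, as long as you avoid building the table on every call. For long texts, the cost of building the table is repaid by the speed of translate (in those cases using set is worthwhile, to avoid calling ord many times).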
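
When the set of characters is known in advance, the table can be built once and reused across calls. A minimal sketch, assuming the input is mostly ASCII letters; make_repeat_table, DOUBLE_ASCII_LETTERS and double_ascii_letters are hypothetical names:

```python
import string


def make_repeat_table(alphabet, n=2):
    """Map the code point of each character in alphabet to that character repeated n times."""
    return {ord(ch): ch * n for ch in set(alphabet)}


# Built once at import time and reused for every call.
DOUBLE_ASCII_LETTERS = make_repeat_table(string.ascii_letters)


def double_ascii_letters(s):
    # Characters missing from the table (digits, spaces, ...) pass through unchanged.
    return s.translate(DOUBLE_ASCII_LETTERS)


print(double_ascii_letters("abcd"))   # aabbccdd
print(double_ascii_letters("ab 12"))  # aabb 12
```

The trade-off is that any character outside the prebuilt table is silently left as is, so this only matches the other answers when the table covers every character in the input.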

Answer 5 (score: 0)

def double_letter(s, n=2):
    # Repeat every character of s n times (n=2 doubles each letter).
    result = ''
    for ch in s:
        result += ch * n
    return result