Question

I have a long string and a list of [end-index, string] pairs, as follows:

long_sentence = "This is a long long long long sentence"
indices = [[6, "is"], [8, "is a"], [18, "long"], [23, "long"]]

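The element [6, "is"] means that 6 is the end index of the word "is" in the string. I want to end up with the following string:

>> print long_sentence
This .... long ......... long sentence

I tried something like this: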

temp = long_sentence
for i in indices:
    temp = temp[:int(i[0]) - len(i[1])] + '.'*(len(i[1])+1) + temp[i[0]+1:]

Although this seems to work, it takes quite a long time (more than 6 hours for 5000 strings in a 300 MB file). Is there a way to speed this up?

Answer 1

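Every time you execute the temp = temp... assignment, Python has to create a new string (because Python strings are immutable).

What you probably want to do is convert the string to a list of characters, operate on that list, and then join the list back into a string.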

long_list = list(long_sentence)
for end, repstr in indices:
    # end is the inclusive index of the last character of repstr
    long_list[end-len(repstr)+1:end+1] = ['.'] * len(repstr)
new_sentence = ''.join(long_list)

Answer 2

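I would normally focus on writing the cleanest, most readable, concise code first and optimizing second; that is the approach you have taken, bravo! Six hours does seem untenable, though, so it is time to optimize. Have you explicitly separated the time spent creating the replacement strings from the time spent generating the list of indices in the first place?

Benchmarking shows that list comprehensions, join, and a fake (in-memory) file are the fastest ways to concatenate strings. The benchmark article is fairly old, so you may want to run the benchmarks yourself to confirm the results, though they probably still hold.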
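If you want to run such a benchmark yourself, a minimal sketch could look like the one below. It assumes Python 3, a synthetic input (the sample sentence repeated 1000 times) and two hypothetical helper functions, redact_by_slicing and redact_by_list, that are not from the original posts. Timings will vary by machine, so none are quoted here.

```python
import timeit

long_sentence = "This is a long long long long sentence" * 1000

# Hypothetical workload: redact every occurrence of "long".
# Each entry is [inclusive end index, substring], as in the question.
indices = []
start = long_sentence.find("long")
while start != -1:
    indices.append([start + 3, "long"])
    start = long_sentence.find("long", start + 1)


def redact_by_slicing(text, indices):
    # Rebuilds the whole string for every replacement.
    for end, substr in indices:
        text = text[:end - len(substr) + 1] + "." * len(substr) + text[end + 1:]
    return text


def redact_by_list(text, indices):
    # Mutates a list of characters in place and joins once at the end.
    chars = list(text)
    for end, substr in indices:
        chars[end - len(substr) + 1:end + 1] = "." * len(substr)
    return "".join(chars)


# Both strategies must agree before their timings are worth comparing.
assert redact_by_slicing(long_sentence, indices) == redact_by_list(long_sentence, indices)

for func in (redact_by_slicing, redact_by_list):
    seconds = timeit.timeit(lambda: func(long_sentence, indices), number=5)
    print(func.__name__, round(seconds, 4))
```

Timing the index generation separately in the same way would show whether the replacement step is really where the six hours go.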

Answer 3

You can use the mutable standard array type for character replacement:

>>> import array

>>> long_sentence = "This is a long long long long sentence"
>>> indices = [[6, "is"], [8, "is a"], [18, "long"], [23, "long"]]

>>> temp = array.array('c', long_sentence)  # Could replace long_sentence too
>>> for end, substr in indices:
...     temp[end-len(substr)+1:end+1] = array.array('c', '.'*len(substr))
...     
>>> temp
array('c', 'This .... long .... .... long sentence')

The new string can be written to the output file with:

temp.tofile(your_file)

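(The string itself is returned by temp.tostring().)

The advantage of this approach is that it avoids creating many new strings through slicing, which takes time. Another advantage is memory efficiency: the string is updated in place (as shown by the address in temp.buffer_info(), which stays the same). As a side effect, this memory efficiency might keep your computer from swapping, which would speed things up further.

You could also speed things up by caching the '.'*len(substr) strings in a special class DotString with a custom __getitem__ method, where DotString[4] returns '....', and so on.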
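One way to read the DotString idea is sketched below. It assumes Python 3 and uses an instance (here called dots) rather than indexing the class itself. The internal _cache dictionary is an illustrative choice, not something specified in the answer, and whether the caching gives a measurable gain is untested here.

```python
class DotString:
    """Caches runs of dots so that each length is built only once."""

    def __init__(self):
        self._cache = {}

    def __getitem__(self, length):
        try:
            return self._cache[length]
        except KeyError:
            # First request for this length: build the run and remember it.
            run = self._cache[length] = "." * length
            return run


long_sentence = "This is a long long long long sentence"
indices = [[6, "is"], [8, "is a"], [18, "long"], [23, "long"]]

dots = DotString()
chars = list(long_sentence)
for end, substr in indices:
    # end is the inclusive index of the last character of substr.
    chars[end - len(substr) + 1:end + 1] = dots[len(substr)]
new_sentence = "".join(chars)
print(new_sentence)  # This .... long .... .... long sentence
```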

PS: most optimization attempts benefit from profiling first. You can profile your program with:

python -m cProfile -o stats.prof <Python program name and arguments>

You can then analyze the timings with:

python -m pstats stats.prof

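The first command you typically run is sort time (sorts functions by the time spent strictly in their own code), then stats 10 (the 10 functions that took the longest to execute).

You can do this on a truncated version of your input file so that the run does not take too long. This will tell you which functions take the most time and should be the focus of optimization.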
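The same profiling steps can also be driven from inside a script, which may be handy when only one function is of interest. The sketch below assumes Python 3. The redact function is a hypothetical stand-in for the real workload, not code from the question.

```python
import cProfile
import io
import pstats


def redact(text, indices):
    # Hypothetical stand-in for the real workload being profiled.
    chars = list(text)
    for end, substr in indices:
        chars[end - len(substr) + 1:end + 1] = "." * len(substr)
    return "".join(chars)


profiler = cProfile.Profile()
profiler.enable()
result = redact("This is a long long long long sentence",
                [[6, "is"], [8, "is a"], [18, "long"], [23, "long"]])
profiler.disable()

# Equivalent of "sort time" followed by "stats 10" in the pstats browser.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("time").print_stats(10)
print(report.getvalue())
```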

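PPS: the 'c' type used in the example above is for byte strings (typically ASCII-encoded). The 'u' type can be used to handle character strings (also known as unicode strings).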

Answer 4

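You can avoid the O(n) cost of rebuilding the string on every replacement by using a set for membership testing and str.join to combine the results:

>>> long_sentence = "This is a long long long long sentence"
>>> redacts = set()
>>> indices = [[6, "is"], [8, "is a"], [18, "long"], [23, "long"]]
>>> for end, substr in indices:
...     redacts.update(range(end-len(substr)+1, end+1))
...
>>> ''.join([('.' if i in redacts else c) for i, c in enumerate(long_sentence)])
'This .... long .... .... long sentence'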

Alternatively, you can use a bytearray, which lets you mutate the "string" in place:

>>> arr = bytearray(long_sentence)
>>> for end, substr in indices:
...     arr[end-len(substr)+1: end+1] = '.' * len(substr)
...
>>> str(arr)
'This .... long .... .... long sentence'

The latter technique only works for non-unicode strings.
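For readers on Python 3, where str is unicode and bytearray needs an explicit encoding, a rough equivalent of the bytearray approach might look like this. It assumes the text is pure ASCII, so that byte offsets and character offsets coincide. With other encodings the indices would have to be converted first.

```python
long_sentence = "This is a long long long long sentence"
indices = [[6, "is"], [8, "is a"], [18, "long"], [23, "long"]]

# Assumption: pure ASCII input, so byte offsets equal character offsets.
arr = bytearray(long_sentence, "ascii")
for end, substr in indices:
    # end is the inclusive index of the last character of substr.
    arr[end - len(substr) + 1:end + 1] = b"." * len(substr)
new_sentence = arr.decode("ascii")
print(new_sentence)  # This .... long .... .... long sentence
```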

What is an efficient way to erase substrings?
