How to avoid substring copies

Asked: 2010-12-20 00:21:36

Tags: python, regex, string

I currently process sections of a string like this:

for (i, j) in huge_list_of_indices:
    process(huge_text_block[i:j])

I want to avoid the overhead of generating these temporary substrings. Any ideas? Perhaps a wrapper that uses index offsets somehow? This is currently my bottleneck.

Note that process() is another Python module that expects a string as input.

Edit

Some people doubt there is a problem. Here are some sample results:

import time
import string
text = string.letters * 1000

def timeit(fn):
    t1 = time.time()
    for i in range(len(text)):
        fn(i)
    t2 = time.time()
    print '%s took %0.3f ms' % (fn.func_name, (t2-t1) * 1000)

def test_1(i):
    return text[i:]

def test_2(i):
    return text[:]

def test_3(i):
    return text

timeit(test_1)
timeit(test_2)
timeit(test_3)

Output:

test_1 took 972.046 ms
test_2 took 47.620 ms
test_3 took 43.457 ms

6 Answers:

Answer 0 (score: 8)

I think what you're looking for are buffers.

The point of buffers is that they "slice" objects supporting the buffer interface without copying their content: they basically open a "window" on the sliced object's content. A more technical explanation is available here. An excerpt:

Python objects implemented in C can export a group of functions called the "buffer interface." An object can use these functions to expose its data in a raw, byte-oriented format. Clients of the object can use the buffer interface to access the object's data directly, without needing to copy it first.

In your case, the code should look more or less like this:

>>> s = 'Hugely_long_string_not_to_be_copied'
>>> ij = [(0, 3), (6, 9), (12, 18)]
>>> for i, j in ij:
...     print buffer(s, i, j-i)  # Should become process(...)
Hug
_lo
string

HTH!
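One caveat worth adding: buffer() exists only in Python 2. On Python 3 the closest equivalent is memoryview, which works on bytes-like objects rather than str — a minimal sketch of the same idea:

```python
s = b'Hugely_long_string_not_to_be_copied'  # memoryview needs a bytes-like object
ij = [(0, 3), (6, 9), (12, 18)]
for i, j in ij:
    window = memoryview(s)[i:j]  # a view into s; no bytes are copied
    print(bytes(window))         # materialize only when actually needed
```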

Answer 1 (score: 3)

A wrapper using index offsets into an mmap object could work, yes.

But before you do that, are you sure that generating these substrings is actually a problem? Don't optimize before you have found out where the time and memory are really going. I wouldn't expect this to be a significant issue.

Answer 2 (score: 1)

If you are using Python 3, you can use the buffer protocol and memory views. Assuming the text is stored somewhere in the filesystem:

import os

f = open(FILENAME, 'rb')
data = bytearray(os.path.getsize(FILENAME))
f.readinto(data)

mv = memoryview(data)

for (i, j) in huge_list_of_indices:
    process(mv[i:j])

Also take a look at this article; it might be useful.
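A quick, self-contained way to convince yourself that memoryview slices are windows rather than copies (a small illustrative check, not part of the original answer): a mutation of the underlying bytearray is visible through the view.

```python
data = bytearray(b'abcdef')
mv = memoryview(data)
window = mv[1:4]          # a view of b'bcd'; no bytes are copied
data[2] = ord('X')        # mutate the underlying buffer
print(bytes(window))      # the view sees the change: b'bXd'
```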

Answer 3 (score: 0)

Maybe a wrapper that uses index offsets is indeed what you are looking for. Here is an example that does the job. You may have to add more checks on the slices (for overflow and negative indices) depending on your needs.

#!/usr/bin/env python

from collections import Sequence
from timeit import Timer

def process(s):
    return s[0], len(s)

class FakeString(Sequence):
    def __init__(self, string):
        self._string = string
        self.fake_start = 0
        self.fake_stop = len(string)

    def setFakeIndices(self, i, j):
        self.fake_start = i
        self.fake_stop = j

    def __len__(self):
        return self.fake_stop - self.fake_start

    def __getitem__(self, ii):
        if isinstance(ii, slice):
            if ii.start is None:
                start = self.fake_start
            else:
                start = ii.start + self.fake_start
            if ii.stop is None:
                stop = self.fake_stop
            else:
                stop = ii.stop + self.fake_start
            ii = slice(start,
                       stop,
                       ii.step)
        else:
            ii = ii + self.fake_start
        return self._string[ii]

def initial_method():
    r = []
    for n in xrange(1000):
        r.append(process(huge_string[1:9999999]))
    return r

def alternative_method():
    r = []
    for n in xrange(1000):
        fake_string.setFakeIndices(1, 9999999)
        r.append(process(fake_string))
    return r


if __name__ == '__main__':
    huge_string = 'ABCDEFGHIJ' * 100000
    fake_string = FakeString(huge_string)

    fake_string.setFakeIndices(5,15)
    assert fake_string[:] == huge_string[5:15]

    t = Timer(initial_method)
    print "initial_method(): %fs" % t.timeit(number=1)

    t = Timer(alternative_method)
    print "alternative_method(): %fs" % t.timeit(number=1)

Which gives:

initial_method(): 1.248001s  
alternative_method(): 0.003416s
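On Python 3 the same wrapper idea still works with minor changes (collections.Sequence moved to collections.abc and was removed from collections in 3.10) — a trimmed sketch, not the answer's original code:

```python
from collections.abc import Sequence

class FakeString(Sequence):
    """A zero-copy 'window' onto a backing string, selected by index offsets."""
    def __init__(self, string):
        self._string = string
        self.fake_start = 0
        self.fake_stop = len(string)

    def set_fake_indices(self, i, j):
        self.fake_start = i
        self.fake_stop = j

    def __len__(self):
        return self.fake_stop - self.fake_start

    def __getitem__(self, ii):
        if isinstance(ii, slice):
            start = self.fake_start if ii.start is None else ii.start + self.fake_start
            stop = self.fake_stop if ii.stop is None else ii.stop + self.fake_start
            return self._string[start:stop:ii.step]
        return self._string[ii + self.fake_start]

huge = 'ABCDEFGHIJ' * 100000
fake = FakeString(huge)
fake.set_fake_indices(5, 15)
assert fake[:] == huge[5:15]
```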

Answer 4 (score: 0)

The example the OP provides will show close to the maximum possible performance difference between slicing and not slicing.

If the processing actually does anything that takes significant time, the problem may hardly exist at all.

The fact is, the OP needs to let us know what process() does. The most likely scenario is that it does something significant, and therefore he should profile his code.

Adapted from the OP's example:

#slice_time.py

import time
import string
text = string.letters * 1000
import random
indices = range(len(text))
random.shuffle(indices)
import re


def greater_processing(a_string):
    results = re.findall('m', a_string)

def medium_processing(a_string):
    return re.search('m.*?m', a_string)                                                                              

def lesser_processing(a_string):
    return re.match('m', a_string)

def least_processing(a_string):
    return a_string

def timeit(fn, processor):
    t1 = time.time()
    for i in indices:
        fn(i, i + 1000, processor)
    t2 = time.time()
    print '%s took %0.3f ms %s' % (fn.func_name, (t2-t1) * 1000, processor.__name__)

def test_part_slice(i, j, processor):
    return processor(text[i:j])

def test_copy(i, j, processor):
    return processor(text[:])

def test_text(i, j, processor):
    return processor(text)

def test_buffer(i, j, processor):
    return processor(buffer(text, i, j - i))

if __name__ == '__main__':
    processors = [least_processing, lesser_processing, medium_processing, greater_processing]
    tests = [test_part_slice, test_copy, test_text, test_buffer]
    for processor in processors:
        for test in tests:
            timeit(test, processor)

Then running it...

In [494]: run slice_time.py
test_part_slice took 68.264 ms least_processing
test_copy took 42.988 ms least_processing
test_text took 33.075 ms least_processing
test_buffer took 76.770 ms least_processing
test_part_slice took 270.038 ms lesser_processing
test_copy took 197.681 ms lesser_processing
test_text took 196.716 ms lesser_processing
test_buffer took 262.288 ms lesser_processing
test_part_slice took 416.072 ms medium_processing
test_copy took 352.254 ms medium_processing
test_text took 337.971 ms medium_processing
test_buffer took 438.683 ms medium_processing
test_part_slice took 502.069 ms greater_processing
test_copy took 8149.231 ms greater_processing
test_text took 8292.333 ms greater_processing
test_buffer took 563.009 ms greater_processing

Notes:

Yes, I tried the OP's original test_1 with its [i:] slice, and it is much slower, making his case look even worse.

Interestingly, buffer is almost always slightly slower than slicing. This time there was one case where it did better! The real test is below, though: buffer seems to do better for larger substrings, while slicing does better for smaller ones.

And, yes, I do have some randomness in this test, so run it yourself and see the different results :). It may also be interesting to change the size of the 1000.

So, maybe some others believe you, but I don't, so I'd like to know something about what the processing does and how you came to the conclusion that "slicing is the problem."

I profiled medium_processing in my example, increased the string.letters multiplier to 100000 and raised the length of the slices to 10000. Also below is one with slices of length 100. I used cProfile (much less overhead than profile!).

test_part_slice took 77338.285 ms medium_processing
         31200019 function calls in 77.338 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   77.338   77.338 <string>:1(<module>)
        2    0.000    0.000    0.000    0.000 iostream.py:63(write)
  5200000    8.208    0.000   43.823    0.000 re.py:139(search)
  5200000    9.205    0.000   12.897    0.000 re.py:228(_compile)
  5200000    5.651    0.000   49.475    0.000 slice_time.py:15(medium_processing)
        1    7.901    7.901   77.338   77.338 slice_time.py:24(timeit)
  5200000   19.963    0.000   69.438    0.000 slice_time.py:31(test_part_slice)
        2    0.000    0.000    0.000    0.000 utf_8.py:15(decode)
        2    0.000    0.000    0.000    0.000 {_codecs.utf_8_decode}
        2    0.000    0.000    0.000    0.000 {isinstance}
        2    0.000    0.000    0.000    0.000 {method 'decode' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  5200000    3.692    0.000    3.692    0.000 {method 'get' of 'dict' objects}
  5200000   22.718    0.000   22.718    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
        2    0.000    0.000    0.000    0.000 {method 'write' of '_io.StringIO' objects}
        4    0.000    0.000    0.000    0.000 {time.time}


test_buffer took 58067.440 ms medium_processing
         31200103 function calls in 58.068 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   58.068   58.068 <string>:1(<module>)
        3    0.000    0.000    0.000    0.000 __init__.py:185(dumps)
        3    0.000    0.000    0.000    0.000 encoder.py:102(__init__)
        3    0.000    0.000    0.000    0.000 encoder.py:180(encode)
        3    0.000    0.000    0.000    0.000 encoder.py:206(iterencode)
        1    0.000    0.000    0.001    0.001 iostream.py:37(flush)
        2    0.000    0.000    0.001    0.000 iostream.py:63(write)
        1    0.000    0.000    0.000    0.000 iostream.py:86(_new_buffer)
        3    0.000    0.000    0.000    0.000 jsonapi.py:57(_squash_unicode)
        3    0.000    0.000    0.000    0.000 jsonapi.py:69(dumps)
        2    0.000    0.000    0.000    0.000 jsonutil.py:78(date_default)
        1    0.000    0.000    0.000    0.000 os.py:743(urandom)
  5200000    6.814    0.000   39.110    0.000 re.py:139(search)
  5200000    7.853    0.000   10.878    0.000 re.py:228(_compile)
        1    0.000    0.000    0.000    0.000 session.py:149(msg_header)
        1    0.000    0.000    0.000    0.000 session.py:153(extract_header)
        1    0.000    0.000    0.000    0.000 session.py:315(msg_id)
        1    0.000    0.000    0.000    0.000 session.py:350(msg_header)
        1    0.000    0.000    0.000    0.000 session.py:353(msg)
        1    0.000    0.000    0.000    0.000 session.py:370(sign)
        1    0.000    0.000    0.000    0.000 session.py:385(serialize)
        1    0.000    0.000    0.001    0.001 session.py:437(send)
        3    0.000    0.000    0.000    0.000 session.py:75(<lambda>)
  5200000    4.732    0.000   43.842    0.000 slice_time.py:15(medium_processing)
        1    5.423    5.423   58.068   58.068 slice_time.py:24(timeit)
  5200000    8.802    0.000   52.645    0.000 slice_time.py:40(test_buffer)
        7    0.000    0.000    0.000    0.000 traitlets.py:268(__get__)
        2    0.000    0.000    0.000    0.000 utf_8.py:15(decode)
        1    0.000    0.000    0.000    0.000 uuid.py:101(__init__)
        1    0.000    0.000    0.000    0.000 uuid.py:197(__str__)
        1    0.000    0.000    0.000    0.000 uuid.py:531(uuid4)
        2    0.000    0.000    0.000    0.000 {_codecs.utf_8_decode}
        1    0.000    0.000    0.000    0.000 {built-in method now}
       18    0.000    0.000    0.000    0.000 {isinstance}
        4    0.000    0.000    0.000    0.000 {len}
        1    0.000    0.000    0.000    0.000 {locals}
        1    0.000    0.000    0.000    0.000 {map}
        2    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'close' of '_io.StringIO' objects}
        1    0.000    0.000    0.000    0.000 {method 'count' of 'list' objects}
        2    0.000    0.000    0.000    0.000 {method 'decode' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'extend' of 'list' objects}
  5200001    3.025    0.000    3.025    0.000 {method 'get' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'getvalue' of '_io.StringIO' objects}
        3    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
  5200000   21.418    0.000   21.418    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
        1    0.000    0.000    0.000    0.000 {method 'send_multipart' of 'zmq.core.socket.Socket' objects}
        2    0.000    0.000    0.000    0.000 {method 'strftime' of 'datetime.date' objects}
        1    0.000    0.000    0.000    0.000 {method 'update' of 'dict' objects}
        2    0.000    0.000    0.000    0.000 {method 'write' of '_io.StringIO' objects}
        1    0.000    0.000    0.000    0.000 {posix.close}
        1    0.000    0.000    0.000    0.000 {posix.open}
        1    0.000    0.000    0.000    0.000 {posix.read}
        4    0.000    0.000    0.000    0.000 {time.time}

Smaller slices (length 100).

test_part_slice took 54916.153 ms medium_processing
         31200019 function calls in 54.916 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   54.916   54.916 <string>:1(<module>)
        2    0.000    0.000    0.000    0.000 iostream.py:63(write)
  5200000    6.788    0.000   38.312    0.000 re.py:139(search)
  5200000    8.014    0.000   11.257    0.000 re.py:228(_compile)
  5200000    4.722    0.000   43.034    0.000 slice_time.py:15(medium_processing)
        1    5.594    5.594   54.916   54.916 slice_time.py:24(timeit)
  5200000    6.288    0.000   49.322    0.000 slice_time.py:31(test_part_slice)
        2    0.000    0.000    0.000    0.000 utf_8.py:15(decode)
        2    0.000    0.000    0.000    0.000 {_codecs.utf_8_decode}
        2    0.000    0.000    0.000    0.000 {isinstance}
        2    0.000    0.000    0.000    0.000 {method 'decode' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  5200000    3.242    0.000    3.242    0.000 {method 'get' of 'dict' objects}
  5200000   20.268    0.000   20.268    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
        2    0.000    0.000    0.000    0.000 {method 'write' of '_io.StringIO' objects}
        4    0.000    0.000    0.000    0.000 {time.time}


test_buffer took 62019.684 ms medium_processing
         31200103 function calls in 62.020 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   62.020   62.020 <string>:1(<module>)
        3    0.000    0.000    0.000    0.000 __init__.py:185(dumps)
        3    0.000    0.000    0.000    0.000 encoder.py:102(__init__)
        3    0.000    0.000    0.000    0.000 encoder.py:180(encode)
        3    0.000    0.000    0.000    0.000 encoder.py:206(iterencode)
        1    0.000    0.000    0.001    0.001 iostream.py:37(flush)
        2    0.000    0.000    0.001    0.000 iostream.py:63(write)
        1    0.000    0.000    0.000    0.000 iostream.py:86(_new_buffer)
        3    0.000    0.000    0.000    0.000 jsonapi.py:57(_squash_unicode)
        3    0.000    0.000    0.000    0.000 jsonapi.py:69(dumps)
        2    0.000    0.000    0.000    0.000 jsonutil.py:78(date_default)
        1    0.000    0.000    0.000    0.000 os.py:743(urandom)
  5200000    7.426    0.000   41.152    0.000 re.py:139(search)
  5200000    8.470    0.000   11.628    0.000 re.py:228(_compile)
        1    0.000    0.000    0.000    0.000 session.py:149(msg_header)
        1    0.000    0.000    0.000    0.000 session.py:153(extract_header)
        1    0.000    0.000    0.000    0.000 session.py:315(msg_id)
        1    0.000    0.000    0.000    0.000 session.py:350(msg_header)
        1    0.000    0.000    0.000    0.000 session.py:353(msg)
        1    0.000    0.000    0.000    0.000 session.py:370(sign)
        1    0.000    0.000    0.000    0.000 session.py:385(serialize)
        1    0.000    0.000    0.001    0.001 session.py:437(send)
        3    0.000    0.000    0.000    0.000 session.py:75(<lambda>)
  5200000    5.399    0.000   46.551    0.000 slice_time.py:15(medium_processing)
        1    5.958    5.958   62.020   62.020 slice_time.py:24(timeit)
  5200000    9.510    0.000   56.061    0.000 slice_time.py:40(test_buffer)
        7    0.000    0.000    0.000    0.000 traitlets.py:268(__get__)
        2    0.000    0.000    0.000    0.000 utf_8.py:15(decode)
        1    0.000    0.000    0.000    0.000 uuid.py:101(__init__)
        1    0.000    0.000    0.000    0.000 uuid.py:197(__str__)
        1    0.000    0.000    0.000    0.000 uuid.py:531(uuid4)
        2    0.000    0.000    0.000    0.000 {_codecs.utf_8_decode}
        1    0.000    0.000    0.000    0.000 {built-in method now}
       18    0.000    0.000    0.000    0.000 {isinstance}
        4    0.000    0.000    0.000    0.000 {len}
        1    0.000    0.000    0.000    0.000 {locals}
        1    0.000    0.000    0.000    0.000 {map}
        2    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'close' of '_io.StringIO' objects}
        1    0.000    0.000    0.000    0.000 {method 'count' of 'list' objects}
        2    0.000    0.000    0.000    0.000 {method 'decode' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'extend' of 'list' objects}
  5200001    3.158    0.000    3.158    0.000 {method 'get' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'getvalue' of '_io.StringIO' objects}
        3    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
  5200000   22.097    0.000   22.097    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
        1    0.000    0.000    0.000    0.000 {method 'send_multipart' of 'zmq.core.socket.Socket' objects}
        2    0.000    0.000    0.000    0.000 {method 'strftime' of 'datetime.date' objects}
        1    0.000    0.000    0.000    0.000 {method 'update' of 'dict' objects}
        2    0.000    0.000    0.000    0.000 {method 'write' of '_io.StringIO' objects}
        1    0.000    0.000    0.000    0.000 {posix.close}
        1    0.000    0.000    0.000    0.000 {posix.open}
        1    0.000    0.000    0.000    0.000 {posix.read}
        4    0.000    0.000    0.000    0.000 {time.time}
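One more option worth noting when process() is regex-based (not covered in the answers above): compiled patterns accept pos and endpos arguments, so the search can be confined to a window of the string without creating any substring at all. A sketch with a made-up pattern and indices:

```python
import re

text = 'x' * 1000 + 'needle' + 'x' * 1000
pattern = re.compile('needle')

# Instead of pattern.search(text[i:j]), which copies the slice,
# pass the window bounds directly to the compiled pattern:
i, j = 900, 1100
m = pattern.search(text, i, j)   # pos/endpos: no temporary substring
print(m.group(), m.start())      # match positions refer to the full string
```

One subtlety: with pos, '^' does not match at the start of the window the way it does after slicing, and match positions are indices into the full string, not the window.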

Answer 5 (score: -2)

process(huge_text_block[i:j])

I want to avoid the overhead of generating these temporary substrings (...) Note that process() is another Python module that expects a string as input.

A complete contradiction: how do you imagine finding a way of not creating what the function requires?!