The part of my code that processes strings currently looks like this:

for (i, j) in huge_list_of_indices:
    process(huge_text_block[i:j])

I want to avoid the overhead of generating all these temporary substrings. Any ideas? Perhaps a wrapper that uses index offsets somehow? This is currently my bottleneck.

Note that process() is another python module that expects a string as input.
Edit

Some people doubt there is a problem. Here are some sample results:
import time
import string

text = string.letters * 1000

def timeit(fn):
    t1 = time.time()
    for i in range(len(text)):
        fn(i)
    t2 = time.time()
    print '%s took %0.3f ms' % (fn.func_name, (t2-t1) * 1000)

def test_1(i):
    return text[i:]

def test_2(i):
    return text[:]

def test_3(i):
    return text

timeit(test_1)
timeit(test_2)
timeit(test_3)
Output:
test_1 took 972.046 ms
test_2 took 47.620 ms
test_3 took 43.457 ms
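(Side note: test_1 copies the tail of the string on every call, so the loop as a whole does O(n²) work, while a full slice of an immutable str in CPython, as in test_2, simply returns the original object, which is why test_2 and test_3 are nearly identical.)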
Answer 0 (score: 8)
I think what you're looking for are buffers.

The characteristic of buffers is that they "slice" an object supporting the buffer interface without copying its content; they essentially open a "window" on the sliced object's content. A more technical explanation is available here. An excerpt:

Python objects implemented in C can export a group of functions called the "buffer interface." These functions can be used by an object to expose its data in a raw, byte-oriented format. Clients of the object can use the buffer interface to access the object data directly, without needing to copy it first.

In your case the code should look more or less like this:
>>> s = 'Hugely_long_string_not_to_be_copied'
>>> ij = [(0, 3), (6, 9), (12, 18)]
>>> for i, j in ij:
...     print buffer(s, i, j-i)  # Should become process(...)
Hug
_lo
string
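Note that buffer() only exists in Python 2. As a side note (my own addition, not part of the original answer), the rough Python 3 counterpart uses memoryview over bytes:

>>> s = b'Hugely_long_string_not_to_be_copied'
>>> for i, j in [(0, 3), (6, 9), (12, 18)]:
...     print(bytes(memoryview(s)[i:j]))  # bytes() copies only the small window
b'Hug'
b'_lo'
b'string'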
HTH!
Answer 1 (score: 3)
A wrapper that uses index offsets into an mmap object could work, yes.

But before you do that, are you sure that generating these substrings is actually the problem? Don't optimize before you've found out where the time and memory are really going. I wouldn't expect this to be a significant issue.
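If slicing does turn out to be the bottleneck, a minimal sketch of the mmap idea might look like this (Python 3; the file name and the assumption that the text lives on disk are mine, not part of the original answer):

import mmap

with open('huge_text.txt', 'rb') as f:  # hypothetical file holding the text
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    mv = memoryview(mm)        # zero-copy window over the mapped file
    for i, j in huge_list_of_indices:
        process(mv[i:j])       # each slice is another view; no substring is built
    mv.release()               # release the view before closing the map
    mm.close()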
Answer 2 (score: 1)
If you're using Python 3, you can use the buffer protocol and memory views. Assuming the text is stored somewhere in the filesystem:
import os

f = open(FILENAME, 'rb')
data = bytearray(os.path.getsize(FILENAME))
f.readinto(data)
f.close()

mv = memoryview(data)
for (i, j) in huge_list_of_indices:
    process(mv[i:j])
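Note that mv[i:j] is itself a memoryview, not a str, so process() has to accept a bytes-like object; converting it back to a real string with bytes(mv[i:j]) would reintroduce exactly the copy you are trying to avoid.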
Also take a look at this article. It may be useful.
Answer 3 (score: 0)
Maybe a wrapper that uses index offsets is indeed what you're looking for. Here is an example that does the job. Depending on your needs, you may have to add more checks on the slices (for overflow and for negative indices).
#!/usr/bin/env python
from collections import Sequence
from timeit import Timer

def process(s):
    return s[0], len(s)

class FakeString(Sequence):
    """Read-only view on a string that exposes a fake start/stop window."""

    def __init__(self, string):
        self._string = string
        self.fake_start = 0
        self.fake_stop = len(string)

    def setFakeIndices(self, i, j):
        self.fake_start = i
        self.fake_stop = j

    def __len__(self):
        return self.fake_stop - self.fake_start

    def __getitem__(self, ii):
        # Translate indices/slices from window coordinates to the
        # coordinates of the underlying string.
        if isinstance(ii, slice):
            if ii.start is None:
                start = self.fake_start
            else:
                start = ii.start + self.fake_start
            if ii.stop is None:
                stop = self.fake_stop
            else:
                stop = ii.stop + self.fake_start
            ii = slice(start, stop, ii.step)
        else:
            ii = ii + self.fake_start
        return self._string[ii]

def initial_method():
    r = []
    for n in xrange(1000):
        r.append(process(huge_string[1:9999999]))
    return r

def alternative_method():
    r = []
    for n in xrange(1000):
        fake_string.setFakeIndices(1, 9999999)
        r.append(process(fake_string))
    return r

if __name__ == '__main__':
    huge_string = 'ABCDEFGHIJ' * 100000
    fake_string = FakeString(huge_string)

    fake_string.setFakeIndices(5, 15)
    assert fake_string[:] == huge_string[5:15]

    t = Timer(initial_method)
    print "initial_method(): %fs" % t.timeit(number=1)

    t = Timer(alternative_method)
    print "alternative_method(): %fs" % t.timeit(number=1)
This gives:
initial_method(): 1.248001s
alternative_method(): 0.003416s
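Keep in mind that the speedup is this dramatic mainly because process() here only touches s[0] and len(s). A process() that iterates over every character goes through FakeString.__getitem__ for each access, which is Python-level and far slower than C-level slicing, so whether the wrapper wins depends entirely on the access pattern.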
Answer 4 (score: 0)
The example the OP provides will show close to the maximum possible performance difference between slicing and not slicing.

If the processing actually takes a significant amount of time, the problem may barely exist at all.

The fact is that the OP needs to let us know what process is. Most likely it does something significant, so he should profile his code.

Adapted from the OP's example:
#slice_time.py

import time
import string
text = string.letters * 1000

import random
indices = range(len(text))
random.shuffle(indices)

import re

def greater_processing(a_string):
    return re.findall('m', a_string)

def medium_processing(a_string):
    return re.search('m.*?m', a_string)

def lesser_processing(a_string):
    return re.match('m', a_string)

def least_processing(a_string):
    return a_string

def timeit(fn, processor):
    t1 = time.time()
    for i in indices:
        fn(i, i + 1000, processor)
    t2 = time.time()
    print '%s took %0.3f ms %s' % (fn.func_name, (t2-t1) * 1000, processor.__name__)

def test_part_slice(i, j, processor):
    return processor(text[i:j])

def test_copy(i, j, processor):
    return processor(text[:])

def test_text(i, j, processor):
    return processor(text)

def test_buffer(i, j, processor):
    return processor(buffer(text, i, j - i))

if __name__ == '__main__':
    processors = [least_processing, lesser_processing, medium_processing, greater_processing]
    tests = [test_part_slice, test_copy, test_text, test_buffer]
    for processor in processors:
        for test in tests:
            timeit(test, processor)
Then a run...
In [494]: run slice_time.py
test_part_slice took 68.264 ms least_processing
test_copy took 42.988 ms least_processing
test_text took 33.075 ms least_processing
test_buffer took 76.770 ms least_processing
test_part_slice took 270.038 ms lesser_processing
test_copy took 197.681 ms lesser_processing
test_text took 196.716 ms lesser_processing
test_buffer took 262.288 ms lesser_processing
test_part_slice took 416.072 ms medium_processing
test_copy took 352.254 ms medium_processing
test_text took 337.971 ms medium_processing
test_buffer took 438.683 ms medium_processing
test_part_slice took 502.069 ms greater_processing
test_copy took 8149.231 ms greater_processing
test_text took 8292.333 ms greater_processing
test_buffer took 563.009 ms greater_processing
Notes:

Yes, I tried the OP's original test_1 with [i:] slicing, and it is much slower, making his test look even worse.

Interesting that buffer almost always performs slightly slower than slicing. This time there is one case where it does better, though! The real test is below: buffer seems to do better for larger substrings, while slicing does better for smaller ones.

And, yes, there is some randomness in this test, so run it and see the different results :). It may also be interesting to change the size of 1000.

So, maybe some others believe you, but I don't, so I'd like to know something about what the processing does and how you came to the conclusion that "slicing is the problem."

I profiled medium_processing from my example, increased the string.letters multiplier to 100000, and raised the slice length to 10000. Below there is also one with a slice length of 100. I used cProfile (less overhead than profile!).
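For reference, a sketch of how a run like the one below can be gathered (the exact invocation is my assumption; the answer only says cProfile was used):

import cProfile
cProfile.run('timeit(test_part_slice, medium_processing)')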
test_part_slice took 77338.285 ms medium_processing
31200019 function calls in 77.338 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 77.338 77.338 <string>:1(<module>)
2 0.000 0.000 0.000 0.000 iostream.py:63(write)
5200000 8.208 0.000 43.823 0.000 re.py:139(search)
5200000 9.205 0.000 12.897 0.000 re.py:228(_compile)
5200000 5.651 0.000 49.475 0.000 slice_time.py:15(medium_processing)
1 7.901 7.901 77.338 77.338 slice_time.py:24(timeit)
5200000 19.963 0.000 69.438 0.000 slice_time.py:31(test_part_slice)
2 0.000 0.000 0.000 0.000 utf_8.py:15(decode)
2 0.000 0.000 0.000 0.000 {_codecs.utf_8_decode}
2 0.000 0.000 0.000 0.000 {isinstance}
2 0.000 0.000 0.000 0.000 {method 'decode' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
5200000 3.692 0.000 3.692 0.000 {method 'get' of 'dict' objects}
5200000 22.718 0.000 22.718 0.000 {method 'search' of '_sre.SRE_Pattern' objects}
2 0.000 0.000 0.000 0.000 {method 'write' of '_io.StringIO' objects}
4 0.000 0.000 0.000 0.000 {time.time}
test_buffer took 58067.440 ms medium_processing
31200103 function calls in 58.068 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 58.068 58.068 <string>:1(<module>)
3 0.000 0.000 0.000 0.000 __init__.py:185(dumps)
3 0.000 0.000 0.000 0.000 encoder.py:102(__init__)
3 0.000 0.000 0.000 0.000 encoder.py:180(encode)
3 0.000 0.000 0.000 0.000 encoder.py:206(iterencode)
1 0.000 0.000 0.001 0.001 iostream.py:37(flush)
2 0.000 0.000 0.001 0.000 iostream.py:63(write)
1 0.000 0.000 0.000 0.000 iostream.py:86(_new_buffer)
3 0.000 0.000 0.000 0.000 jsonapi.py:57(_squash_unicode)
3 0.000 0.000 0.000 0.000 jsonapi.py:69(dumps)
2 0.000 0.000 0.000 0.000 jsonutil.py:78(date_default)
1 0.000 0.000 0.000 0.000 os.py:743(urandom)
5200000 6.814 0.000 39.110 0.000 re.py:139(search)
5200000 7.853 0.000 10.878 0.000 re.py:228(_compile)
1 0.000 0.000 0.000 0.000 session.py:149(msg_header)
1 0.000 0.000 0.000 0.000 session.py:153(extract_header)
1 0.000 0.000 0.000 0.000 session.py:315(msg_id)
1 0.000 0.000 0.000 0.000 session.py:350(msg_header)
1 0.000 0.000 0.000 0.000 session.py:353(msg)
1 0.000 0.000 0.000 0.000 session.py:370(sign)
1 0.000 0.000 0.000 0.000 session.py:385(serialize)
1 0.000 0.000 0.001 0.001 session.py:437(send)
3 0.000 0.000 0.000 0.000 session.py:75(<lambda>)
5200000 4.732 0.000 43.842 0.000 slice_time.py:15(medium_processing)
1 5.423 5.423 58.068 58.068 slice_time.py:24(timeit)
5200000 8.802 0.000 52.645 0.000 slice_time.py:40(test_buffer)
7 0.000 0.000 0.000 0.000 traitlets.py:268(__get__)
2 0.000 0.000 0.000 0.000 utf_8.py:15(decode)
1 0.000 0.000 0.000 0.000 uuid.py:101(__init__)
1 0.000 0.000 0.000 0.000 uuid.py:197(__str__)
1 0.000 0.000 0.000 0.000 uuid.py:531(uuid4)
2 0.000 0.000 0.000 0.000 {_codecs.utf_8_decode}
1 0.000 0.000 0.000 0.000 {built-in method now}
18 0.000 0.000 0.000 0.000 {isinstance}
4 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {locals}
1 0.000 0.000 0.000 0.000 {map}
2 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'close' of '_io.StringIO' objects}
1 0.000 0.000 0.000 0.000 {method 'count' of 'list' objects}
2 0.000 0.000 0.000 0.000 {method 'decode' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {method 'extend' of 'list' objects}
5200001 3.025 0.000 3.025 0.000 {method 'get' of 'dict' objects}
1 0.000 0.000 0.000 0.000 {method 'getvalue' of '_io.StringIO' objects}
3 0.000 0.000 0.000 0.000 {method 'join' of 'str' objects}
5200000 21.418 0.000 21.418 0.000 {method 'search' of '_sre.SRE_Pattern' objects}
1 0.000 0.000 0.000 0.000 {method 'send_multipart' of 'zmq.core.socket.Socket' objects}
2 0.000 0.000 0.000 0.000 {method 'strftime' of 'datetime.date' objects}
1 0.000 0.000 0.000 0.000 {method 'update' of 'dict' objects}
2 0.000 0.000 0.000 0.000 {method 'write' of '_io.StringIO' objects}
1 0.000 0.000 0.000 0.000 {posix.close}
1 0.000 0.000 0.000 0.000 {posix.open}
1 0.000 0.000 0.000 0.000 {posix.read}
4 0.000 0.000 0.000 0.000 {time.time}
Smaller slices (length 100).
test_part_slice took 54916.153 ms medium_processing
31200019 function calls in 54.916 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 54.916 54.916 <string>:1(<module>)
2 0.000 0.000 0.000 0.000 iostream.py:63(write)
5200000 6.788 0.000 38.312 0.000 re.py:139(search)
5200000 8.014 0.000 11.257 0.000 re.py:228(_compile)
5200000 4.722 0.000 43.034 0.000 slice_time.py:15(medium_processing)
1 5.594 5.594 54.916 54.916 slice_time.py:24(timeit)
5200000 6.288 0.000 49.322 0.000 slice_time.py:31(test_part_slice)
2 0.000 0.000 0.000 0.000 utf_8.py:15(decode)
2 0.000 0.000 0.000 0.000 {_codecs.utf_8_decode}
2 0.000 0.000 0.000 0.000 {isinstance}
2 0.000 0.000 0.000 0.000 {method 'decode' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
5200000 3.242 0.000 3.242 0.000 {method 'get' of 'dict' objects}
5200000 20.268 0.000 20.268 0.000 {method 'search' of '_sre.SRE_Pattern' objects}
2 0.000 0.000 0.000 0.000 {method 'write' of '_io.StringIO' objects}
4 0.000 0.000 0.000 0.000 {time.time}
test_buffer took 62019.684 ms medium_processing
31200103 function calls in 62.020 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 62.020 62.020 <string>:1(<module>)
3 0.000 0.000 0.000 0.000 __init__.py:185(dumps)
3 0.000 0.000 0.000 0.000 encoder.py:102(__init__)
3 0.000 0.000 0.000 0.000 encoder.py:180(encode)
3 0.000 0.000 0.000 0.000 encoder.py:206(iterencode)
1 0.000 0.000 0.001 0.001 iostream.py:37(flush)
2 0.000 0.000 0.001 0.000 iostream.py:63(write)
1 0.000 0.000 0.000 0.000 iostream.py:86(_new_buffer)
3 0.000 0.000 0.000 0.000 jsonapi.py:57(_squash_unicode)
3 0.000 0.000 0.000 0.000 jsonapi.py:69(dumps)
2 0.000 0.000 0.000 0.000 jsonutil.py:78(date_default)
1 0.000 0.000 0.000 0.000 os.py:743(urandom)
5200000 7.426 0.000 41.152 0.000 re.py:139(search)
5200000 8.470 0.000 11.628 0.000 re.py:228(_compile)
1 0.000 0.000 0.000 0.000 session.py:149(msg_header)
1 0.000 0.000 0.000 0.000 session.py:153(extract_header)
1 0.000 0.000 0.000 0.000 session.py:315(msg_id)
1 0.000 0.000 0.000 0.000 session.py:350(msg_header)
1 0.000 0.000 0.000 0.000 session.py:353(msg)
1 0.000 0.000 0.000 0.000 session.py:370(sign)
1 0.000 0.000 0.000 0.000 session.py:385(serialize)
1 0.000 0.000 0.001 0.001 session.py:437(send)
3 0.000 0.000 0.000 0.000 session.py:75(<lambda>)
5200000 5.399 0.000 46.551 0.000 slice_time.py:15(medium_processing)
1 5.958 5.958 62.020 62.020 slice_time.py:24(timeit)
5200000 9.510 0.000 56.061 0.000 slice_time.py:40(test_buffer)
7 0.000 0.000 0.000 0.000 traitlets.py:268(__get__)
2 0.000 0.000 0.000 0.000 utf_8.py:15(decode)
1 0.000 0.000 0.000 0.000 uuid.py:101(__init__)
1 0.000 0.000 0.000 0.000 uuid.py:197(__str__)
1 0.000 0.000 0.000 0.000 uuid.py:531(uuid4)
2 0.000 0.000 0.000 0.000 {_codecs.utf_8_decode}
1 0.000 0.000 0.000 0.000 {built-in method now}
18 0.000 0.000 0.000 0.000 {isinstance}
4 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {locals}
1 0.000 0.000 0.000 0.000 {map}
2 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'close' of '_io.StringIO' objects}
1 0.000 0.000 0.000 0.000 {method 'count' of 'list' objects}
2 0.000 0.000 0.000 0.000 {method 'decode' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {method 'extend' of 'list' objects}
5200001 3.158 0.000 3.158 0.000 {method 'get' of 'dict' objects}
1 0.000 0.000 0.000 0.000 {method 'getvalue' of '_io.StringIO' objects}
3 0.000 0.000 0.000 0.000 {method 'join' of 'str' objects}
5200000 22.097 0.000 22.097 0.000 {method 'search' of '_sre.SRE_Pattern' objects}
1 0.000 0.000 0.000 0.000 {method 'send_multipart' of 'zmq.core.socket.Socket' objects}
2 0.000 0.000 0.000 0.000 {method 'strftime' of 'datetime.date' objects}
1 0.000 0.000 0.000 0.000 {method 'update' of 'dict' objects}
2 0.000 0.000 0.000 0.000 {method 'write' of '_io.StringIO' objects}
1 0.000 0.000 0.000 0.000 {posix.close}
1 0.000 0.000 0.000 0.000 {posix.open}
1 0.000 0.000 0.000 0.000 {posix.read}
4 0.000 0.000 0.000 0.000 {time.time}
Answer 5 (score: -2)
process(huge_text_block[i:j])

I want to avoid the overhead of generating these temporary substrings.

(...)

Note that process() is another python module that expects a string as input.

Totally contradictory. How do you imagine finding a way of not creating what the function requires?!