tempfile.TemporaryFile vs. StringIO

Date: 2016-02-08 20:16:12

Tags: python stringio cstringio

I wrote a small benchmark that compares different string concatenation methods for ZOCache.

It looks like tempfile.TemporaryFile is faster than anything else:

$ python src/ZOCache/tmp_benchmark.py 
3.00407409668e-05 TemporaryFile
0.385630846024 SpooledTemporaryFile
0.299962997437 BufferedRandom
0.0849719047546 io.StringIO
0.113346099854 concat

The benchmark code I have been using:

#!/usr/bin/python
from __future__ import print_function
import io
import timeit
import tempfile


class Error(Exception):
    pass


def bench_temporaryfile():
    with tempfile.TemporaryFile(bufsize=10*1024*1024) as out:
        for i in range(0, 100):
            out.write(b"Value = ")
            out.write(bytes(i))
            out.write(b" ")

        # Get string.
        out.seek(0)
        contents = out.read()
        out.close()
        # Test first letter.
        if contents[0:5] != b"Value":
            raise Error


def bench_spooledtemporaryfile():
    with tempfile.SpooledTemporaryFile(max_size=10*1024*1024) as out:
        for i in range(0, 100):
            out.write(b"Value = ")
            out.write(bytes(i))
            out.write(b" ")

        # Get string.
        out.seek(0)
        contents = out.read()
        out.close()
        # Test first letter.
        if contents[0:5] != b"Value":
            raise Error


def bench_BufferedRandom():
    # 1. BufferedRandom
    with io.open('out.bin', mode='w+b') as fp:
        with io.BufferedRandom(fp, buffer_size=10*1024*1024) as out:
            for i in range(0, 100):
                out.write(b"Value = ")
                out.write(bytes(i))
                out.write(b" ")

            # Get string.
            out.seek(0)
            contents = out.read()
            # Test first letter.
            if contents[0:5] != b'Value':
                raise Error


def bench_stringIO():
    # 1. Use StringIO.
    out = io.StringIO()
    for i in range(0, 100):
        out.write(u"Value = ")
        out.write(unicode(i))
        out.write(u" ")

    # Get string.
    contents = out.getvalue()
    out.close()
    # Test first letter.
    if contents[0] != 'V':
        raise Error


def bench_concat():
    # 2. Use string appends.
    data = ""
    for i in range(0, 100):
        data += u"Value = "
        data += unicode(i)
        data += u" "
    # Test first letter.
    if data[0] != u'V':
        raise Error


if __name__ == '__main__':
    print(str(timeit.timeit('bench_temporaryfile()', setup="from __main__ import bench_temporaryfile", number=1000)) + " TemporaryFile")
    print(str(timeit.timeit('bench_spooledtemporaryfile()', setup="from __main__ import bench_spooledtemporaryfile", number=1000)) + " SpooledTemporaryFile")
    print(str(timeit.timeit('bench_BufferedRandom()', setup="from __main__ import bench_BufferedRandom", number=1000)) + " BufferedRandom")
    print(str(timeit.timeit("bench_stringIO()", setup="from __main__ import bench_stringIO", number=1000)) + " io.StringIO")
    print(str(timeit.timeit("bench_concat()", setup="from __main__ import bench_concat", number=1000)) + " concat")

Edit: Python 3.4.3 + io.BytesIO

python3 ./src/ZOCache/tmp_benchmark.py 
2.689500024644076e-05 TemporaryFile
0.30429405899985795 SpooledTemporaryFile
0.348170792000019 BufferedRandom
0.0764778530001422 io.BytesIO
0.05162201000030109 concat

New source using io.BytesIO:

#!/usr/bin/python3
from __future__ import print_function
import io
import timeit
import tempfile


class Error(Exception):
    pass


def bench_temporaryfile():
    with tempfile.TemporaryFile() as out:
        for i in range(0, 100):
            out.write(b"Value = ")
            out.write(bytes(str(i), 'utf-8'))
            out.write(b" ")

        # Get string.
        out.seek(0)
        contents = out.read()
        out.close()
        # Test first letter.
        if contents[0:5] != b"Value":
            raise Error


def bench_spooledtemporaryfile():
    with tempfile.SpooledTemporaryFile(max_size=10*1024*1024) as out:
        for i in range(0, 100):
            out.write(b"Value = ")
            out.write(bytes(str(i), 'utf-8'))
            out.write(b" ")

        # Get string.
        out.seek(0)
        contents = out.read()
        out.close()
        # Test first letter.
        if contents[0:5] != b"Value":
            raise Error


def bench_BufferedRandom():
    # 1. BufferedRandom
    with io.open('out.bin', mode='w+b') as fp:
        with io.BufferedRandom(fp, buffer_size=10*1024*1024) as out:
            for i in range(0, 100):
                out.write(b"Value = ")
                out.write(bytes(str(i), 'utf-8'))
                out.write(b" ")

            # Get string.
            out.seek(0)
            contents = out.read()
            # Test first letter.
            if contents[0:5] != b'Value':
                raise Error


def bench_BytesIO():
    # 1. Use BytesIO.
    out = io.BytesIO()
    for i in range(0, 100):
        out.write(b"Value = ")
        out.write(bytes(str(i), 'utf-8'))
        out.write(b" ")

    # Get string.
    contents = out.getvalue()
    out.close()
    # Test first letter.
    if contents[0:5] != b'Value':
        raise Error


def bench_concat():
    # 2. Use string appends.
    data = ""
    for i in range(0, 100):
        data += "Value = "
        data += str(i)
        data += " "
    # Test first letter.
    if data[0] != 'V':
        raise Error


if __name__ == '__main__':
    print(str(timeit.timeit('bench_temporaryfile()', setup="from __main__ import bench_temporaryfile", number=1000)) + " TemporaryFile")
    print(str(timeit.timeit('bench_spooledtemporaryfile()', setup="from __main__ import bench_spooledtemporaryfile", number=1000)) + " SpooledTemporaryFile")
    print(str(timeit.timeit('bench_BufferedRandom()', setup="from __main__ import bench_BufferedRandom", number=1000)) + " BufferedRandom")
    print(str(timeit.timeit("bench_BytesIO()", setup="from __main__ import bench_BytesIO", number=1000)) + " io.BytesIO")
    print(str(timeit.timeit("bench_concat()", setup="from __main__ import bench_concat", number=1000)) + " concat")

Is this true on every platform? If so, why?

Edit: results with the fixed benchmark (and fixed code):

0.2675984420002351 TemporaryFile
0.28104681999866443 SpooledTemporaryFile
0.3555715570000757 BufferedRandom
0.10379689100045653 io.BytesIO
0.05650951399911719 concat

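1 Answer:

Answer 0 (score: 7)

Your biggest problem: per tdelaney, you never actually ran the TemporaryFile test; you omitted the parens in the timeit snippet (only for that test; the others actually ran). So you were timing how long it takes to look up the name bench_temporaryfile, not actually calling it. Change: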

print(str(timeit.timeit('bench_temporaryfile', setup="from __main__ import bench_temporaryfile", number=1000)) + " TemporaryFile")

to:

print(str(timeit.timeit('bench_temporaryfile()', setup="from __main__ import bench_temporaryfile", number=1000)) + " TemporaryFile")

(adding the parens to make it a call) to fix it.
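To see the difference for yourself, here is a minimal sketch (Python 3.5+, since it relies on the globals argument of timeit.timeit). The function bench and the counter calls are illustrative names, not part of the original benchmark:

```python
import timeit

calls = 0  # incremented every time bench() really runs


def bench():
    global calls
    calls += 1


# Statement without parentheses: only the name is looked up, bench never runs.
lookup_time = timeit.timeit("bench", globals=globals(), number=1000)
calls_after_lookup = calls

# Statement with parentheses: bench is actually called 1000 times.
call_time = timeit.timeit("bench()", globals=globals(), number=1000)
calls_after_call = calls

print(calls_after_lookup, calls_after_call)  # prints: 0 1000
```

The first timing will normally be far smaller than the second, which matches the suspiciously tiny TemporaryFile figure in the question.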

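Some other issues:

io.StringIO is fundamentally different from your other test cases. Specifically, all the other types you're testing operate in binary mode, reading and writing str and avoiding line-ending conversions. io.StringIO uses Python 3 style strings (unicode in Python 2), which your tests acknowledge by using different literals and converting to unicode instead of bytes. This adds a lot of encoding and decoding overhead, and it uses a lot more memory (unicode uses 2-4x the memory of str for the same data, which means more allocator overhead, more copy overhead, etc.).

The other major difference is that you're setting a truly huge bufsize for TemporaryFile; few system calls need to occur, and most writes just append to contiguous memory in the buffer. By contrast, io.StringIO stores the individual values written, and only joins them together when you ask for them with getvalue().

Also, lastly, you think you're being forward compatible by using the bytes constructor, but you're not; in Python 2, bytes is an alias for str, so bytes(10) returns '10', but in Python 3, bytes is a completely different thing, and passing it an integer returns a zero-initialized bytes object of that size: bytes(10) returns b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'.

If you want a fair test case, at the very least switch to cStringIO.StringIO or io.BytesIO instead of io.StringIO, and write bytes uniformly. Typically, you wouldn't explicitly set the buffer size for TemporaryFile and the like yourself, so you might consider dropping that.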
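A small Python 3 sketch of the text/binary split described above. It only shows that the two buffer types accept different data types and that an explicit encode is needed to compare them; it makes no claim about how large the overhead is:

```python
import io

text_buf = io.StringIO()  # text buffer: accepts str (unicode) only
byte_buf = io.BytesIO()   # binary buffer: accepts bytes-like objects only

text_buf.write("Value = 1 ")
byte_buf.write(b"Value = 1 ")

# Each buffer rejects the other kind of data with a TypeError.
try:
    text_buf.write(b"bytes")
    text_rejects_bytes = False
except TypeError:
    text_rejects_bytes = True

try:
    byte_buf.write("text")
    bytes_rejects_text = False
except TypeError:
    bytes_rejects_text = True

# Comparing the two results requires an explicit encoding step.
encoded = text_buf.getvalue().encode("utf-8")
print(text_rejects_bytes, bytes_rejects_text, encoded == byte_buf.getvalue())
```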
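The effect of the buffer size can be sketched without touching the file system. The example below (Python 3) wraps a hypothetical CountingRaw object, invented here purely for illustration, in io.BufferedWriter and counts how often the underlying raw write is invoked; on a real file, each of those calls would roughly correspond to a system call:

```python
import io


class CountingRaw(io.RawIOBase):
    """Illustrative raw stream that records how many times write() is called."""

    def __init__(self):
        super().__init__()
        self.calls = 0
        self.data = bytearray()

    def writable(self):
        return True

    def write(self, b):
        self.calls += 1
        self.data += b
        return len(b)


def count_raw_writes(buffer_size):
    raw = CountingRaw()
    out = io.BufferedWriter(raw, buffer_size=buffer_size)
    for i in range(100):
        out.write(b"Value = ")
        out.write(str(i).encode("ascii"))
        out.write(b" ")
    out.flush()  # push whatever is still buffered down to the raw stream
    return raw.calls, bytes(raw.data)


big_calls, big_data = count_raw_writes(10 * 1024 * 1024)
small_calls, small_data = count_raw_writes(16)
print("10 MiB buffer:", big_calls, "raw writes")
print("16 byte buffer:", small_calls, "raw writes")
```

Both runs produce the same bytes; only the number of trips to the underlying stream changes.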
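One way to get the decimal digits of an integer as bytes on both Python 2 and Python 3 is to go through str first. A minimal sketch; int_to_ascii_bytes is a hypothetical helper name, not something from the original code:

```python
def int_to_ascii_bytes(i):
    """Return the decimal representation of i as ASCII bytes."""
    return str(i).encode("ascii")


print(int_to_ascii_bytes(10))  # b'10' on Python 3
print(len(bytes(10)))          # 10 on Python 3: ten zero bytes, not the digits
```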
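A fairer comparison might look something like the following Python 3 sketch: every candidate writes the same bytes, no explicit buffer size is set, and the outputs are checked for equality before timing. The function names (pieces, build_bytesio, build_join, build_tempfile, run_benchmarks) are made up for this example, and the b"".join variant is an extra baseline that was not part of the original benchmark:

```python
import io
import tempfile
import timeit


def pieces():
    """Yield the same byte chunks for every candidate."""
    for i in range(100):
        yield b"Value = "
        yield str(i).encode("ascii")
        yield b" "


def build_bytesio():
    out = io.BytesIO()
    for chunk in pieces():
        out.write(chunk)
    return out.getvalue()


def build_join():
    return b"".join(pieces())


def build_tempfile():
    # Default buffering; no explicit buffer size.
    with tempfile.TemporaryFile() as out:
        for chunk in pieces():
            out.write(chunk)
        out.seek(0)
        return out.read()


def run_benchmarks(number=1000):
    """Return {function name: total seconds for `number` calls}."""
    results = {}
    for func in (build_bytesio, build_join, build_tempfile):
        results[func.__name__] = timeit.timeit(func, number=number)
    return results


if __name__ == "__main__":
    # All candidates must produce identical output before timing means anything.
    assert build_bytesio() == build_join() == build_tempfile()
    for name, seconds in sorted(run_benchmarks().items(), key=lambda kv: kv[1]):
        print("{:.6f} s  {}".format(seconds, name))
```

Passing the function object to timeit.timeit, rather than a statement string, also rules out the missing-parentheses mistake.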

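In my own tests on Linux x64 with Python 2.7.10, using ipython's %timeit magic, the ranking is:

  1. io.BytesIO 48 μs per loop
  2. io.StringIO ~54 μs per loop (so the unicode overhead didn't add much)
  3. cStringIO.StringIO ~83 μs per loop
  4. TemporaryFile ~2.8 ms per loop (note the units; ms is 1000x longer than μs)

And that's without going back to default buffer sizes (I kept the explicit bufsize from your tests). I suspect the behavior of TemporaryFile will vary a lot more (depending on the OS and how temporary files are handled; some systems might just store them in memory, others might store them in /tmp, but of course, /tmp might just be a RAMdisk anyway).

Something tells me you may have a setup where /tmp is basically a plain memory buffer that never goes to the file system, whereas mine may ultimately end up on persistent storage (if only for short periods); stuff happening in memory is predictable, but when you involve the file system (which TemporaryFile can, depending on OS, kernel settings, etc.), the behavior will differ a great deal between systems.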
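If you want to check what backs your temporary directory, a rough Linux-only sketch is to look up its mount point in /proc/mounts. The helper filesystem_type is hypothetical, it ignores escaped characters in mount paths, and it returns None on systems without /proc/mounts:

```python
import os
import tempfile


def filesystem_type(path, mounts_file="/proc/mounts"):
    """Return the file system type of the mount containing path, or None."""
    try:
        with open(mounts_file) as fh:
            entries = [line.split() for line in fh]
    except OSError:
        return None  # not Linux, or /proc is unavailable

    path = os.path.realpath(path)
    best_mount, best_type = "", None
    for fields in entries:
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        inside = path == mount_point or path.startswith(prefix)
        # Prefer the longest matching mount point; later entries win ties.
        if inside and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


tmp_dir = tempfile.gettempdir()
print(tmp_dir, filesystem_type(tmp_dir))
```

A result such as tmpfs would suggest the temporary files live in memory, while a disk-backed type would suggest they may reach persistent storage.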