Question

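I have a medium-sized, sorted ASCII text file that I'm trying to work with in Python. I'm trying to decide at what number of searches it becomes faster to read the whole file into memory and search it with numpy logical indexing, rather than using a simple binary search function that I wrote. I'm using the timeit function to make that comparison. To do this, I have the following setup: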

import os
import timeit
import numpy as np


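def binsearch(file, label, col=0, start=0, stop=None, linelength=None):
    """Recursively binary-search a sorted file whose lines all have the same length.

    Returns the line whose integer in column `col` equals `label`, or None.
    `start` and `stop` are offsets into the file.
    """

    if linelength is None:
        # Measure the length of one line (newline included) from the first line.
        file.seek(0, os.SEEK_SET)
        file.readline()
        linelength = file.tell()

    if stop is None:
        # Default to searching up to the end of the file.
        file.seek(0, os.SEEK_END)
        stop = file.tell()

    # Convert offsets to line indices and pick the middle line.
    stopline = stop // linelength
    startline = start // linelength

    midline = (stopline-startline) // 2 + startline

    mid = midline*linelength

    file.seek(mid, os.SEEK_SET)

    line = file.readline()

    if not line:
        return None

    linelab = int(line.split()[col])

    if linelab == label:
        return line
    elif midline == startline or midline == stopline:
        # The interval cannot shrink any further, so the label is not present.
        return None
    elif linelab < label:
        # The target lies in the upper half.
        start = mid
        return binsearch(file, label, col=col, start=start, stop=stop, linelength=linelength)
    elif linelab > label:
        # The target lies in the lower half.
        stop = mid
        return binsearch(file, label, col=col, start=start, stop=stop, linelength=linelength)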


filepath = '/Users/aliounis/UCAC4/u4i/u4xtycho'
data0 = np.genfromtxt(filepath, dtype=int, names=['tycid', 'ucacid', 'rnm'])
numsearch = 10000
checks = data0['ucacid'][np.random.randint(0, 259788-1, numsearch)]
del data0

allin = """
data = np.genfromtxt(filepath, dtype=int, names=['tycid', 'ucacid', 'rnm'])

locate = checks.reshape(1, -1) ==  data['ucacid'].reshape(-1, 1)

print(data[np.any(locate, axis=1)].shape)
"""

bins = """
file = open(filepath, 'r')
recs = []
dtypes = np.dtype([('tycid', int), ('ucacid', int), ('rnm', int)])
for val in checks:
    line = binsearch(file, val, col=1)
    if line is not None:
        recs.append(np.array([tuple(np.fromstring(line, dtype=int, sep=' '))], dtype=dtypes))

print(np.concatenate(recs, axis=0).shape)
file.close()
"""
numattempts = 10
print(timeit.timeit(allin, number=numattempts, globals=globals())/numattempts)
print(timeit.timeit(bins, number=numattempts, globals=globals())/numattempts)

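I'm using timeit to compare the average time each task takes. I'd like to know whether this is a fair test, especially for the numpy implementation. Does timeit clear local memory between runs (i.e., does it effectively call del data and del locate between each of the number runs of allin)? I just want to make sure I'm not accidentally forcing the numpy approach to work in swap and thereby really slowing things down.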

(Note that the numpy array takes up about 60 MB when loaded, so loading it once won't push into swap, but loading it many times could start pushing into swap.)

Answer 1

Since timeit is implemented in regular Python, it's easy to see what it does: https://hg.python.org/cpython/file/2.7/Lib/timeit.py

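To answer the question, though: no, it will not do a del data, because that is not part of the statement or the setup you pass to timeit. If you want that behavior, you should add it as a setup method.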
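One detail to keep in mind when placing cleanup code: timeit runs the setup once per call and the statement number times. The following is a minimal sketch of that behavior, using hypothetical counter functions (setup_fn and stmt_fn are illustrative names, not part of the question's code):

```python
import timeit

# Hypothetical counters recording how often each piece of code runs.
calls = {"setup": 0, "stmt": 0}


def setup_fn():
    calls["setup"] += 1


def stmt_fn():
    calls["stmt"] += 1


# timeit accepts callables as well as strings for stmt and setup.
timeit.timeit(stmt_fn, setup=setup_fn, number=5)

print(calls)  # {'setup': 1, 'stmt': 5}
```

Under this behavior, cleanup that must happen on every run (such as del data) would need to sit at the end of the timed statement itself, at the cost of being included in the measured time.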

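In this specific case you are reassigning the same variables, which results in a new block of memory being allocated on every run, since timeit disables the garbage collector by default.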
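One nuance worth checking here: disabling the garbage collector only switches off CPython's cyclic collector. Reference counting still runs, so rebinding a name should release the previous object as soon as nothing else refers to it. The sketch below is a way to observe this, assuming CPython; Block is a hypothetical stand-in for the large array, and the namespace dict is an illustrative substitute for globals():

```python
import gc
import timeit
import weakref


class Block:
    """Hypothetical stand-in for a large array such as data."""


refs = []          # weak references to every Block created
alive_counts = []  # how many Blocks are still alive after each rebinding
gc_states = []     # gc.isenabled() as seen inside the timed statement

stmt = """
data = Block()                      # rebinding drops the previous Block
refs.append(weakref.ref(data))
alive_counts.append(sum(r() is not None for r in refs))
gc_states.append(gc.isenabled())
"""

# Explicit namespace for the timed code (assumption: Python 3.5+ for `globals`).
namespace = {
    "Block": Block,
    "weakref": weakref,
    "gc": gc,
    "refs": refs,
    "alive_counts": alive_counts,
    "gc_states": gc_states,
}
timeit.timeit(stmt, number=3, globals=namespace)

print(gc_states)     # collector state during each run
print(alive_counts)  # live Blocks after each run
```

On CPython, the collector should be reported as disabled in every run while only one Block stays alive at a time, which suggests that memory would peak at roughly two copies of the array (the old one and the new one during rebinding) rather than growing with the number of runs. If the cyclic collector is wanted during timing, the timeit documentation describes passing gc.enable() as the setup statement.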

Does timeit clear local memory after each iteration?
