Question

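I have a relatively medium-sized spreadsheet: 212 rows x 56 columns of data.

My loop gets slower the closer the match is to the bottom of the spreadsheet. It can return a response in as little as 200 ms, or take as long as 7000 ms.

How can I speed up the search so that the time stays roughly constant, or at least gets significantly faster, so that it never exceeds 500 ms?

Here is how I open the spreadsheet: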

wb = openpyxl.load_workbook('data/%s' % filename, read_only=True)
sheet = wb.get_sheet_by_name('Service%s' % service)

Here is my loop:

for i in range(3, sheet.max_row+1):
    if str(sheet.cell(row=i, column=1).value) == country:
        for x in range(2, sheet.max_column+1):
            if weight > float(sheet.cell(row=2, column=sheet.max_column).value):
                abort(404, "Maximum Weight Exceeded for Service Class")

            if weight < float(sheet.cell(row=2, column=2).value):
                return float(sheet.cell(row=i, column=2).value)

            if weight == float(sheet.cell(row=2, column=x).value):
                return float(sheet.cell(row=i, column=x).value)

            if weight < float(sheet.cell(row=2, column=x).value):
                return float(sheet.cell(row=i, column=x).value)

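Edit:

Following everyone's suggestions, I refactored the method. It seems much faster, but I'm not sure how to access a specific row while nested inside the for loop. New code below: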

if weight > float(sheet.cell(row=2, column=sheet.max_column).value):
    abort(404, "Maximum Weight Exceeded for Service Class")

minweight = float(sheet.cell(row=2, column=2).value)

for row in sheet.rows:
    if row[0].value == country:
        if weight < minweight:
            return row[1].value

        for cell in row[1:]: # skip first item
            if weight <= float(cell.value):
            # This is wrong. I need to compare weight to cell values in the 2nd row
                return float(cell.value)

Edit 2 - now runs in ~300 ms:

from itertools import islice, izip  # Python 2; on Python 3 use the built-in zip

if weight > float(sheet.cell(row=2, column=sheet.max_column).value):
    abort(404, "Maximum Weight Exceeded for Service Class")

minweight = float(sheet.cell(row=2, column=2).value)

ignore_first_row, weight_list = islice(sheet.rows, 0, 2)

for row in islice(sheet.rows, 2, sheet.max_row):
    if row[0].value == country:
        if weight < minweight:
            return row[1].value # return country's min rate

        for ratecell, weightcell in izip(row, weight_list):
            if weight <= float(weightcell.value):
                return float(ratecell.value)

Answer 1

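I generated a single-sheet xlsx file with 57 columns and 200 rows. Every column bar the last contains a randomly generated 100-character string; the last column holds an arbitrary but non-random 6-character sequence that serves as the search target.

The following code, which uses sheet.rows, is about 7 times faster (350 ms):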

for row in sheet.rows:
    if str(row[sheet.max_column-1].value) == needle:
        # needle defined to match only the last row
        print 'found'
        break

than a stripped-down version of your code (2400 ms):

for i in xrange(1, sheet.max_row+1):
    if str(sheet.cell(row=i, column=sheet.max_column).value) == needle:
        # needle defined to match only the last row
        print 'found'
        break

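Note that I have an SSD and a fast processor; YMMV depending on hardware and actual data. Unless both the data and the hardware are essentially constant, you cannot guarantee that a search will take less than a given time.

As I said in the comments, you really should learn to benchmark your code with cProfile or something similar.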
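In case it helps, here is a minimal sketch of what profiling with cProfile can look like. It is written for Python 3 and uses a made-up in-memory list instead of your workbook; find_row and the sample data are names I invented for illustration, so swap in your own lookup function.

```python
import cProfile
import io
import pstats


def find_row(rows, needle):
    """Sequential search: index of the first row whose last cell equals needle."""
    for idx, row in enumerate(rows):
        if row[-1] == needle:
            return idx
    return -1


# Made-up stand-in data: 200 rows x 57 columns, target only in the last row.
rows = [["x" * 100] * 56 + ["row%03d" % i] for i in range(200)]

profiler = cProfile.Profile()
profiler.enable()
found_at = find_row(rows, "row199")
profiler.disable()

# Collect the five most expensive calls, sorted by cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

For a whole script, running `python -m cProfile -s cumulative yourscript.py` gives a similar table without touching the code.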

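In my comments I also noted that a sequential search inherently takes longer the further into the sequence the match sits. To change the time complexity of the search you need to change the algorithm, which means structuring the data differently (i.e., not using a flat file). Binary search is usually much faster than sequential search, but it requires sorted data.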
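As a rough illustration of restructuring the data, the sketch below assumes the sheet has been read once into plain Python objects: a sorted list of weight breaks and a dict of rate lists keyed by country. The values and the names weight_breaks, rates_by_country and lookup_rate are all hypothetical. The dict lookup is hashing rather than binary search; the binary search (bisect_left) is used only on the sorted weight breaks.

```python
from bisect import bisect_left

# Hypothetical rate table, loaded once from the sheet into memory.
# weight_breaks must be sorted ascending for binary search to be valid.
weight_breaks = [0.5, 1.0, 2.0, 5.0, 10.0]
rates_by_country = {
    "France": [4.0, 6.5, 9.0, 15.0, 24.0],
    "Japan": [7.0, 11.0, 16.0, 28.0, 45.0],
}


def lookup_rate(country, weight):
    """Return the rate for the first weight break that is >= weight."""
    if weight > weight_breaks[-1]:
        raise ValueError("Maximum weight exceeded for service class")
    rates = rates_by_country[country]         # hash lookup, no scan over rows
    col = bisect_left(weight_breaks, weight)  # binary search over the breaks
    return rates[col]
```

With this layout the lookup cost no longer depends on where the country sits in the sheet. An unknown country raises KeyError here, which you would probably want to turn into your own error response.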

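Depending on what else you need to do (Do you need to modify the data in the sheet? How often? How big is your data? Does it really have to stay in an Excel sheet?), it may be possible to improve your search much further, or not at all.

As CharlieClark pointed out in the comments, row[-1] is probably faster than row[sheet.max_column-1] (or you could hoist the index out of the loop, since your rows are always the same length), and you don't need to cast cell.value to a string if you expect string data in those cells.

Update: sheet.rows is a property that returns a generator, at least in v2.3.5, so no, you can't slice it unless you use itertools.islice.

However, you can store the returned generator in a variable, call .next() twice to retrieve and store the first two rows, and then iterate over the rest.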

row_gen_use_once = sheet.rows
# should really try/except for StopIteration in the next() calls in case there are less than two rows, or else check the row count beforehand
first_row = row_gen_use_once.next()
second_row = row_gen_use_once.next()

for row in row_gen_use_once:
    pass # blah blah do stuff
    # now you can access the second row here :)
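On the StopIteration point in the comment above: one way to avoid the try/except is the built-in next() with a default value, which also works on Python 3 where the .next() method no longer exists. This is only a sketch; split_header_rows and the sample tuples are made up, and with openpyxl you would pass sheet.rows instead of the list.

```python
def split_header_rows(rows):
    """Return (first_row, second_row, iterator over the remaining rows)."""
    row_iter = iter(rows)
    first_row = next(row_iter, None)   # None instead of StopIteration if missing
    second_row = next(row_iter, None)
    if second_row is None:
        raise ValueError("expected at least two header rows")
    return first_row, second_row, row_iter


# Stand-in data in place of sheet.rows.
sample = [("title",), ("", 0.5, 1.0), ("France", 4.0, 6.5), ("Japan", 7.0, 11.0)]
first_row, second_row, rest = split_header_rows(sample)
data_rows = list(rest)  # only the rows after the two header rows
```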

Or you can use enumerate and save the second row inside the loop:

first_row = None
second_row = None

for idx, row in enumerate(sheet.rows):
    if idx == 0:
        first_row = row
    elif idx == 1:
        second_row = row
    else:
        pass
        # blah blah do stuff

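This means a few extra checks inside the loop, but thanks to branch prediction they shouldn't add much overhead.

The itertools.islice version, which I think is the best solution: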

from itertools import islice
first_row, second_row = islice(sheet.rows, 0, 2)

for row in islice(sheet.rows, 2, sheet.max_row):
    pass # do stuff

Unless you are using Python 3, in which case you can simply do:

first_row, second_row, *other_rows = sheet.rows

for row in other_rows:
    pass # do stuff

Answer 2

Here are some of my immediate thoughts:

for i in xrange(3, sheet.max_row+1):
    if str(sheet.cell(row=i, column=1).value) == country:

        if weight > float(sheet.cell(row=2, column=sheet.max_column).value):
            abort(404, "Maximum Weight Exceeded for Service Class")
        if weight < float(sheet.cell(row=2, column=2).value):
            return float(sheet.cell(row=i, column=2).value)

        for x in xrange(2, sheet.max_column+1):
            if weight <= float(sheet.cell(row=2, column=x).value):
                return float(sheet.cell(row=i, column=x).value)

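This merges two of the logic checks into one (<=) and moves the other two logic checks out of the inner loop.

Also, depending on where you compute weight, this statement should live somewhere else in your code: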

if weight > float(sheet.cell(row=2, column=sheet.max_column).value):
    abort(404, "Maximum Weight Exceeded for Service Class")

It doesn't use i or x, so there is no need to waste time checking it on every pass through the loop.
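To make the hoisting concrete, here is a rough sketch that works on a plain list of tuples rather than an openpyxl sheet. find_rate, the table layout and the ValueError (standing in for your abort(404, ...)) are my assumptions, not your actual code.

```python
def find_rate(rows, country, weight):
    """rows[1] holds the weight breaks; rows[2:] hold (country, rate, rate, ...)."""
    breaks = [float(v) for v in rows[1][1:]]  # convert the header once, not per row

    # The maximum-weight check does not depend on the row being scanned,
    # so it runs once, before the loop.
    if weight > breaks[-1]:
        raise ValueError("Maximum Weight Exceeded for Service Class")

    for row in rows[2:]:
        if row[0] == country:
            # A weight below the first break matches on the first comparison,
            # so no separate minimum-weight branch is needed here.
            for brk, rate in zip(breaks, row[1:]):
                if weight <= brk:
                    return float(rate)
    return None  # country not found
```

For example, with a header row of breaks 0.5, 1.0, 2.0 and a row ("Japan", 7.0, 11.0, 16.0), find_rate(rows, "Japan", 0.7) returns 11.0.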

Can you clarify what this block is supposed to do:

if weight < float(sheet.cell(row=2, column=2).value):
    return float(sheet.cell(row=i, column=2).value)

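Within your loop, weight does not change. This is a static check that returns from your function using whatever the current value of i is. Based on the code you have shown, it doesn't make sense.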

How can I speed up reading from a small spreadsheet?
