Question

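I'm currently using PyExcelerator to read Excel files, but it is extremely slow. Because I always need to open Excel files larger than 100 MB, loading a single file takes more than 20 minutes.

The functionality I need is:

Open an Excel file, select specific tables (sheets), and load them into a dict or list object.
Sometimes: select specific columns and load only the whole rows in which a given column has a specific value.
Read password-protected Excel files.

The code I'm using now is: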

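import pyExcelerator
from collections import defaultdict
from types import StringType, UnicodeType  # Python 2 only

book = pyExcelerator.parse_xls(filepath)
# book[0][1] is the first sheet's {(row, column): value} mapping;
# cells that are missing from it default to ''
parsed_dictionary = defaultdict(lambda: '', book[0][1])
number_of_columns = 44
result_list = []
number_of_rows = 500000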
for i in range(0, number_of_rows):
    ok = False
    result_list.append([])
    for h in range(0, number_of_columns):
        item = parsed_dictionary[i,h]
        if type(item) is StringType or type(item) is UnicodeType:
            item = item.replace("\t","").strip()
        result_list[i].append(item)
        if item != '':
            ok = True
    if not ok:
        break

Any suggestions?

Answer 1

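pyExcelerator appears to be unmaintained. To write xls files, use xlwt, which is a fork of pyExcelerator with bug fixes and many enhancements. The (very basic) xls reading capability of pyExcelerator was removed from xlwt. To read xls files, use xlrd.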
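As a rough illustration of what reading with xlrd can look like, here is a minimal sketch in Python 3 that loads one sheet row by row and keeps only the rows where a given column has a given value. The helper names (clean_cell, rows_matching, iter_sheet_rows) are made up for this example, and recent xlrd releases (2.x) read only .xls files, so treat it as a starting point rather than a drop-in solution.

```python
def clean_cell(value):
    """Strip tabs and surrounding whitespace from text cells; leave other types alone."""
    if isinstance(value, str):
        return value.replace("\t", "").strip()
    return value


def rows_matching(rows, column_index, wanted):
    """Yield only the cleaned rows whose cell at column_index equals wanted."""
    for row in rows:
        cleaned = [clean_cell(cell) for cell in row]
        if cleaned[column_index] == wanted:
            yield cleaned


def iter_sheet_rows(path, sheet_index=0):
    """Yield each row of one sheet as a list of cell values, using xlrd."""
    import xlrd  # third-party; imported lazily so the helpers above work without it

    # on_demand=True asks xlrd to load a sheet only when it is requested.
    book = xlrd.open_workbook(path, on_demand=True)
    try:
        sheet = book.sheet_by_index(sheet_index)
        for row_index in range(sheet.nrows):
            yield sheet.row_values(row_index)
    finally:
        book.release_resources()
```

With a hypothetical file name and filter value, usage would be something like list(rows_matching(iter_sheet_rows("data.xls"), 2, "ACME")), which returns the full rows whose third column equals "ACME".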

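If it takes 20 minutes to load a 100 MB xls file, you must be using one or more of the following: a slow computer, a computer with very little available memory, or an old version of Python.

Neither pyExcelerator nor xlrd reads password-protected files.

Here is a link that covers xlrd and xlwt.

Disclaimer: I'm the author of xlrd and the maintainer of xlwt.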

Answer 2

xlrd is great for reading files and xlwt is great for writing them. In my experience, both are better than pyExcelerator.

Answer 3

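You could try preallocating the list to its full size in a single statement instead of appending one item at a time, like this (one large memory allocation should be faster than many small ones):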

book = pyExcelerator.parse_xls(filepath)
parsed_dictionary = defaultdict(lambda: '', book[0][1])
number_of_columns = 44
number_of_rows = 500000
result_list = [[] for _ in range(number_of_rows)]  # one independent list per row
for i in range(0, number_of_rows):
    ok = False
    #result_list.append([])
    for h in range(0, number_of_columns):
        item = parsed_dictionary[i,h]
        if type(item) is StringType or type(item) is UnicodeType:
            item = item.replace("\t","").strip()
        result_list[i].append(item)
        if item != '':
            ok = True
    if not ok:
        break

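If this gives a noticeable performance improvement, you could also try preallocating each list item with the number of columns and then assigning values by index rather than appending them one at a time. Here is a snippet that creates a 10x10 two-dimensional list in a single statement, with every element initialized to 0: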

L = [[0] * 10 for i in range(10)]
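One thing to watch when preallocating nested lists: the list comprehension matters. A small Python 3 sketch (the variable names are just for illustration) showing how multiplying lists behaves compared with a comprehension:

```python
rows, cols = 3, 2

# Multiplying an empty list yields another empty list, so nothing is preallocated.
empty = [] * rows

# Multiplying a list that holds one inner list repeats the same reference,
# so every "row" is the same object.
aliased = [[""] * cols] * rows
aliased[0][0] = "x"

# A comprehension builds a fresh inner list for every row.
independent = [[""] * cols for _ in range(rows)]
independent[0][0] = "x"
```

After this runs, writing to aliased[0] has also changed aliased[1] and aliased[2], while only the first row of independent was modified.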

Folded into your code, it might look something like this:

book = pyExcelerator.parse_xls(filepath)
parsed_dictionary = defaultdict(lambda: '', book[0][1])
number_of_columns = 44
number_of_rows = 500000
# one entry per row, each preallocated with one slot per column
result_list = [[''] * number_of_columns for x in range(number_of_rows)]
for i in range(0, number_of_rows):
    ok = False
    #result_list.append([])
    for h in range(0, number_of_columns):
        item = parsed_dictionary[i,h]
        if type(item) is StringType or type(item) is UnicodeType:
            item = item.replace("\t","").strip()
        result_list[i][h] = item
        if item != '':
            ok = True
    if not ok:
        break

Answer 4

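Unrelated to your question: if you're trying to check that none of the columns is an empty string, set ok = True initially and do this in the inner loop instead (ok = ok and item != ''). Also, you can use isinstance(item, basestring) to test whether a variable is a string.

Revised version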

for i in range(0, number_of_rows):
    ok = True
    result_list.append([])
    for h in range(0, number_of_columns):
        item = parsed_dictionary[i,h]
        if isinstance(item, basestring):
            item = item.replace("\t","").strip()
        result_list[i].append(item)
        ok = ok and item != ''

    if not ok:
        break
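For anyone on Python 3, where basestring no longer exists, the same check could be sketched roughly as below. The function name load_rows and the cells parameter (a mapping from (row, column) to a value) are assumptions made for this example. Note one difference from the loop above: the row that contains an empty cell is not kept in the result.

```python
def load_rows(cells, number_of_rows, number_of_columns):
    """Collect cleaned rows until the first row that contains an empty cell.

    `cells` maps (row, column) to a value; missing cells count as ''.
    """
    result = []
    for i in range(number_of_rows):
        row = []
        for h in range(number_of_columns):
            item = cells.get((i, h), "")
            if isinstance(item, str):  # Python 3: str takes the place of basestring
                item = item.replace("\t", "").strip()
            row.append(item)
        # all() is True only when no cell in the row is an empty string
        if not all(item != "" for item in row):
            break
        result.append(row)
    return result
```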

Answer 5

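I recently built a library that may be of interest: https://github.com/ktr/sxl. Essentially, it tries to "stream" Excel files the way Python does with normal files, so it is very fast when you only need a subset of the data (especially if it is near the beginning of the file).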
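This is not sxl's API, just a generic Python 3 illustration of why streaming helps when only the start of the data is needed. The stream_rows generator is a hypothetical stand-in for a row source; rows are produced one at a time, so stopping early means the rest are never built.

```python
from itertools import islice

produced = []  # records which rows were actually generated


def stream_rows(total_rows):
    """Hypothetical row source that yields one row at a time."""
    for index in range(total_rows):
        produced.append(index)
        yield [index, "value-%d" % index]


# Only the first three rows are ever generated, even though the
# source could produce 500000 of them.
first_rows = list(islice(stream_rows(500000), 3))
```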

How can I open Excel files quickly in Python?
