Question

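What are the efficient (in terms of performance and memory) Python 3 options for extracting the sheet names and, for a given sheet, the column names from a very large .xlsx file?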

I have tried using pandas:

For sheet names, using pd.ExcelFile:

    xl = pd.ExcelFile(filename)
    return xl.sheet_names

For column names, using pd.ExcelFile:

    xl = pd.ExcelFile(filename)
    df = xl.parse(sheetname, nrows=2, **kwargs)
    df.columns

For column names, using pd.read_excel with and without nrows (> v23):

    df = pd.read_excel(io=filename, sheet_name=sheetname, nrows=2)
    df.columns

However, both pd.ExcelFile and pd.read_excel seem to read the entire .xlsx into memory, and are therefore slow.

Thanks a lot!

Answer 1

Here is the simplest way I can share with you:

    # read the sheet file
    import pandas as pd
    my_sheets = pd.ExcelFile('sheet_filename.xlsx')
    my_sheets.sheet_names

Answer 2

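According to this SO question, reading Excel files in chunks is not supported (see this issue on GitHub), and using nrows will always read the whole file into memory first.

Possible solutions:

Convert the sheet to CSV and read it in chunks.
Use something other than pandas. See this page for a list of alternative libraries.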
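
As one illustration of the "something other than pandas" route, here is a minimal sketch that reads only the sheet names using the standard library. An .xlsx file is a ZIP archive, and the sheet list lives in a small XML part, so no worksheet data needs to be decompressed. The function name `list_sheet_names` is hypothetical, and the code assumes the workbook part sits at the conventional path `xl/workbook.xml` and uses the usual (transitional) SpreadsheetML namespace; files saved in strict OOXML mode or with an unusual layout would need adjustments.

```python
import zipfile
import xml.etree.ElementTree as ET

# Assumption: the file uses the transitional SpreadsheetML namespace.
MAIN_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def list_sheet_names(path):
    """Return the sheet names in workbook order without reading sheet data."""
    with zipfile.ZipFile(path) as archive:
        # Assumption: the workbook part is stored at the conventional location.
        # Only this small part is decompressed; the worksheets are not touched.
        with archive.open("xl/workbook.xml") as part:
            root = ET.parse(part).getroot()
    # Each <sheet> element carries the user-visible name in its "name" attribute.
    return [sheet.get("name") for sheet in root.iter(MAIN_NS + "sheet")]
```

Calling `list_sheet_names("yourfile.xlsx")` should return a plain list of strings. Column names still require reading the first row of the sheet itself, which this sketch does not attempt.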

Answer 3

This program lists all the sheets in an Excel workbook. It uses pandas.

    import pandas as pd
    with pd.ExcelFile('yourfile.xlsx') as xlsx:
        sh = xlsx.sheet_names
    print("This workbook has the following sheets : ", sh)

Answer 4

I think this will help meet the need:

    from openpyxl import load_workbook

    # filename is the path to the .xlsx file, as in the question
    workbook = load_workbook(filename, read_only=True)

    data = {}   # maps each sheet title to its header row

    for sheet in workbook.worksheets:
        # only the first row is requested, so the rest of the sheet is not iterated
        for value in sheet.iter_rows(min_row=1, max_row=1, values_only=True):
            data[sheet.title] = value   # a tuple with the heading of each column
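
Since the question asks about a given sheet, the same idea can be narrowed to a single sheet. The following is a minimal sketch, not a benchmarked solution: `read_header` is a hypothetical helper name, and it assumes the column names are in the first row of the sheet.

```python
from openpyxl import load_workbook


def read_header(path, sheet_name):
    """Return the first row of one sheet as a tuple of cell values."""
    # read_only=True makes openpyxl stream rows lazily instead of
    # building the whole cell grid in memory.
    workbook = load_workbook(path, read_only=True)
    try:
        sheet = workbook[sheet_name]
        rows = sheet.iter_rows(min_row=1, max_row=1, values_only=True)
        # The default covers a sheet that yields no rows at all.
        return next(rows, ())
    finally:
        # A read-only workbook keeps the file open until it is closed.
        workbook.close()
```

For example, `read_header(filename, sheetname)` would return something like a tuple of header strings, and `workbook.sheetnames` gives the sheet names from the same read-only workbook.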

Efficiently extracting sheet names and column names from a large .xlsx using Python3
