Question

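I have an Excel spreadsheet that I need to import into SQL Server every day. The spreadsheet will contain around 250,000 rows across around 50 columns. I have tested both openpyxl and xlrd using nearly identical code.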

Here is the code I am using (minus the debugging statements):

import xlrd
import openpyxl

def UseXlrd(file_name):
    workbook = xlrd.open_workbook(file_name, on_demand=True)
    worksheet = workbook.sheet_by_index(0)
    first_row = []
    for col in range(worksheet.ncols):
        first_row.append(worksheet.cell_value(0,col))
    data = []
    for row in range(1, worksheet.nrows):
        record = {}
        for col in range(worksheet.ncols):
            if isinstance(worksheet.cell_value(row,col), str):
                record[first_row[col]] = worksheet.cell_value(row,col).strip()
            else:
                record[first_row[col]] = worksheet.cell_value(row,col)
        data.append(record)
    return data


def UseOpenpyxl(file_name):
    wb = openpyxl.load_workbook(file_name, read_only=True)
    sheet = wb.active
    first_row = []
    for col in range(1,sheet.max_column+1):
        first_row.append(sheet.cell(row=1,column=col).value)
    data = []
    for r in range(2,sheet.max_row+1):
        record = {}
        for col in range(sheet.max_column):
            if isinstance(sheet.cell(row=r,column=col+1).value, str):
                record[first_row[col]] = sheet.cell(row=r,column=col+1).value.strip()
            else:
                record[first_row[col]] = sheet.cell(row=r,column=col+1).value
        data.append(record)
    return data

xlrd_results = UseXlrd('foo.xls')
openpyxl_results = UseOpenpyxl('foo.xls')

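Passing the same Excel file containing 3500 rows gives drastically different running times. Using xlrd I can read the entire file into a list of dictionaries in under 2 seconds. Using openpyxl I get the following results: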

Reading Excel File...
Read 100 lines in 114.14509415626526 seconds
Read 200 lines in 471.43183994293213 seconds
Read 300 lines in 982.5288782119751 seconds
Read 400 lines in 1729.3348784446716 seconds
Read 500 lines in 2774.886833190918 seconds
Read 600 lines in 4384.074863195419 seconds
Read 700 lines in 6396.7723388671875 seconds
Read 800 lines in 7998.775000572205 seconds
Read 900 lines in 11018.460735321045 seconds

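While I can use xlrd in the final script, I would have to hard-code a lot of formatting because of various issues (i.e. int reads as float, date reads as int, datetime reads as float). Since I need to reuse this code for a few more imports, it makes no sense to hard-code specific columns to format them properly and then maintain similar code across 4 different scripts.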

Any advice on how to proceed?

Answer 1

You can just iterate over the sheet:

def UseOpenpyxl(file_name):
    wb = openpyxl.load_workbook(file_name, read_only=True)
    sheet = wb.active
    rows = sheet.rows
    first_row = [cell.value for cell in next(rows)]
    data = []
    for row in rows:
        record = {}
        for key, cell in zip(first_row, row):
            if cell.data_type == 's':
                record[key] = cell.value.strip()
            else:
                record[key] = cell.value
        data.append(record)
    return data

This should scale to large files. You may want to chunk your result if the list data gets too large.
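One way to chunk the result, as a minimal sketch: the helper name iter_chunks and the chunk size are illustrative assumptions, not part of openpyxl. It takes any iterable of records and yields lists of bounded size, so each batch can be written to the database and then discarded.

```python
from itertools import islice


def iter_chunks(records, chunk_size):
    """Yield lists of at most chunk_size items from any iterable.

    Hypothetical helper: it only groups items and knows nothing about Excel.
    """
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Usage with stand-in data; real records would come from the row loop above.
batches = list(iter_chunks(range(7), 3))
```

To keep memory flat, the row loop would have to yield each record instead of appending it to data, so that only one batch is held at a time.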

Now the openpyxl version takes about twice as long as the xlrd version:

%timeit xlrd_results = UseXlrd('foo.xlsx')
1 loops, best of 3: 3.38 s per loop

%timeit openpyxl_results = UseOpenpyxl('foo.xlsx')
1 loops, best of 3: 6.87 s per loop

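Note that xlrd and openpyxl may interpret what is an integer and what is a float slightly differently. For my test data, I needed to add float() to make the outputs comparable: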

def UseOpenpyxl(file_name):
    wb = openpyxl.load_workbook(file_name, read_only=True)
    sheet = wb.active
    rows = sheet.rows
    first_row = [float(cell.value) for cell in next(rows)]
    data = []
    for row in rows:
        record = {}
        for key, cell in zip(first_row, row):
            if cell.data_type == 's':
                record[key] = cell.value.strip()
            else:
                record[key] = float(cell.value)
        data.append(record)
    return data

Now both versions give the same results for my test data:

>>> xlrd_results == openpyxl_results
True
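If the goal is to compare or store values rather than to match xlrd exactly, the conversion could also go the other way. The sketch below turns whole-number floats into ints and leaves everything else alone. The function names are hypothetical, and the rule is an assumption about the data: a column that is genuinely floating point but happens to hold whole values would be converted as well.

```python
def normalize_number(value):
    """Return whole-number floats as int; leave every other value unchanged."""
    if isinstance(value, float) and value.is_integer():
        return int(value)
    return value


def normalize_record(record):
    """Apply normalize_number to every value of one row dictionary."""
    return {key: normalize_number(val) for key, val in record.items()}


# Stand-in row; a real one would come from either reader function.
cleaned = normalize_record({"id": 3.0, "price": 2.5, "name": "x", "note": None})
```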

Answer 2

Sounds like a perfect candidate for the Pandas module:

import pandas as pd
from sqlalchemy import create_engine
import pyodbc

# pyodbc
#
# assuming the following:
# username: scott
# password: tiger
# DSN: mydsn
engine = create_engine('mssql+pyodbc://scott:tiger@mydsn')

# pymssql
#
#engine = create_engine('mssql+pymssql://scott:tiger@hostname:port/dbname')


df = pd.read_excel('foo.xls')

# write the DataFrame to a table in the sql database
df.to_sql("table_name", engine)

See the description of the DataFrame.to_sql() function in the pandas documentation.

PS: It should be pretty fast and easy to use.
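For a daily import, a few to_sql() parameters are worth knowing: if_exists controls what happens when the table already exists, index=False skips writing the DataFrame index as a column, and chunksize writes the rows in batches. The sketch below is only an illustration under stated assumptions: a small hand-built DataFrame stands in for the result of pd.read_excel(), and an in-memory SQLite connection stands in for the SQL Server engine so the example runs without a database server.

```python
import sqlite3

import pandas as pd

# Assumption: this small frame stands in for pd.read_excel('foo.xls').
df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Assumption: in-memory SQLite stands in for the SQL Server engine.
conn = sqlite3.connect(":memory:")

# Replace the table on every run and write the rows in batches of 2.
df.to_sql("table_name", conn, if_exists="replace", index=False, chunksize=2)

row_count = conn.execute("SELECT COUNT(*) FROM table_name").fetchone()[0]
```

With the real engine from above, the same call would take engine in place of conn.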

Answer 3

You are calling "sheet.max_column" and "sheet.max_row" many times. Don't do that; call each one only once. If you call it in the for loop, max_column or max_row is recalculated each time.

I modified the code as follows for your reference:

def UseOpenpyxl(file_name):
    wb = openpyxl.load_workbook(file_name, read_only=True)
    sheet = wb.active
    max_col = sheet.max_column
    max_row = sheet.max_row
    first_row = []
    for col in range(1,max_col +1):
        first_row.append(sheet.cell(row=1,column=col).value)
    data = []
    for r in range(2,max_row +1):
        record = {}
        for col in range(max_col):
            if isinstance(sheet.cell(row=r,column=col+1).value, str):
                record[first_row[col]] = sheet.cell(row=r,column=col+1).value.strip()
            else:
                record[first_row[col]] = sheet.cell(row=r,column=col+1).value
        data.append(record)
    return data
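To see how often such a property is actually read, here is a small sketch using a hypothetical CountingSheet class rather than a real openpyxl worksheet. Python evaluates the argument of range() once each time a for statement starts, so the outer loop reads max_row once, but the inner loop's range(sheet.max_column) is re-evaluated for every row. Hoisting the value into a local variable reduces that to a single read.

```python
class CountingSheet:
    """Hypothetical stand-in for a worksheet that counts max_column reads."""

    def __init__(self, n_rows, n_cols):
        self._n_rows = n_rows
        self._n_cols = n_cols
        self.lookups = 0

    @property
    def max_column(self):
        self.lookups += 1
        return self._n_cols

    @property
    def max_row(self):
        return self._n_rows


def lookups_inline(sheet):
    """Read max_column inside the row loop, as in the question's code."""
    for r in range(2, sheet.max_row + 1):
        for col in range(sheet.max_column):  # evaluated once per row
            pass
    return sheet.lookups


def lookups_hoisted(sheet):
    """Read max_column once before the loops, as in the code above."""
    max_col = sheet.max_column
    for r in range(2, sheet.max_row + 1):
        for col in range(max_col):
            pass
    return sheet.lookups


# 101 rows means 100 data rows after the header.
inline_count = lookups_inline(CountingSheet(101, 5))
hoisted_count = lookups_hoisted(CountingSheet(101, 5))
```

How much time this saves on a real worksheet depends on how expensive the property is, which this sketch does not measure.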
