I'm a Python beginner, but I'm writing a script that uses openpyxl to read a large xlsx file (60000x187) into NumPy arrays for some machine learning. My code:
from openpyxl import load_workbook
import re
from numpy import *

wb = load_workbook(filename = 'dataSheet.xlsx', use_iterators = True) #dataSheet.xlsx
ws1 = wb.get_sheet_by_name(name = 'LogFileData')

startCol = 1 #index from 1
startRow = 2 #start at least from 2 because header is in 1st row
endCol = ws1.get_highest_column() #index of last used column, from 1
endRow = ws1.get_highest_row() #index of last used row, indexed from 1
diff = endRow - startRow + 1 #number of rows in the data array
header = [] #contains the column labels
data = zeros((0,endCol), dtype=float64) #2D array that holds the data

#puts the column headers into a list
for row in ws1.get_squared_range(1, 1, endCol, 1): #indexed from 1
    for cell in row:
        for match in re.findall("<(.*?)>", cell.value):
            header.append(match)

#indexed from 1 when using the ws1
#index from 0 when using the Numpy arrays, tempRow, tempPt, data
for index, row in enumerate(ws1.iter_rows(row_offset=1)):
    tempRow = zeros((1,0), dtype=float64)
    tempPt = zeros((1,1), dtype=float64)
    for cell in row:
        value = cell.value
        if isinstance(value, basestring):
            tempPt[0][0] = None
        else:
            tempPt[0][0]=value
        tempRow = hstack((tempRow,tempPt))
    data = vstack((data,tempRow))
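Is openpyxl with the optimized reader the fastest and most space-efficient way to do this? A colleague mentioned that CSV files might be faster when used with itertools or a similar package.
Edit 1: My specs: Ubuntu 10.04 LTS on VMware, Python 2.6.5, Intel i5 quad-core 2.5 GHz, Windows 7 Enterprise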
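Answer 0 (score: 3)
A benchmark I have for the optimised reader takes about 20 s for 1 million numeric cells on a 2009 MacBook. I'd expect your code to be slightly slower because of the cell indirection and the pattern matching (compile the pattern outside the loop), but I'd think the speed would still be acceptable. CSV will, of course, be faster if you can get it easily.
I'd be interested to know your numbers.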
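The two suggestions in this answer (compile the pattern once, and prefer CSV when it is available) can be sketched as follows. This is a minimal illustration rather than the answerer's own code: it assumes the sheet has been exported to CSV with the same `<label>` style headers, it is written for Python 3 rather than the Python 2.6 used in the question, and the names `HEADER_TAG`, `to_float` and `load_csv` are hypothetical. An in-memory string stands in for the real file.

```python
import csv
import io
import re

# Compiled once, outside any loop, as the answer suggests.
HEADER_TAG = re.compile(r"<(.*?)>")


def to_float(text):
    """Convert one CSV field to float; text that is not a number becomes NaN."""
    try:
        return float(text)
    except ValueError:
        return float("nan")


def load_csv(stream):
    """Read a header row and the data rows from an open text stream."""
    reader = csv.reader(stream)
    # Pull the labels out of the "<label>" header cells.
    header = [m for field in next(reader) for m in HEADER_TAG.findall(field)]
    # Collect plain lists first instead of growing an array row by row.
    rows = [[to_float(field) for field in row] for row in reader]
    return header, rows


# Tiny in-memory sample standing in for a CSV export of the sheet.
sample = io.StringIO("<a>,<b>,<c>\n1,2.5,x\n4,,6\n")
header, rows = load_csv(sample)
print(header)
print(len(rows), len(rows[0]))
```

The collected rows can then be converted in a single step with `np.array(rows, dtype=np.float64)`, which avoids the repeated `hstack`/`vstack` copies in the question's loop.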
Answer 1 (score: -3)
The fastest way to read an xlsx sheet.
A 56 MB file with over 500k rows and 4 sheets took 6 seconds to process.
import zipfile
from bs4 import BeautifulSoup

paths = []
mySheet = 'Sheet Name'
filename = 'xlfile.xlsx'
file = zipfile.ZipFile(filename, "r")

# An xlsx file is a zip archive; xl/workbook.xml lists the sheets.
for name in file.namelist():
    if name == 'xl/workbook.xml':
        data = BeautifulSoup(file.read(name), 'html.parser')
        sheets = data.find_all('sheet')
        for sheet in sheets:
            # Assumes the sheet XML file is numbered by sheetId.
            paths.append([sheet.get('name'), 'xl/worksheets/sheet' + str(sheet.get('sheetid')) + '.xml'])

for path in paths:
    if path[0] == mySheet:
        # The with block closes the reader, so no explicit close() is needed.
        with file.open(path[1]) as reader:
            for row in reader:
                print(row) ## do whatever you want with your data
Happy coding.
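One caveat with the snippet above: a sheet's `sheetId` is not guaranteed to match the number in its `sheetN.xml` file name. The workbook instead links each sheet through an `r:id` that is resolved in `xl/_rels/workbook.xml.rels`. The sketch below shows that lookup using only the standard library. It targets Python 3, the function name `sheet_part_path` and the demo archive are hypothetical, and the tiny in-memory workbook contains just enough XML to exercise the lookup.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
DOC_REL_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def sheet_part_path(archive, sheet_name):
    """Return the archive path of the worksheet XML for sheet_name, or None."""
    workbook = ET.fromstring(archive.read("xl/workbook.xml"))
    rels = ET.fromstring(archive.read("xl/_rels/workbook.xml.rels"))
    # Map relationship ids (e.g. "rId1") to their target parts.
    targets = {rel.get("Id"): rel.get("Target")
               for rel in rels.iter("{%s}Relationship" % PKG_REL_NS)}
    for sheet in workbook.iter("{%s}sheet" % MAIN_NS):
        if sheet.get("name") == sheet_name:
            target = targets.get(sheet.get("{%s}id" % DOC_REL_NS))
            if target is None:
                return None
            if target.startswith("/"):
                # Absolute target: relative to the archive root.
                return target.lstrip("/")
            # Relative target: resolved against the xl/ folder.
            return posixpath.normpath(posixpath.join("xl", target))
    return None


# Minimal in-memory archive: sheetId is 7, but the part is sheet1.xml.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(
        "xl/workbook.xml",
        '<workbook xmlns="%s" xmlns:r="%s"><sheets>'
        '<sheet name="Data" sheetId="7" r:id="rId1"/>'
        '</sheets></workbook>' % (MAIN_NS, DOC_REL_NS))
    zf.writestr(
        "xl/_rels/workbook.xml.rels",
        '<Relationships xmlns="%s">'
        '<Relationship Id="rId1" Type="%s/worksheet" '
        'Target="worksheets/sheet1.xml"/>'
        '</Relationships>' % (PKG_REL_NS, DOC_REL_NS))
buf.seek(0)
demo = zipfile.ZipFile(buf)
print(sheet_part_path(demo, "Data"))
```

The returned path can be passed to `ZipFile.open` in the same way as `path[1]` in the answer's loop.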