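I'm working on an application that processes large Excel 2007 files, and I'm using OpenPyXL to do it. OpenPyXL has two different ways of reading an Excel file: a "normal" method that loads the entire document into memory at once, and a method that uses iterators to read it row by row.
The problem is that with the iterator method I don't get any document metadata, such as column widths and the row/column count, and I really need this data. I assume it is stored near the top of the Excel document, so it shouldn't be necessary to load the whole 10 MB file into memory to get at it.
So, is there a way to get the row/column count and the column widths without loading the entire document into memory first?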
Answer 0 (score: 16)
Looking at the OpenPyXL source code (IterableWorksheet), I worked out how to get the column and row counts from an iterator worksheet:
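from openpyxl import load_workbook

# use_iterators=True was the iterator (lazy) mode of the OpenPyXL version used here.
wb = load_workbook(path, use_iterators=True)
sheet = wb.worksheets[0]
row_count = sheet.get_highest_row() - 1
# letter_to_index is the helper defined further down in this answer.
column_count = letter_to_index(sheet.get_highest_column()) + 1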
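IterableWorksheet.get_highest_column returns a string containing the column letter you see in Excel, e.g. "A", "B", "C" and so on. So I also wrote a function to convert the column letter to a zero-based index: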
def letter_to_index(letter):
    """Converts a column letter, e.g. "A", "B", "AA", "BC" etc. to a zero based
    column index.

    A becomes 0, B becomes 1, Z becomes 25, AA becomes 26 etc.

    Args:
        letter (str): The column index letter.
    Returns:
        The column index as an integer.
    """
    letter = letter.upper()
    result = 0
    for index, char in enumerate(reversed(letter)):
        # Get the ASCII number of the letter and subtract 64 so that A
        # corresponds to 1.
        num = ord(char) - 64
        # Multiply the number with 26 to the power of `index` to get the correct
        # value of the letter based on its index in the string.
        final_num = (26 ** index) * num
        result += final_num
    # Subtract 1 from the result to make it zero-based before returning.
    return result - 1
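As a quick sanity check of the conversion above, here is a minimal sketch that computes the same mapping with a running base-26 accumulator and adds an inverse. `index_to_letter` is a hypothetical helper that is not part of the original answer; it is only there to show that the mapping round-trips.

```python
def letter_to_index(letter):
    """Convert a column letter such as "A" or "AA" to a zero-based index."""
    result = 0
    for char in letter.upper():
        # Treat the letters as base-26 digits where A=1 ... Z=26.
        result = result * 26 + (ord(char) - ord("A") + 1)
    return result - 1


def index_to_letter(index):
    """Hypothetical inverse: zero-based index back to column letters."""
    index += 1  # switch to the one-based numbering Excel uses
    letters = ""
    while index > 0:
        # Subtract 1 first because there is no "zero" digit in this system.
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


for name in ("A", "B", "Z", "AA", "BC"):
    idx = letter_to_index(name)
    print(name, idx, index_to_letter(idx))
```

The loop prints each letter, its zero-based index, and the letter recovered from that index, so any mismatch in the round trip is easy to spot.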
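I still haven't figured out how to get the column sizes, so I decided to use a fixed-width font in my application and scale the columns automatically.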
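Answer 1 (score: 2)
This may be overly complicated, and I may be missing the obvious, but since OpenPyXL does not populate column_dimensions in an iterable worksheet (see the comments above), the only way I can see to find the column sizes without loading everything is to parse the XML directly: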
from xml.etree.ElementTree import iterparse
from openpyxl import load_workbook

wb = load_workbook("/path/to/workbook.xlsx", use_iterators=True)
ws = wb.worksheets[0]
# _xml_source is a private attribute of the iterable worksheet.
xml = ws._xml_source
xml.seek(0)

for _, x in iterparse(xml):
    # Strip the XML namespace, e.g. "{...}col" becomes "col".
    name = x.tag.split("}")[-1]
    if name == "col":
        print("Column %(max)s: Width: %(width)s" % x.attrib)  # width = x.attrib["width"]
    if name == "cols":
        print("break before reading the rest of the file")
        break
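The same idea can be tried without touching OpenPyXL internals, since an .xlsx file is a zip archive of XML parts. The following is only a sketch under assumptions: `read_sheet_metadata` and `dimension_to_counts` are hypothetical helper names, the member path `xl/worksheets/sheet1.xml` is the usual location of the first sheet but is not guaranteed (the real mapping lives in the workbook part and its relationships), and the `<dimension>` element is optional, so whether it is present depends on the program that wrote the file.

```python
import re
import zipfile
from xml.etree.ElementTree import iterparse


def read_sheet_metadata(xlsx_path, sheet_member="xl/worksheets/sheet1.xml"):
    """Return (dimension_ref, {(min_col, max_col): width}) for one sheet.

    `sheet_member` is an assumed archive path, see the note above.
    """
    dimension = None
    widths = {}
    with zipfile.ZipFile(xlsx_path) as archive:
        with archive.open(sheet_member) as xml:
            # "start" events fire as soon as an opening tag (with its
            # attributes) has been read, so no row data needs to be parsed.
            for _, element in iterparse(xml, events=("start",)):
                name = element.tag.split("}")[-1]
                if name == "dimension":
                    dimension = element.attrib.get("ref")
                elif name == "col" and "width" in element.attrib:
                    key = (int(element.attrib["min"]), int(element.attrib["max"]))
                    widths[key] = float(element.attrib["width"])
                elif name == "sheetData":
                    # The cell data starts here; stop before reading it.
                    break
    return dimension, widths


def _column_number(letters):
    """One-based column number for letters such as "A" or "AA"."""
    number = 0
    for char in letters:
        number = number * 26 + (ord(char) - ord("A") + 1)
    return number


def dimension_to_counts(ref):
    """Turn a ref like "A1:C10" (or "A1") into (row_count, column_count)."""
    match = re.fullmatch(r"([A-Z]+)(\d+)(?::([A-Z]+)(\d+))?", ref)
    if match is None:
        raise ValueError("unexpected dimension reference: %r" % ref)
    first_col, first_row, last_col, last_row = match.groups()
    last_col = last_col or first_col
    last_row = last_row or first_row
    rows = int(last_row) - int(first_row) + 1
    cols = _column_number(last_col) - _column_number(first_col) + 1
    return rows, cols
```

Calling `read_sheet_metadata("/path/to/workbook.xlsx")` returns the dimension reference (or None if the writer omitted it) and the widths keyed by the one-based column range each `<col>` element covers; `dimension_to_counts` then converts the reference into row and column counts.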
Answer 2 (score: 1)
Python 3
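import openpyxl as xl

# read_only=True reads the sheet lazily instead of loading it all at once.
wb = xl.load_workbook("Sample.xlsx", read_only=True)
# The two lines below do the same thing (newer OpenPyXL prefers wb['sheet']).
sheet = wb.get_sheet_by_name('sheet')
sheet = wb.worksheets[0]
row_count = sheet.max_row
column_count = sheet.max_column
# This works for me.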
Answer 3 (score: 1)
An option using pandas.
import pandas as pd

xl = pd.ExcelFile('file.xlsx')
sheetnames = xl.sheet_names
for sheet in sheetnames:
    df = xl.parse(sheet)
    # df.shape is (rows, columns); the header row is not counted.
    dimensions = df.shape
    print(sheet, '-->', dimensions)
For the first sheet only:
import pandas as pd

xl = pd.ExcelFile('file.xlsx')
sheetnames = xl.sheet_names
df = xl.parse(sheetnames[0])  # [0] gets the first tab/sheet.
dimensions = df.shape
print(f'sheetname: "{sheetnames[0]}" --> {dimensions}')
Output: sheetname: "Sheet1" --> (row count, column count)
Answer 4 (score: 0)
https://pythonhosted.org/pyexcel/iapi/pyexcel.sheets.Sheet.html See row_range(), a utility function that returns the row range.
If you use pyexcel, you can call row_range to get the maximum number of rows.
Tested with Python 3.4.