I'm interested in using openpyxl to collect key metrics about a large dataset I have. The two things I care about are cardinality and field significance (i.e. how many "null" or garbage values a field contains). I'm running into performance problems and would like to know whether my code can be optimized in any way. My largest Excel file has about 20,000 rows. I'm aware of openpyxl's optimized (read-only) reader, but I need to visit every cell and get its value.

My script reads data from a large xlsx file and writes information about each field to a Google doc.
    from openpyxl import load_workbook

    def run(table, limit_percent_null):
        excel_workbook = load_workbook(filename=settings.mypath + table + '.xlsx', read_only=True)
        excel_sheet = excel_workbook.worksheets[0]
        d = dict()
        # first loop through our fields
        for i in range(1, excel_sheet.max_column + 1):
            key = excel_sheet.cell(row=1, column=i).value
            if key is None:
                break
            # key is the field name; the value is a list of booleans:
            # True = null or empty, False = has an actual value
            d[key] = []
            # now loop through the actual values of those fields
            for j in range(2, excel_sheet.max_row + 1):
                field = excel_sheet.cell(row=j, column=i).value
                # does the field contain "null", or is it empty?
                if field is None:
                    d[key].append(True)
                else:
                    d[key].append("null" in str(field))
        # write to google doc
        google_sheet = settings.open_gspread_connetion(table)
        for key, value in d.items():
            # omitted
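For context on what I have tried to rework: my understanding is that in read-only mode each `cell(row=..., column=...)` lookup re-parses the sheet, so the nested loops above do far more work than a single row-by-row pass. Below is a minimal sketch of the same null/"null" bookkeeping written so it consumes each row exactly once; `profile_rows` is a hypothetical helper name, and the idea is to feed it `excel_sheet.iter_rows(values_only=True)` from a read-only workbook (assuming a reasonably recent openpyxl that supports `values_only`):

```python
def profile_rows(rows):
    """Build {field: [bool, ...]} from an iterable of row tuples.

    The first row is treated as the header; a True entry marks a cell
    that is empty or contains "null". Mirrors the dict built in run(),
    but reads each row once instead of addressing cells by coordinate.
    """
    row_iter = iter(rows)
    header = next(row_iter)

    # Like the original code, stop at the first empty header cell.
    fields = []
    for h in header:
        if h is None:
            break
        fields.append(h)

    d = {key: [] for key in fields}
    for row in row_iter:
        # zip truncates to the header width, ignoring trailing columns
        for key, value in zip(fields, row):
            d[key].append(value is None or "null" in str(value))
    return d
```

Cardinality would then fall out of the same pass by also collecting values, e.g. `len(set(column_values))` per field, rather than re-reading the sheet.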