I'm interested in using openpyxl to collect key metrics about a large dataset I have. The two things I care about are cardinality and field significance (i.e. how many "null" or garbage values a field contains). I'm running into performance problems and would like to know whether my code can be optimized in any way. My largest Excel file has about 20,000 rows. I'm aware of openpyxl's optimized (read-only) reader, but I need to visit every cell and get its value.

My script reads data from a large xlsx file and writes information about each field to a Google doc.
    from openpyxl import load_workbook

    def run(table, limit_percent_null):
        excel_workbook = load_workbook(filename=settings.mypath + table + '.xlsx', read_only=True)
        excel_sheet = excel_workbook.worksheets[0]
        d = dict()
        # first loop through our fields
        for i in range(1, excel_sheet.max_column + 1):
            key = excel_sheet.cell(row=1, column=i).value
            if key is None:
                break
            # key is the field name; the value is a list of booleans:
            # True = null or empty, False = has an actual value
            d[key] = []
            # now loop through the actual values of those fields
            for j in range(2, excel_sheet.max_row + 1):
                field = excel_sheet.cell(row=j, column=i).value
                # does the field contain "null", or is it empty?
                if field is None:
                    d[key].append(True)
                else:
                    d[key].append("null" in str(field))
        # write to google doc
        google_sheet = settings.open_gspread_connetion(table)
        for key, value in d.items():
            # omitted
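For context on what I have tried to rework: my understanding is that in read-only mode each `cell(row=..., column=...)` lookup re-parses the sheet, so the nested loops above do far more work than a single row-by-row pass. Below is a minimal sketch of the same null/"null" bookkeeping written so it consumes each row exactly once; `profile_rows` is a hypothetical helper name, and the idea is to feed it `excel_sheet.iter_rows(values_only=True)` from a read-only workbook (assuming a reasonably recent openpyxl that supports `values_only`):

```python
def profile_rows(rows):
    """Build {field: [bool, ...]} from an iterable of row tuples.

    The first row is treated as the header; a True entry marks a cell
    that is empty or contains "null". Mirrors the dict built in run(),
    but reads each row once instead of addressing cells by coordinate.
    """
    row_iter = iter(rows)
    header = next(row_iter)

    # Like the original code, stop at the first empty header cell.
    fields = []
    for h in header:
        if h is None:
            break
        fields.append(h)

    d = {key: [] for key in fields}
    for row in row_iter:
        # zip truncates to the header width, ignoring trailing columns
        for key, value in zip(fields, row):
            d[key].append(value is None or "null" in str(value))
    return d
```

Cardinality would then fall out of the same pass by also collecting values, e.g. `len(set(column_values))` per field, rather than re-reading the sheet.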