Question

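I'm using openpyxl in Python, and I'm trying to run through 50k lines, grab data from each row, and place it into a file. However, what I'm finding is that it runs slower and slower the further I get into it. The first 1k lines go super fast, less than a minute, but after that each next 1k lines takes longer and longer.

I'm opening an .xlsx file. I wonder whether it would be faster to open a .txt file as CSV or something, or to read a JSON file? Or to convert it somehow to something that reads faster?

I have 20 unique values in a given column, and the values that go with each of them are random. I'm trying to build, for each unique value, one string containing all of its values.

Value1: 1243,345,34,124, Value2: 1243,345,34,124, etc.

I'm running through the list of values and checking whether a file with that name exists. If it does, the code accesses that file and appends the new value to it; if the file doesn't exist, it creates the file and then sets it to append. I have a dictionary that holds all of the "append write file" objects, so any time I want to write something, it takes the file name, looks up the matching append handle in the dict, and writes to that file, so it doesn't keep opening new files on every pass.

The first 1k took less than a minute... now I'm on records 4k to 5k, and it has already been running for 5 minutes... it seems to take longer the further it gets into the records, and I wonder how to speed it up. It isn't printing to the console at all.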

writeFile = 1
theDict = {}

for row in ws.iter_rows(rowRange):
    for cell in row:
        #grabbing the value
        theStringValueLocation = "B" + str(counter)
        theValue = ws[theStringValueLocation].value
        theName = cell.value
        textfilename = theName + ".txt"

        if os.path.isfile(textfilename):
            listToAddTo = theDict[theName]
            listToAddTo.write("," + theValue)
            if counter == 1000:
                print "1000"
                st = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')

        else:
            writeFileName = open(textfilename, 'w')
            writeFileName.write(theValue)
            writeFileName = open(textfilename, 'a')
            theDict[theName] = writeFileName
        counter = counter + 1

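I added some timestamps to the code above (they aren't shown there), but you can see the output below. The problem I'm seeing is that every additional 1k takes longer and longer: 2 minutes the first time, then 3 minutes, then 5 minutes, then 7 minutes. By the time it reaches 50k, I'm worried it will take an hour or something, which is far too long.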

1000
2016-02-25 15:15:08
2000
2016-02-25 15:17:07
3000
2016-02-25 15:20:52
2016-02-25 15:25:28
4000
2016-02-25 15:32:00
5000
2016-02-25 15:40:02
6000
2016-02-25 15:51:34
7000
2016-02-25 16:03:29
8000
2016-02-25 16:18:52
9000
2016-02-25 16:35:30
10000

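Something I should have made clear... I don't know the names of the values ahead of time. Maybe I should run through the sheet and grab them in a separate Python script to make this go faster?

Second, I need a string of all the values separated by commas, which is why I put them into a text file to grab later. I was thinking of doing it with a list, as was suggested to me, but I wonder whether that would have the same problem. I think the problem has to do with reading from Excel. Anyway, if I can get a comma-separated string out of it some other way, I can do it that way.

Or maybe I could use try/catch instead of searching for the file every time, and if there's an error, assume I need to create a new file? Maybe checking whether the file exists on every lookup is what makes it so slow?

This question is a continuation of my original one, where I got some suggestions.... What is the fastest performance tuple for large data sets in python?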

Answer 1

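I think what you're trying to do is get a key out of column B of the row and use it for the name of the file to append to. Let's speed it up a lot: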

from collections import defaultdict
Value_entries = defaultdict(list) # dict of lists of row data

for row in ws.iter_rows(rowRange):
    key = row[1].value

    Value_entries[key].extend([cell.value for cell in row])

# All done. Now write files:
for key, values in Value_entries.items():
    with open(str(key) + '.txt', 'w') as f:
        # cell values may be numbers, so convert each one before joining
        f.write(','.join(str(v) for v in values))

Answer 2

It looks like you only want the cells from column B. In that case you can use ws.get_squared_range() to limit the number of cells you look at.

for row in ws.get_squared_range(min_col=2, max_col=2, min_row=1, max_row=ws.max_row):
    for cell in row: # each row is always a sequence
        filename = cell.value
        if os.path.isfile(filename):
            …
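
If get_squared_range() is missing from your openpyxl version, the same restriction can be expressed through the min_col/max_col bounds of iter_rows(). A minimal sketch, assuming a release whose iter_rows() accepts values_only; the helper name column_values is made up for illustration and is not part of openpyxl:

```python
def column_values(ws, col=2):
    """Yield the value of every cell in one column (col=2 is column B).

    Assumes `ws` is an openpyxl worksheet whose iter_rows() accepts
    min_col/max_col/values_only. `column_values` is a hypothetical
    helper name, not an openpyxl function.
    """
    for row in ws.iter_rows(min_col=col, max_col=col, values_only=True):
        # with values_only=True each row is a tuple of plain values,
        # and here the tuple has exactly one element
        yield row[0]
```

With a workbook opened through openpyxl.load_workbook(path, read_only=True), something like list(column_values(wb.active)) should give you column B from top to bottom.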

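It's not clear what happens in the else branch of your code, but you should close any files you open as soon as you have finished with them.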
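
If you do want to keep one open handle per name, as the dictionary in the question does, one way to make sure every file gets closed is contextlib.ExitStack. This is a rough sketch rather than a fix of the asker's code: append_values, pairs and out_dir are made-up names, and the input is assumed to be an iterable of (name, value) pairs.

```python
import os
from contextlib import ExitStack


def append_values(pairs, out_dir="."):
    """Write each name's values, comma-separated, to <out_dir>/<name>.txt.

    `pairs` is assumed to be an iterable of (name, value) tuples.
    Each file is opened once; ExitStack closes all of them on exit,
    even if an exception is raised part-way through.
    """
    handles = {}  # name -> open file object
    with ExitStack() as stack:
        for name, value in pairs:
            f = handles.get(name)
            if f is None:
                # first time this name is seen: open once, register for cleanup
                path = os.path.join(out_dir, str(name) + ".txt")
                f = stack.enter_context(open(path, "w"))
                handles[name] = f
                f.write(str(value))
            else:
                f.write("," + str(value))
    return list(handles)
```

The dict lookup replaces the os.path.isfile() check on every cell. Keeping handles open like this only makes sense while the number of distinct names is small (the question mentions about 20); with many names, collecting values in memory and writing once at the end avoids the open-file limit.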

Answer 3

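Based on the other question you linked to and the code above, it appears you have a spreadsheet of name-value pairs. The names are in column A and the values are in column B. A name can appear multiple times in column A, and there can be a different value in column B each time. The goal is to create a list of all the values that show up for each name.

First, a few observations on the code above:

counter is never initialized. Presumably it is initialized to 1.
open(textfilename,...) is called twice without closing the file in between. Calling open allocates some memory to hold data related to operating on the file. The memory allocated for the first open call may not be freed until much later, possibly not until the program ends. It is better practice to close files when you are done with them (see using open as a context manager).
The looping logic isn't correct. Consider:

First iteration of the inner loop: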

for cell in row:                        # cell refers to A1
    valueLocation = "B" + str(counter)  # valueLocation is "B1"
    value = ws[valueLocation].value     # value gets contents of cell B1
    name = cell.value                   # name gets contents of cell A1
    textfilename = name + ".txt"
    ...
    # opens file with name based on contents of cell A1, and
    # writes value from cell B1 to the file
    ...
    counter = counter + 1                        # counter = 2

But each row has at least two cells, so on the second iteration of the inner loop:

for cell in row:                          # cell now refers to cell B1
    valueLocation = "B" + str(counter)    # valueLocation is "B2"
    value = ws[valueLocation].value       # value gets contents of cell B2
    name = cell.value                     # name gets contents of cell B1
    textfilename = name + ".txt"
    ...
    # opens file with name based on contents of cell "B1"  <<<< wrong file
    # writes the value of cell "B2" to the file            <<<< wrong value
    ...
    counter = counter + 1        # counter = 3 when cell B1 is processed

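Repeat for each of the 50K rows. Depending on the number of unique values in column B, the program could be trying to hold hundreds or thousands of files open (based on the contents of cells A1, B1, A2, B2, ...) ==>> very slow, or the program crashes.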
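
The drift is easy to reproduce without a spreadsheet. The sketch below replays the question's loop over a plain dict standing in for the sheet (the cell contents are made up) and records, at each step, which cell supplies the name and which cell supplies the value:

```python
# Stand-in for the worksheet: cell address -> contents (made-up data)
sheet = {"A1": "x", "B1": "10", "A2": "y", "B2": "20"}
rows = [("A1", "B1"), ("A2", "B2")]  # what iterating row by row yields

trace = []
counter = 1  # the question never shows this; assumed to start at 1
for row in rows:
    for cell in row:
        value_cell = "B" + str(counter)
        # value_cell can run past the last row, hence .get()
        trace.append((cell, value_cell, sheet[cell], sheet.get(value_cell)))
        counter = counter + 1

for step in trace:
    print(step)
```

Only the first step pairs a name with its own value. After that the counter advances twice per row, so the name cell and the value cell drift further apart with every row.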

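iter_rows() returns a tuple of the cells in the row.
As people suggested in the other question, use a dictionary and lists to store the values and write them all out at the end. Like so (I'm using Python 3.5, so you may have to adjust this if you are using 2.7)

Here is a straightforward solution: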

from collections import defaultdict

data = defaultdict(list)

# gather the values into lists associated with each name
# data will look like { 'name1':['value1', 'value42', ...],
#                       'name2':['value7', 'value23', ...],
#                       ...}
for row in ws.iter_rows():
    name = row[0].value
    value = row[1].value
    data[name].append(value)

for key, valuelist in data.items():
    # turn the list of values into one long comma-separated string
    # e.g., ['value1', 'value42', ...] => 'value1,value42,...'
    # str() guards against cells that hold numbers rather than text
    value = ",".join(str(v) for v in valuelist)

    with open(key + ".txt", "w") as f:
        f.write(value)
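
For anyone who wants to try the approach without a workbook at hand, here is a self-contained sketch of the same idea. It is not tied to openpyxl: group_values and write_groups are hypothetical helper names, and the input is assumed to be any iterable of (name, value) pairs, such as what ws.iter_rows(min_col=1, max_col=2, values_only=True) yields in recent openpyxl releases.

```python
import os
from collections import defaultdict


def group_values(pairs):
    """Collect the values for each name into a list of strings."""
    data = defaultdict(list)
    for name, value in pairs:
        if name is None:
            continue  # assumption: a blank name cell means an empty row
        data[name].append(str(value))  # str() so numeric cells can be joined
    return dict(data)


def write_groups(data, out_dir="."):
    """Write one <name>.txt per key holding its comma-separated values."""
    for name, values in data.items():
        path = os.path.join(out_dir, str(name) + ".txt")
        with open(path, "w") as f:  # closed automatically on leaving the block
            f.write(",".join(values))


if __name__ == "__main__":
    sample = [("a", 1243), ("b", 345), ("a", 34), (None, None), ("b", 124)]
    print(group_values(sample))
```

The sheet is read in a single pass and each output file is opened exactly once, so the cost per row stays flat instead of growing as the run goes on.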

Fastest way to run through a 50k-line Excel file in openpyxl
