Question

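The program I'm working on needs to read in data files, which are ASCII and may be quite large (up to 5 GB). The format can vary, which is why I came up with the following approach: use readline(), split every line to get just the pure entries, append them all to one big list of strings, divide that list into smaller lists of strings depending on the occurrence of certain marker words, and then pass the data to the program's internal data structure for further unified processing. This works well enough, except that it needs far more memory than I expected, and I don't know why.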

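So I wrote this little test case to illustrate my problem. The input data here is the text of Shakespeare's Romeo and Juliet (in practice I expect mixed alphabetical-numerical input). Note that I'd like you to copy the data in yourself, to keep this post readable. The script generates a .txt file and then reads it back in. The original memory size in this case is 153 KB. Reading that file with ...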

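f.read() also gives you a single string with a size of 153 KB.
f.readlines() gives you a list with one string per line, with a total size of 420 KB.
Splitting the line strings from f.readlines() at every whitespace and saving all of those single entries in a new list results in a memory usage of 1619 KB.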

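While these numbers don't look like a problem in this case, an increase in RAM requirements by a factor of more than 10 definitely is one for input data on the order of GB.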

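I have no idea why this happens or how to avoid it. As I understand it, a list is just a structure of pointers to all the values stored in it (which is also why sys.getsizeof() on a list gives you "wrong" results). As for the values themselves, it shouldn't make a difference in memory whether I have "LONG STRING" or "LONG" + "STRING", since both use the same characters, which should result in the same number of bits/bytes.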
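That last assumption can be checked directly. The following is only a minimal sketch, assuming CPython: it compares sys.getsizeof() of one string against the sum over its split parts. Each str object carries a fixed per-object header on top of its characters, so the split version comes out larger. The exact byte counts depend on the interpreter version and platform, so none are hard-coded here.

```python
import sys

whole = "LONG STRING"
parts = whole.split()  # two separate str objects: "LONG" and "STRING"

# Size of the single string object (header + characters).
size_whole = sys.getsizeof(whole)

# Combined size of the two smaller string objects; each pays its own header.
size_parts = sum(sys.getsizeof(p) for p in parts)

# Size of the list container itself (its pointer array), excluding the strings.
size_list = sys.getsizeof(parts)

print("one string      :", size_whole, "bytes")
print("two strings     :", size_parts, "bytes")
print("list (container):", size_list, "bytes")
```

With many short tokens, the per-object header and the list's pointer per entry can easily outweigh the characters themselves.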

Maybe the answer is really simple, but I'm really stuck on this problem, so I'm thankful for every idea.

# step1: http://shakespeare.mit.edu/romeo_juliet/full.html
# step2: Ctrl+A and then Ctrl+C
# step3: Ctrl+V after benchmarkText

benchmarkText = """ >>INSERT ASCII DATA HERE<< """

#=== import modules =======================================
from pympler import asizeof
import sys

#=== open files and save data to a structure ==============
#-- original memory size
print("\n\nAll memory sizes are in KB:\n")
print("Original string size:")
print(asizeof.asizeof(benchmarkText)/1e3)
print(sys.getsizeof(benchmarkText)/1e3)

#--- write benchmark file
with open('benchMarkText.txt', 'w') as f:
    f.write(benchmarkText)

#--- read the whole file (should always be equal to original size)
with open('benchMarkText.txt', 'r') as f:
    # read the whole file as one string
    wholeFileString = f.read()    
    # check size:
    print("\nSize using f.read():")
    print(asizeof.asizeof(wholeFileString)/1e3)

#--- read the file in a list
listOfWordOrNumberStrings = []
with open('benchMarkText.txt', 'r') as f:
    # save every line of the file
    listOfLineStrings = f.readlines()
    print("\nSize using f.readlines():")
    print(asizeof.asizeof(listOfLineStrings)/1e3)

    # split every line into words or punctuation marks
    for stringLine in listOfLineStrings:
        line = stringLine.rstrip('\n')  # drop the trailing '\n' (the last line may not have one)
        # line = re.sub('"', '', line)  # needed in the final implementation (requires `import re`); irrelevant for this test case
        elemsInLine = line.split()
        for elem in elemsInLine:
            listOfWordOrNumberStrings.append(elem)
    # check size
    print("\nSize after splitting:")
    print(asizeof.asizeof(listOfWordOrNumberStrings)/1e3)

(I know that I use readlines() here instead of readline(). I changed it for this test case because I think it makes things easier to understand.)
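For comparison, a line-by-line variant might look roughly like the sketch below. The helper name iter_tokens and the small demo file are made up for illustration. It avoids holding the full list of lines in memory, but it does not reduce the per-token overhead if every token is still stored in a list afterwards.

```python
import os
import tempfile


def iter_tokens(path):
    """Yield whitespace-separated tokens one at a time, reading the file lazily."""
    with open(path, "r") as f:
        for line in f:  # a file object yields one line per iteration
            yield from line.split()  # split() also discards the trailing '\n'


# Small stand-in for benchMarkText.txt so the sketch runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Two households, both alike\nin dignity 42\n")
    demo_path = tmp.name

tokens = list(iter_tokens(demo_path))
os.remove(demo_path)
print(tokens)
```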

How can I store the read-in data element by element in a memory-efficient way?

0 answers: