Efficient way to read a large txt file in Python

Date: 2019-11-14 15:25:52

Tags: python python-3.x pandas numpy

I am trying to open a txt file with 4,605,227 rows (305 MB).

What I have done before is:

data = np.loadtxt('file.txt', delimiter='\t', dtype=str, skiprows=1)

df = pd.DataFrame(data, columns=["a", "b", "c", "d", "e", "f", "g", "h", "i"])

df = df.astype(dtype={"a": "int64", "h": "int64", "i": "int64"})

But it used up about 10 GB of available RAM and still hadn't finished. Is there a faster way to read this txt file and build a pandas DataFrame?

Thanks!

Edit: solved now, thanks. Why is np.loadtxt() so slow?

3 Answers:

Answer 0 (score: 1)

Instead of reading it with numpy, read it directly as a Pandas DataFrame, for example with the pandas.read_csv function:

df = pd.read_csv('file.txt', delimiter='\t', usecols=["a", "b", "c", "d", "e", "f", "g", "h", "i"])
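For the file in the question specifically, a minimal sketch along these lines (a tiny in-memory sample stands in for the real 305 MB file, and `skiprows=1` with explicit `names` mirrors the original `np.loadtxt` call; passing `dtype` up front avoids the separate `astype()` pass and reduces peak memory):

```python
import io
import pandas as pd

# Tiny stand-in for the 305 MB tab-separated file.txt:
# one header line (skipped, as in the question) plus two data rows.
sample = ("h1\th2\th3\th4\th5\th6\th7\th8\th9\n"
          "1\tx\tx\tx\tx\tx\tx\t2\t3\n"
          "4\tx\tx\tx\tx\tx\tx\t5\t6\n")

# Replace io.StringIO(sample) with "file.txt" for the real file.
df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    skiprows=1,
    header=None,
    names=["a", "b", "c", "d", "e", "f", "g", "h", "i"],
    dtype={"a": "int64", "h": "int64", "i": "int64"},
)
print(df.dtypes["a"], len(df))  # int64 2
```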

Answer 1 (score: 0)

Method 1:

You can read the file in chunks: readlines accepts a size hint, which caps roughly how many bytes' worth of lines are read per call.

BUFFER_SIZE = 2 ** 20  # read roughly 1 MB worth of lines per call

with open('inputTextFile', 'r') as input_file:
    buffer_lines = input_file.readlines(BUFFER_SIZE)
    while buffer_lines:
        # process the current chunk of lines here
        buffer_lines = input_file.readlines(BUFFER_SIZE)
Method 2:

You can also use the mmap module; the link below illustrates its usage.

import mmap

with open("hello.txt", "r+b") as f:
    # memory-map the file, size 0 means whole file
    mm = mmap.mmap(f.fileno(), 0)
    # read content via standard file methods
    print(mm.readline())  # prints b"Hello Python!\n"
    # read content via slice notation
    print(mm[:5])  # prints b"Hello"
    # update content using slice notation;
    # note that new content must have same size
    mm[6:] = b" world!\n"
    # ... and read again using standard file methods
    mm.seek(0)
    print(mm.readline())  # prints b"Hello  world!\n"
    # close the map
    mm.close()

https://docs.python.org/3/library/mmap.html
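For the use case in the question, here is a minimal sketch of scanning a large file line by line through mmap (the file path and contents are stand-ins; the OS pages the file in on demand, so even very large files can be scanned without loading them fully into memory):

```python
import mmap
import os
import tempfile

# Stand-in file so the sketch is runnable; substitute the real path.
path = os.path.join(tempfile.mkdtemp(), "big.txt")
with open(path, "wb") as f:
    f.write(b"row1\nrow2\nrow3\n")

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)  # size 0 maps the whole file
    lines = 0
    # mm.readline() returns b"" at end of file, ending the loop
    for line in iter(mm.readline, b""):
        lines += 1  # replace with real per-line logic
    mm.close()

print(lines)  # 3
```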

Answer 2 (score: 0)

Read it directly as a Pandas DataFrame. For example:

import pandas as pd
pd.read_csv(path)

If you want faster reads, you can use Modin:

import modin.pandas as pd
pd.read_csv(path)

https://github.com/modin-project/modin