I am trying to open a txt file with 4,605,227 rows (305 MB).
What I have done so far is:
import numpy as np
import pandas as pd

data = np.loadtxt('file.txt', delimiter='\t', dtype=str, skiprows=1)
df = pd.DataFrame(data, columns=["a", "b", "c", "d", "e", "f", "g", "h", "i"])
df = df.astype(dtype={"a": "int64", "h": "int64", "i": "int64"})
But it uses up about 10 GB of available RAM and still doesn't finish. Is there a faster way to read this txt file and create a pandas DataFrame?
Thanks!
Edit: Solved now, thanks. Why is np.loadtxt() so slow?
Answer 0 (score: 1)
Rather than reading it with numpy, read it directly into a Pandas DataFrame using the pandas.read_csv function. For example:
df = pd.read_csv('file.txt', delimiter='\t', skiprows=1,
                 names=["a", "b", "c", "d", "e", "f", "g", "h", "i"])
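If memory is still a concern, read_csv can also parse the file in chunks; a minimal sketch, assuming the same tab-separated file and a–i column names as in the question:

import pandas as pd

# Parse the file in manageable chunks to keep peak memory low
# (file name, separator, and column names are assumptions from the question).
chunks = pd.read_csv('file.txt', delimiter='\t', skiprows=1,
                     names=["a", "b", "c", "d", "e", "f", "g", "h", "i"],
                     dtype={"a": "int64", "h": "int64", "i": "int64"},
                     chunksize=500_000)
df = pd.concat(chunks, ignore_index=True)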
Answer 1 (score: 0)
Method 1:
You can read the file in chunks: readlines accepts a buffer-size hint in bytes, so each call returns roughly that many bytes' worth of complete lines. A sketch of the loop, with an example of the processing logic after it:
BUFFER_SIZE = 1024 * 1024  # read roughly 1 MB of lines per call
with open('inputTextFile', 'r') as input_file:
    buffer_lines = input_file.readlines(BUFFER_SIZE)
    while buffer_lines:
        # process the current chunk of lines here
        buffer_lines = input_file.readlines(BUFFER_SIZE)
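As one sketch of what the processing logic might look like (the file name and tab-separated layout are assumptions carried over from the question), each buffered line can be split into fields:

# Split each buffered line into tab-separated fields
# (file name and layout are assumptions from the question).
rows = []
with open('inputTextFile', 'r') as input_file:
    buffer_lines = input_file.readlines(1024 * 1024)
    while buffer_lines:
        for line in buffer_lines:
            rows.append(line.rstrip('\n').split('\t'))
        buffer_lines = input_file.readlines(1024 * 1024)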
Method 2:
You can also use the mmap module; the example below illustrates its usage.
import mmap

with open("hello.txt", "r+b") as f:
    # memory-map the file; size 0 means the whole file
    mm = mmap.mmap(f.fileno(), 0)
    # read content via standard file methods
    print(mm.readline())  # prints b"Hello Python!\n"
    # read content via slice notation
    print(mm[:5])  # prints b"Hello"
    # update content using slice notation;
    # note that new content must have the same size
    mm[6:] = b" world!\n"
    # ... and read again using standard file methods
    mm.seek(0)
    print(mm.readline())  # prints b"Hello world!\n"
    # close the map
    mm.close()
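A memory-mapped file is file-like, so for the original problem it can be handed to pandas directly; a minimal sketch (file name, separator, and column names are assumptions from the question):

import mmap
import pandas as pd

# Map the file read-only and let pandas parse the mapped bytes
# (file name, separator, and column names are assumptions from the question).
with open('file.txt', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    df = pd.read_csv(mm, delimiter='\t', skiprows=1,
                     names=["a", "b", "c", "d", "e", "f", "g", "h", "i"])
    mm.close()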
Answer 2 (score: 0)
You can read it directly as a Pandas DataFrame. For example:
import pandas as pd
pd.read_csv(path)
If you want to read it faster, you can use modin:
import modin.pandas as pd
pd.read_csv(path)
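Modin dispatches the parsing to a parallel engine, so one has to be available; a setup sketch, assuming the Ray engine (one of the engines Modin supports):

import os
os.environ["MODIN_ENGINE"] = "ray"  # choose the engine before importing modin.pandas

import modin.pandas as pd
df = pd.read_csv('file.txt', delimiter='\t')  # the file from the question, as an example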