Python - Pandas runs out of memory when reading a large file

Date: 2018-06-12 15:24:29

Tags: python-3.x pandas numpy

I am running the script below with Python 3.6 on a 6 GB file containing more than 20 million rows.

I have 8 GB of RAM.

I get a MemoryError.

The script works fine on smaller files of the same format.

import pandas as pd
import numpy as np



cols_to_read = ['CashAccountReference', 'SecurityIdentifier', 'SignOfMovement', 'TransactionValue']
filename = 'C:/Users/User/Desktop/1/Data.csv'
data = pd.read_csv(filename, encoding='cp1252')
useful = np.array(data.get(cols_to_read))
B = np.array(data['CashAccountReference'])
# I = data['SecurityIdentifier']
K = round(data['SignOfMovement'], 4)
O = round(data['TransactionValue'], 4)

num = len(useful)

sums = {}
result = {}
for l in range(num):
    key = useful[l, 0]
    val = round(useful[l, 2] * useful[l, 3], 4)
    idstr = useful[l, 0]
    id = idstr[7:11]
    if sums.get(key) is None:
        sums[key] = val
    else:
        sums[key] = round(sums.get(key), 4) + round(val, 4)
    result[key] = [useful[l, 0], useful[l, 1], sums[key], id]

datalist = []
for j in result.keys():
    datalist.append(result.get(j))

dat = pd.DataFrame(datalist)
dat.to_csv('C:/Users/User/Desktop/1/output.csv', index=False)
print('done')
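One common way to shrink the initial read is to load only the columns you need and give them narrow dtypes, since read_csv defaults to int64/float64/object. A minimal sketch, using a hypothetical in-memory sample in place of Data.csv (the column names come from the question; the extra column, sample values, and dtype choices are assumptions):

```python
import io
import pandas as pd

# Hypothetical sample standing in for Data.csv; the real script would pass
# the file path and encoding='cp1252' instead of this StringIO buffer.
csv_text = (
    "CashAccountReference,SecurityIdentifier,SignOfMovement,TransactionValue,Unused\n"
    "ACC00011234,SEC1,1,10.5,x\n"
    "ACC00011234,SEC2,-1,3.25,y\n"
)

cols_to_read = ['CashAccountReference', 'SecurityIdentifier',
                'SignOfMovement', 'TransactionValue']

# usecols drops unneeded columns at parse time; explicit narrow dtypes shrink
# the rest (int8 instead of int64, float32 instead of float64, and category
# for strings that repeat across many rows).
data = pd.read_csv(
    io.StringIO(csv_text),
    usecols=cols_to_read,
    dtype={'CashAccountReference': 'category',
           'SecurityIdentifier': 'category',
           'SignOfMovement': 'int8',
           'TransactionValue': 'float32'},
)

print(data.dtypes)
print(data.memory_usage(deep=True).sum(), 'bytes')
```

On a file where most columns are numeric or highly repetitive strings, this alone can cut the in-memory footprint several-fold compared with the defaults.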

After reading up on the error I also created another version of this script, but it is extremely slow: it has been running for 8 hours and I have no idea whether it is actually doing anything:

import pandas as pd
import numpy as np



cols_to_read = ['CashAccountReference', 'SecurityIdentifier', 'SignOfMovement', 'TransactionValue']
filename = 'C:/Users/k.ahmed/Desktop/1/cashTransactions.csv'
data1 = []
for chunk in pd.read_csv(filename, encoding='cp1252', chunksize=20000):
    data1.append(chunk)

data = pd.concat(data1, axis=0)
del data1
useful = np.array(data.get(cols_to_read))
B = np.array(data['CashAccountReference'])
# I = data['SecurityIdentifier']
K = round(data['SignOfMovement'], 4)
O = round(data['TransactionValue'], 4)

num = len(useful)

sums = {}
result = {}
for l in range(num):
    key = useful[l, 0]
    val = round(useful[l, 2] * useful[l, 3], 4)
    idstr = useful[l, 0]
    id = idstr[7:11]
    if sums.get(key) is None:
        sums[key] = val
    else:
        sums[key] = round(sums.get(key), 4) + round(val, 4)
    result[key] = [useful[l, 0], useful[l, 1], sums[key], id]

datalist = []
for j in result.keys():
    datalist.append(result.get(j))

dat = pd.DataFrame(datalist)
dat.to_csv('C:/Users/k.ahmed/Desktop/1/MPCash.csv', index=False)
print('done')
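Note that the script above appends every chunk to a list and then calls pd.concat, which rebuilds the full 6 GB DataFrame in memory, so the chunked read saves nothing. A more memory-friendly pattern aggregates inside each chunk and keeps only the small per-chunk results. A sketch, with hypothetical sample data and a tiny chunksize standing in for cashTransactions.csv (column names from the question; the sample rows are invented):

```python
import io
import pandas as pd

cols_to_read = ['CashAccountReference', 'SecurityIdentifier',
                'SignOfMovement', 'TransactionValue']

# Hypothetical sample standing in for cashTransactions.csv.
csv_text = (
    "CashAccountReference,SecurityIdentifier,SignOfMovement,TransactionValue\n"
    "ACC00011234,SEC1,1,10.5\n"
    "ACC00011234,SEC2,-1,3.25\n"
    "ACC00021234,SEC3,1,7.0\n"
)

partials = []
# Aggregate each chunk on its own; only one small per-chunk Series is kept,
# never the whole file. On the real file chunksize would be much larger.
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=cols_to_read,
                         chunksize=2):
    chunk['val'] = (chunk['SignOfMovement'] * chunk['TransactionValue']).round(4)
    partials.append(chunk.groupby('CashAccountReference')['val'].sum())

# Combine the per-chunk sums; a key that was split across chunks is
# summed again here.
sums = pd.concat(partials).groupby(level=0).sum().round(4)
print(sums)
```

The per-chunk groupby also replaces the row-by-row dict loop, which iterates 20 million times in pure Python; vectorized aggregation does the same work in compiled code.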

How can I process this large file efficiently, and what am I doing wrong in my code?

0 Answers:

No answers yet.