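I am using Python 3.6 to run the script below on a 6 GB file that contains more than 20 million rows.
I have 8 GB of RAM.
I get a MemoryError.
The script works fine on smaller files of the same format.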

    import pandas as pd
    import numpy as np

    cols_to_read = ['CashAccountReference', 'SecurityIdentifier', 'SignOfMovement', 'TransactionValue']
    filename = 'C:/Users/User/Desktop/1/Data.csv'
    data = pd.read_csv(filename, encoding='cp1252')
    useful = np.array(data.get(cols_to_read))
    B = np.array(data['CashAccountReference'])
    #I - data['SecurityIdentifier']
    K = round(data['SignOfMovement'], 4)
    O = round(data['TransactionValue'], 4)
    num = len(useful)
    sums = {}
    result = {}
    for l in range(num):
        key = useful[l, 0]
        val = round(useful[l, 2] * useful[l, 3], 4)
        idstr = useful[l, 0]
        id = idstr[7:11]
        if sums.get(key) is None:
            sums[key] = val
        else:
            sums[key] = round(sums.get(key), 4) + round(val, 4)
        result[key] = [useful[l, 0], useful[l, 1], sums[key], id]
    datalist = []
    for j in result.keys():
        datalist.append(result.get(j))
    dat = pd.DataFrame(datalist)
    dat.to_csv('C:/Users/User/Desktop/1/output.csv', index=False)
    print('done')

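After reading up on the error, I also wrote another version of this script, but it is very slow: it has been running for 8 hours and I cannot tell whether it is actually doing anything: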

    import pandas as pd
    import numpy as np

    cols_to_read = ['CashAccountReference', 'SecurityIdentifier', 'SignOfMovement', 'TransactionValue']
    filename = 'C:/Users/k.ahmed/Desktop/1/cashTransactions.csv'
    data1 = []
    for chunk in pd.read_csv(filename, encoding='cp1252', chunksize=20000):
        data1.append(chunk)
    data = pd.concat(data1, axis=0)
    del data1
    useful = np.array(data.get(cols_to_read))
    B = np.array(data['CashAccountReference'])
    #I - data['SecurityIdentifier']
    K = round(data['SignOfMovement'], 4)
    O = round(data['TransactionValue'], 4)
    num = len(useful)
    sums = {}
    result = {}
    for l in range(num):
        key = useful[l, 0]
        val = round(useful[l, 2] * useful[l, 3], 4)
        idstr = useful[l, 0]
        id = idstr[7:11]
        if sums.get(key) is None:
            sums[key] = val
        else:
            sums[key] = round(sums.get(key), 4) + round(val, 4)
        result[key] = [useful[l, 0], useful[l, 1], sums[key], id]
    datalist = []
    for j in result.keys():
        datalist.append(result.get(j))
    dat = pd.DataFrame(datalist)
    dat.to_csv('C:/Users/k.ahmed/Desktop/1/MPCash.csv', index=False)
    print('done')

How can I process this large file efficiently, and what am I doing wrong in my code?
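
For reference, here is one direction that may help, offered as a rough sketch rather than a verified fix. Both scripts end up holding the whole file in memory: the first reads every column in one call, and the second reads in chunks but then concatenates all of them back into a single DataFrame. Converting mixed string and numeric columns with `np.array` also produces an object array, and looping over 20 million rows in Python is slow. The sketch below assumes the goal is one net total of `SignOfMovement * TransactionValue` per `CashAccountReference`, and that `CashAccountReference` is read as a string. The function name `summarise_transactions`, the output column names `NetValue` and `Id`, and the default chunk size are made up for illustration.

```python
import pandas as pd

COLS = ['CashAccountReference', 'SecurityIdentifier',
        'SignOfMovement', 'TransactionValue']


def summarise_transactions(path, chunksize=500000, encoding='cp1252'):
    """Aggregate per account without loading the whole file (hypothetical helper)."""
    how = {'SecurityIdentifier': 'last', 'NetValue': 'sum'}
    partials = []
    # usecols keeps only the four needed columns; chunksize yields the file
    # piece by piece instead of as one large DataFrame.
    reader = pd.read_csv(path, encoding=encoding, usecols=COLS,
                         chunksize=chunksize)
    for chunk in reader:
        # Vectorised equivalent of round(sign * value, 4) for every row.
        chunk['NetValue'] = (chunk['SignOfMovement']
                             * chunk['TransactionValue']).round(4)
        # Reduce each chunk to one row per account before keeping it.
        # Note: 'last' skips missing values, unlike the original loop.
        partials.append(
            chunk.groupby('CashAccountReference', sort=False).agg(how))
    # Only the small per-chunk summaries are combined here, never the raw rows.
    combined = pd.concat(partials)
    out = combined.groupby(level=0, sort=False).agg(how).reset_index()
    out['NetValue'] = out['NetValue'].round(4)
    # Same slice the original script takes from the account reference.
    out['Id'] = out['CashAccountReference'].str[7:11]
    return out
```

It could then be called as `summarise_transactions(filename).to_csv('C:/Users/User/Desktop/1/output.csv', index=False)`. Memory use should depend on the chunk size and the number of distinct accounts rather than on the size of the file, although the right chunk size for an 8 GB machine is something to find by experiment.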