I am reading two large csv files, each with 315,000 rows and 300 columns. I want to read all of them in Python, but I currently run into a memory problem at around 50,000 rows. I have about 4 GB of RAM and each csv file is 1.5 GB. I am planning to try Amazon Web Services, but if anyone has tips on optimizing how the files are read, I'd love to hear them, since that would save me money!
Sample data with the first 2 of the 314,000 rows is here: https://drive.google.com/file/d/0B0MhJ7rn5OujR19LLVYyUFF5MVE/edit?usp=sharing
My Python (x,y) Spyder console gives the following error:
for row in getstuff(filename): (line 97)
for row in getdata("test.csv"): (line 89)
MemoryError
I also tried doing the following, as suggested in the comments, but I still get the memory error:
for row in getdata("train.csv"):
data.append(row[0::])
np.array(data)
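For completeness, here is that attempt in a self-contained form (a sketch matching my full script below: data starts as an empty list and the read is capped at the first 100,000 rows):

import csv
import numpy as np

data = []
with open("train.csv", "rb") as csvfile:
    for count, row in enumerate(csv.reader(csvfile)):
        if count >= 100000:          # same 100,000-row cap as in the script below
            break
        data.append(row[0::])        # every row is still kept in the Python list
np.array(data)                       # the array is only built after all rows are stored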
Here is the full code:
import csv
from xlrd import open_workbook
from xlutils.copy import copy
import numpy as np
import time
from sklearn.ensemble import RandomForestClassifier
from numpy import savetxt
from sklearn.feature_extraction import DictVectorizer
from xlwt import *

t0 = time.clock()
data = []
data1 = []
count = 0
print "Initializing..."

def getstuff(filename):
    with open(filename, "rb") as csvfile:
        datareader = csv.reader(csvfile)
        count = 0
        for row in datareader:
            if count < 100000:
                yield row
                count += 1
            elif count > 100000:
                return
            else:
                return

def getdata(filename):
    for row in getstuff(filename):
        yield row

for row in getdata("train.csv"):
    np.array(data.append(row[0::]))

for row in getdata("test.csv"):
    np.array(data1.append(row[0::]))

target = np.array([x[1] for x in data], dtype=object)
train = np.array([x[2:] for x in data], dtype=object)
test = np.array([x[1:] for x in data1], dtype=object)
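To make clearer what kind of optimization I'm hoping for, something along these lines is what I had in mind: converting each row to a compact numeric dtype as it is read, instead of keeping lists of Python strings (a rough, untested sketch; it assumes there is no header row and that every field, including the first id-like column, parses as a number, which may not hold for my data):

import csv
import numpy as np

def load_as_float32(filename):
    # Convert rows to float32 as they are read rather than storing strings:
    # 315,000 rows x 300 columns of float32 is roughly 360 MB per file.
    with open(filename, "rb") as csvfile:
        rows = [np.asarray(row, dtype=np.float32) for row in csv.reader(csvfile)]
    return np.vstack(rows)

train_full = load_as_float32("train.csv")
target = train_full[:, 1]   # same slicing as above: column 1 is the label
train = train_full[:, 2:]   # columns 2 onwards are the features

I don't know whether that alone avoids the MemoryError, or whether I still need to process the files in chunks (or on AWS), which is why I'm asking.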