Python MemoryError with numpy and the csv reader

Asked: 2014-08-06 03:50:27

Tags: python arrays csv numpy

I am reading two large csv files, each with 315,000 rows and 300 columns. I want to read all of the data in Python, but I currently hit a memory error at around 50,000 rows. I have about 4 GB of RAM, and each csv file is 1.5 GB. I plan to try Amazon Web Services, but if anyone has suggestions for optimizing the file reading, I'd love to save the money!

Sample data (the first 2 of 314,000 rows) is here: https://drive.google.com/file/d/0B0MhJ7rn5OujR19LLVYyUFF5MVE/edit?usp=sharing
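One general way around this kind of memory error is to process rows as they stream out of a generator instead of appending every row to a list first. A minimal Python 3 sketch of that pattern (the `iter_rows` helper and the `io.StringIO` stand-in for the real 1.5 GB file are illustrative, not from the question):

```python
import csv
import io

def iter_rows(fileobj, max_rows=None):
    """Yield CSV rows one at a time; optionally stop after max_rows rows."""
    reader = csv.reader(fileobj)
    for i, row in enumerate(reader):
        if max_rows is not None and i >= max_rows:
            return
        yield row

# In-memory stand-in for train.csv; the real file would be opened with open().
sample = io.StringIO("id,label,f1,f2\n1,0,0.5,0.7\n2,1,0.1,0.9\n")

total = 0
for row in iter_rows(sample, max_rows=10):
    total += 1  # replace with per-row processing; nothing is accumulated
print(total)  # 3: the header plus 2 data rows
```

Because only one row exists in memory at a time, this scales to files far larger than RAM, as long as each row can be processed (or written somewhere) before the next one is read.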

My Python(x,y) Spyder console gives the following error:

for row in getstuff(filename): (line 97)
for row in getdata("test.csv"): (line 89)
MemoryError

I also tried the following, as suggested in the comments, but I still get a memory error:

for row in getdata("train.csv"):
    data.append(row[0::])

np.array(data)
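Part of the problem is that `csv.reader` yields lists of Python string objects, which carry far more per-value overhead than the raw numbers. If the columns are numeric, preallocating a single `float32` array and filling it row by row is much more compact: a 315,000 × 300 `float32` array is about 360 MB, which fits in 4 GB of RAM. A minimal sketch with toy dimensions (the shapes and `rows` data below are made up for illustration):

```python
import numpy as np

n_rows, n_cols = 4, 3  # the real files are ~315,000 x 300

# One contiguous float32 block instead of a list of lists of strings.
data = np.empty((n_rows, n_cols), dtype=np.float32)

# Stand-in for rows coming out of csv.reader (lists of strings).
rows = [["1", "2", "3"], ["4", "5", "6"], ["7", "8", "9"], ["1", "1", "1"]]
for i, row in enumerate(rows):
    data[i, :] = row  # NumPy parses the strings into float32 in place

print(data.nbytes)  # 4 rows * 3 cols * 4 bytes = 48 bytes
```

The row-by-row assignment means the string lists are discarded as soon as each row is parsed, so peak memory stays close to the size of the final array.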

The full code:

import csv
from xlrd import open_workbook
from xlutils.copy import copy
import numpy as np
import time
from sklearn.ensemble import RandomForestClassifier
from numpy import savetxt
from sklearn.feature_extraction import DictVectorizer
from xlwt import *  # note: the xlrd/xlwt/sklearn imports are unused in this snippet


t0=time.clock()
data=[]
data1=[]

count=0
print "Initializing..."

def getstuff(filename):
  with open(filename, "rb") as csvfile:
    datareader = csv.reader(csvfile)
    count = 0
    for row in datareader:
        if count >= 100000:  # the original elif/else branches both returned,
            return           # so this single check is equivalent
        yield row
        count += 1

def getdata(filename):
  for row in getstuff(filename):
    yield row


for row in getdata("train.csv"):
   data.append(row)  # list.append returns None, so wrapping it in np.array() built nothing


for row in getdata("test.csv"):
   data1.append(row)


target = np.array([x[1] for x in data],dtype=object)
train = np.array([x[2:] for x in data],dtype=object)    
test = np.array([x[1:] for x in data1],dtype=object)    
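If the data is loaded into one numeric array up front, the `target`/`train` splits above can be taken as slices, which are views into the same memory rather than fresh per-row copies built from list comprehensions. A small sketch with made-up values, mirroring the column layout assumed in the question (column 1 = label, columns 2 onward = features):

```python
import numpy as np

# Tiny stand-in for the parsed train.csv contents.
data = np.array([[1, 0, 0.5, 0.7],
                 [2, 1, 0.1, 0.9]], dtype=np.float32)

target = data[:, 1]   # label column
train = data[:, 2:]   # feature columns

# Basic slicing returns views: no extra copies of the 1.5 GB of data.
print(target.shape, train.shape)
print(train.base is data)  # True: train shares memory with data
```

Keeping everything in a single `float32` array and slicing it also avoids `dtype=object` arrays, which store a Python object per cell and use far more memory than a numeric dtype.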

0 Answers:

No answers yet