I run the following code:
traindata = trainData.read_csv('train.tsv', delimiter = '\t')
which calls this function:
def read_csv(self, filename, delimiter = ',', quotechar = '"'):
    # open the file
    reader = csv.reader(open(filename, 'rb'), delimiter = delimiter, quotechar = quotechar)
    # read first line and extract its data
    self.column_headings = np.array(next(reader))
    # read subsequent lines
    rows = []
    for row in reader:
        rows.append(row)
    self.data = np.array(rows)
    self.m, self.n = self.data.shape
so that I can then call:
m, n = traindata.data.shape
print m, n, traindata.column_headings
Unfortunately, the call to read_csv raises this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-74-1cc5776f9a9c> in <module>()
13 print "loading data.."
14
---> 15 traindata = trainData.read_csv('test.tsv', delimiter = '\t')
16
C:\pc in read_csv(self, filename, delimiter, quotechar)
17 for row in reader:
18 rows.append(row)
---> 19 self.data = np.array(rows)
20 self.m, self.n = self.data.shape
21
ValueError: array is too big.
How can I fix this so that the code runs?
Edit: the data is a .tsv file, extract here.
Answer 0 (score: 7)
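NumPy is creating one huge array of fixed-width strings. Every element is given the width of the longest string in the array, so a single very long cell inflates the storage reserved for every other cell, and you are probably running out of RAM in the middle of that allocation.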
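To see the fixed-width behavior on a small scale, here is a minimal sketch. The sample rows are made up for illustration, and the exact dtype you get depends on your Python and NumPy versions (byte strings on Python 2, Unicode strings on Python 3).

```python
import numpy as np

# Hypothetical sample: three short cells and one 1000-character cell.
rows = [["a", "bb"], ["ccc", "x" * 1000]]

arr = np.array(rows)

# Every element shares one fixed width, sized for the longest string.
print(arr.dtype)           # a fixed-width string dtype
print(arr.dtype.itemsize)  # bytes reserved per element
print(arr.nbytes)          # itemsize * number of elements
```

Scale the same arithmetic up to a file with many rows and one long text column and the requested allocation can easily exceed what is available.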
With
self.data = np.array(rows, dtype=object)
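NumPy does not need to allocate big chunks of new memory for the string objects. dtype=object tells NumPy to keep its array contents as references to existing Python objects (the strings already exist in your Python list rows), and these pointers take up much less space than the string objects do.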
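As a rough illustration of the difference, the sketch below builds both kinds of array from the same list. The sample rows are hypothetical, and the byte counts will vary with platform and NumPy version, so only the relative sizes matter.

```python
import numpy as np

# Hypothetical sample: three short cells and one 1000-character cell.
rows = [["a", "bb"], ["ccc", "x" * 1000]]

fixed = np.array(rows)                # fixed-width string array
boxed = np.array(rows, dtype=object)  # array of references to the str objects

print(fixed.dtype, fixed.nbytes)
print(boxed.dtype, boxed.nbytes)

# The object array points at the very same string that sits in the list.
print(boxed[1, 1] is rows[1][1])
```

Note that nbytes for the object array counts only the pointers. The strings themselves still occupy memory, but they were already allocated when the file was read into rows, so no second copy is made.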