MemoryError with a large dataset (Python, NIST19)

Date: 2017-11-06 14:55:42

Tags: python memory dataset

I run into a problem when I try to train on the large NIST19 dataset in binary format. The file is 13 GB, and after a few minutes of training the script aborts with a MemoryError.

Here is the script:

#python train.py --dataset data/NIST19.csv --model models/svm.cpickle

from sklearn.externals import joblib
from sklearn.svm import LinearSVC
from hog import HOG
import dataset
import argparse

ap = argparse.ArgumentParser()
ap.add_argument("-d", "--dataset", required = True,
    help = "path to the dataset file")
ap.add_argument("-m", "--model", required = True,
    help = "path to where the model will be stored")
args = vars(ap.parse_args())
# load every digit image and its label from the CSV into memory
(digits, target) = dataset.load_digits(args["dataset"])
data = []

# HOG descriptor used to turn each 20x20 digit into a feature vector
hog = HOG(orientations = 18, pixelsPerCell = (10, 10),
    cellsPerBlock = (1, 1), transform = True)

# deskew and centre every digit, then collect its HOG descriptor
for image in digits:
    image = dataset.deskew(image, 20)
    image = dataset.center_extent(image, (20, 20))
    hist = hog.describe(image)
    data.append(hist)

# train a linear SVM on the feature vectors and serialize the model
model = LinearSVC(random_state = 42)
model.fit(data, target)
joblib.dump(model, args["model"])
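
For what it's worth, here is a minimal debugging sketch of how the peak memory could be checkpointed around the two heavy steps (parsing the CSV and building the feature list). It is not part of the actual training script: it assumes a Unix-like system (the resource module is not available on Windows), reuses the same dataset and hog helpers as above, and hard-codes the dataset path purely for illustration.

import resource

import dataset
from hog import HOG

def peak_rss_mb():
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS,
    # so treat the printed numbers as rough indicators only
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

print("start:      %.0f MB" % peak_rss_mb())

# the full CSV is parsed here, so this is the first place memory can blow up
(digits, target) = dataset.load_digits("data/NIST19.csv")
print("after load: %.0f MB" % peak_rss_mb())

hog = HOG(orientations = 18, pixelsPerCell = (10, 10),
    cellsPerBlock = (1, 1), transform = True)

# the feature list grows by one descriptor per image on top of the raw data
data = []
for image in digits:
    image = dataset.deskew(image, 20)
    image = dataset.center_extent(image, (20, 20))
    data.append(hog.describe(image))
print("after HOG:  %.0f MB" % peak_rss_mb())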

If you want to try it yourself, you can find the dataset here. It is a .txt binary dataset that I converted to .csv.

0 Answers:

No answers yet.