I run into a problem when I try to train on the large NIST19 dataset in binary format. It is 13 GB, and after a few minutes of training the script exits with a MemoryError.
This is the script:
#python train.py --dataset data/NIST19.csv --model models/svm.cpickle
from sklearn.externals import joblib
from sklearn.svm import LinearSVC
from hog import HOG
import dataset
import argparse

# parse the command line arguments
ap = argparse.ArgumentParser()
ap.add_argument("-d", "--dataset", required = True,
    help = "path to the dataset file")
ap.add_argument("-m", "--model", required = True,
    help = "path to where the model will be stored")
args = vars(ap.parse_args())

# load the whole dataset into memory at once
(digits, target) = dataset.load_digits(args["dataset"])
data = []

# initialize the HOG descriptor
hog = HOG(orientations = 18, pixelsPerCell = (10, 10),
    cellsPerBlock = (1, 1), transform = True)

# deskew and center every digit, then compute its HOG feature vector
for image in digits:
    image = dataset.deskew(image, 20)
    image = dataset.center_extent(image, (20, 20))
    hist = hog.describe(image)
    data.append(hist)

# train a linear SVM on the feature vectors and save it to disk
model = LinearSVC(random_state = 42)
model.fit(data, target)
joblib.dump(model, args["model"])
You can find the dataset here if you want to try it yourself. It is a binary .txt dataset that I converted to .csv.
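
I suspect the issue is that dataset.load_digits reads the entire 13 GB file into memory, and every HOG feature vector is then kept in a Python list before model.fit is even called. I was thinking of switching to out-of-core learning, roughly like the sketch below. It swaps LinearSVC for SGDClassifier (with hinge loss, so still a linear SVM) and reads the CSV in chunks with pandas; the label-in-first-column layout, the 28x28 image size, and the ten digit classes are assumptions on my part about how my converted CSV looks:

import numpy as np
import pandas as pd
from sklearn.externals import joblib
from sklearn.linear_model import SGDClassifier
from hog import HOG
import dataset

# same descriptor settings as in train.py
hog = HOG(orientations = 18, pixelsPerCell = (10, 10),
    cellsPerBlock = (1, 1), transform = True)

# SGDClassifier with hinge loss is a linear SVM that supports partial_fit
model = SGDClassifier(loss = "hinge", random_state = 42)
classes = np.arange(10)  # assumption: ten digit classes, must be listed up front

# read the 13 GB csv in chunks so only one chunk is in memory at a time
for chunk in pd.read_csv("data/NIST19.csv", header = None, chunksize = 10000):
    # assumption: label in the first column, flattened 28x28 pixels after it
    target = chunk.iloc[:, 0].values
    images = chunk.iloc[:, 1:].values.astype("uint8").reshape(-1, 28, 28)

    data = []
    for image in images:
        image = dataset.deskew(image, 20)
        image = dataset.center_extent(image, (20, 20))
        data.append(hog.describe(image))

    # update the model with just this chunk of feature vectors
    model.partial_fit(np.array(data, dtype = "float32"), target, classes = classes)

joblib.dump(model, "models/svm.cpickle")

Would something like this avoid the MemoryError, or is there a way to keep LinearSVC and still train on a dataset this size?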