Question

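I am trying to build a basic character recognition model using several of the classifiers that scikit provides. The dataset is a standard set of handwritten alphanumeric samples (the Chars74K image dataset, taken from source: EnglishHnd.tgz).

There are 55 samples for each character (62 alphanumeric characters in total), and each sample is 900x1200 pixels. I convert each image to grayscale and then flatten the matrix into a 1x1080000 array (each element represents one feature).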

for sample in sample_images:  # sample_images is the list of the .png files
    img = imread(sample)
    img_gray = rgb2gray(img)
    if n == 0 and m == 0:  # n and m are global variables
        n, m = np.shape(img_gray)
    img_gray = np.reshape(img_gray, n*m)
    img_gray = np.append(img_gray, sample_id)  # sample_id stores the label of the training sample
    if len(samples) == 0:  # samples is the final numpy ndarray
        samples = np.append(samples, img_gray)
        samples = np.reshape(samples, [1, n*m + 1])
    else:
        samples = np.append(samples, [img_gray], axis=0)

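So the final data structure should hold 55x62 arrays, each with 1080000 elements. Only the final structure is kept (the intermediate matrices are local in scope).

The amount of data stored for training the model is quite large (I guess), because the program never gets past a certain point, and it crashed my system so badly that the BIOS had to be repaired!

So far the program only collects the data that will be passed to the classifier; classification has not been added to the code yet.

Any suggestions on how to handle the data more efficiently?

Note: I am using numpy to store the final structure of flattened matrices. The system has 8 GB of RAM.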

Answer 1

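This looks like a case of stack overflow. If I understand your question, you have 3,682,800,000 array elements. What is the element type? If it is one byte, that is more than 3 gigabytes of data, which would easily fill your stack (typically around 1 megabyte). Even with one-bit elements, you are still at roughly 500 MB. Try using heap memory (up to 8 GB on your machine).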
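The element count and the resulting memory footprint can be checked with a few lines of arithmetic. This is only a sketch built from the figures in the question. The dtypes listed are illustrative assumptions, since the question does not state the element type; float64 is included because it is NumPy's default floating-point type.

```python
import numpy as np

# Figures taken from the question.
N_CLASSES = 62
SAMPLES_PER_CLASS = 55
HEIGHT, WIDTH = 900, 1200


def dataset_bytes(dtype):
    """Bytes needed to hold every flattened image at the given element type."""
    n_elements = N_CLASSES * SAMPLES_PER_CLASS * HEIGHT * WIDTH
    return n_elements * np.dtype(dtype).itemsize


# Candidate element types (assumptions, not stated in the question).
for dtype in (np.uint8, np.float32, np.float64):
    gib = dataset_bytes(dtype) / 2**30
    print(f"{np.dtype(dtype).name}: {gib:.1f} GiB")
```

With 8-byte elements the full dataset would need several times the 8 GB of RAM mentioned in the question, so the element type matters as much as the element count.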

Answer 2

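I was encouraged to post this as a solution, although the comments above may be more enlightening.

The problem with the user's program is twofold. Really, it is just overwhelming the stack.

The more common approach, especially in image work such as computer graphics or computer vision, is to process one image at a time. This fits well with sklearn, where you can update the model as each image is read in.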
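One way to update a model as data arrives is scikit-learn's `partial_fit`, which some estimators such as `SGDClassifier` support. The following is a minimal sketch, not the approach the question's author used: `iter_batches` is a hypothetical stand-in for reading a few images from disk, and the data is synthetic so the example runs on its own.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def iter_batches(n_batches, batch_size, n_features, n_classes, seed=0):
    """Hypothetical stand-in for loading a few images from disk at a time."""
    rng = np.random.default_rng(seed)
    for _ in range(n_batches):
        y = rng.integers(0, n_classes, size=batch_size)
        # Synthetic "images": noise plus a class-dependent offset.
        X = rng.normal(size=(batch_size, n_features)).astype(np.float32)
        X += y[:, None].astype(np.float32)
        yield X, y


n_classes = 3
n_features = 50
clf = SGDClassifier(random_state=0)
all_classes = np.arange(n_classes)  # partial_fit must be told every label up front

for X_batch, y_batch in iter_batches(20, 16, n_features, n_classes):
    # Only one small batch is in memory at any time.
    clf.partial_fit(X_batch, y_batch, classes=all_classes)
```

Only estimators that implement `partial_fit` can be trained this way; others need the whole training set in memory at once.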

You can use this code, adapted from this Stack Overflow post:

import os
import numpy as np

# imread, rgb2gray, n, m, samples and sample_id are the same names
# used in the question's code.
rootdir = '/path/to/my/pictures'

for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        if file.endswith('.png'):  # or whatever your file type is / some check
            # do your training here
            # os.walk yields bare file names, so join them with their directory
            img = imread(os.path.join(subdir, file))

            img_gray = rgb2gray(img)
            if n == 0 and m == 0:  # n and m are global variables
                n, m = np.shape(img_gray)
            img_gray = np.reshape(img_gray, n*m)

            # sample_id stores the label of the training sample
            img_gray = np.append(img_gray, sample_id)

            # samples is the final numpy ndarray
            if len(samples) == 0:
                samples = np.append(samples, img_gray)
                samples = np.reshape(samples, [1, n*m + 1])
            else:
                samples = np.append(samples, [img_gray], axis=0)

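This is closer to pseudocode, but the overall flow should give you the right idea. Let me know if there is anything else I can do! If you are interested in some cool deep learning algorithms, check out OpenCV. It has a bunch of cool stuff, and images make great sample data.

Hope this helps.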
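If the samples do need to be collected in a single array, note that `np.append` returns a new copy of the whole array on every call, so growing `samples` inside the loop gets slower and more memory-hungry as it fills. A possible alternative, sketched here under assumptions, is to allocate the array once with a compact dtype and write each row in place. The function name `fill_preallocated`, the tiny 9x12 image size, and the assumption that grayscale values are floats in [0, 1] are all illustrative.

```python
import numpy as np


def fill_preallocated(images, labels, height, width):
    """Copy grayscale images (floats in [0, 1]) into one preallocated uint8 array."""
    X = np.empty((len(images), height * width), dtype=np.uint8)
    y = np.asarray(labels)  # labels kept separate instead of appended to each row
    for i, img in enumerate(images):
        # Scale to 0..255 and flatten; each row is written in place, no copies of X.
        X[i] = np.round(img * 255).astype(np.uint8).reshape(-1)
    return X, y


# Tiny synthetic stand-ins for the real 900x1200 images.
rng = np.random.default_rng(0)
images = [rng.random((9, 12)) for _ in range(5)]
labels = [0, 1, 2, 3, 4]
X, y = fill_preallocated(images, labels, 9, 12)
```

Keeping the labels in a separate array also avoids mixing the label into the pixel features, which is the layout scikit-learn estimators expect (`X` and `y` passed separately).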

Python (numpy) crashes the system with a large number of array elements
