Question

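I'm a beginner with Python, scikit-learn and numpy. I have a set of folders containing images, and I want to apply different machine learning algorithms to them. However, I'm struggling to turn these images into usable numpy data.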

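These are my prerequisites:

Each folder name is the key (label) for the images inside it. For example, /birds/abc123.jpg and /birds/def456.jpg are both "birds"
Every image is a 100x100px jpg
I'm using Python 2.7
There are 2800 images in total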

This is the code I have so far:

# Standard scientific Python imports
import matplotlib.pyplot as plt

# Import datasets, classifiers and performance metrics
from sklearn import svm, metrics

import numpy as np

import os # Working with files and folders

from PIL import Image # Image processing

rootdir = os.getcwd()
key_array = []
pixel_arr = np.empty((0,10000), int)

for subdir, dirs, files in os.walk('data'):
  dir_name = subdir.split("/")[-1]
  if "x" in dir_name:
    key_array.append(dir_name)
    for file in files:
      if ".DS_Store" not in file:
        file = os.path.join(subdir, file)
        im = Image.open(file)
        im_bw = im.convert('1') #Black and white
        new_np = np.array(im_bw).reshape(1,-1) # one row of 10000 pixels
        print(new_np.shape)
        pixel_arr = np.append(pixel_arr, new_np, axis=0) # copies the whole array on every call

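What works in this code is walking through the folders, getting the folder names, and picking up the right files/images. What I can't get working is creating a numpy array of shape 2800x10000 (or maybe 10000x2800 is the correct one), i.e. 2800 rows with 10000 values in each row.

Even if this solution works (and I'm not sure it does), it is very slow, and I'm sure there must be a faster and more elegant way!

How can I create this 2800x10000 numpy array, preferably with the index number from key_array attached to each row?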

Answer 1

If you don't need all the images at the same time, you can use a generator.

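import os
import numpy as np
from PIL import Image

key_array = []  # folder names, filled in as the generator runs

def get_images():
  for subdir, dirs, files in os.walk('data'):
    dir_name = subdir.split("/")[-1]
    if "x" in dir_name:
      key_array.append(dir_name)
      for file in files:
        if ".DS_Store" not in file:
          file = os.path.join(subdir, file)
          im = Image.open(file)
          im_bw = im.convert('1') #Black and white

          # Hand back one flattened image (shape 1x10000) at a time
          yield np.array(im_bw).reshape(1,-1)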

This way you don't keep all the images in memory at the same time, which may help you.

To use the images afterwards:

for image in get_images():
  ...
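
If you do need the full 2800x10000 array at once, one common approach is to allocate it up front and fill it row by row instead of calling np.append, which copies the whole array on every iteration. The following is only a sketch under some assumptions: the helper names list_labels, iter_labeled_images and build_dataset are made up for illustration, every sub-folder of data is treated as a label (the "x" filter from the question is not reproduced), and the label index is the position of the folder name in the sorted list of folders.

```python
import os

import numpy as np


def list_labels(root="data"):
    """Return the sorted sub-folder names of root; the position is the label index."""
    return sorted(d for d in os.listdir(root)
                  if os.path.isdir(os.path.join(root, d)))


def iter_labeled_images(root, labels, mode="1"):
    """Yield (label_index, flat_pixels) for every image in root/<label>/."""
    from PIL import Image  # imported here so build_dataset works without Pillow

    for label_index, label in enumerate(labels):
        folder = os.path.join(root, label)
        for name in sorted(os.listdir(folder)):
            if name.startswith("."):  # skip .DS_Store and other hidden files
                continue
            im = Image.open(os.path.join(folder, name)).convert(mode)
            yield label_index, np.asarray(im).ravel()


def build_dataset(pairs, n_images, n_pixels):
    """Fill preallocated arrays from an iterable of (label_index, flat_pixels).

    n_images is an upper bound; the result is trimmed to the rows actually read.
    """
    X = np.empty((n_images, n_pixels), dtype=np.uint8)
    y = np.empty(n_images, dtype=np.intp)
    count = 0
    for count, (label_index, pixels) in enumerate(pairs, start=1):
        X[count - 1] = pixels       # one image per row
        y[count - 1] = label_index  # index into the labels list
    return X[:count], y[:count]
```

With the layout from the question this would be called roughly as labels = list_labels("data") followed by X, y = build_dataset(iter_labeled_images("data", labels), 2800, 100 * 100). X then has one image per row and labels[y[i]] is the folder name of row i, which is the samples-by-features layout that scikit-learn estimators expect for fit(X, y).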
