mean() got an unexpected keyword argument 'dtype'!

Asked: 2017-07-12 10:13:49

Tags: python numpy apache-spark pyspark bigdl

I am trying to implement image classification with Intel BigDL. Its examples use the MNIST dataset for classification. Since I do not want to use the MNIST dataset, I wrote the alternative below:

imageUtils.py


Now, when I try to load the data with real images as follows:

Classification.py

from StringIO import StringIO
from PIL import Image
import numpy as np
from bigdl.util import common
from bigdl.dataset import mnist
from pyspark.mllib.stat import Statistics

def label_img(img):
    word_label = img.split('.')[-2].split('/')[-1]
    print word_label
    # conversion to one-hot array [cat,dog]
    #                            [much cat, no dog]
    if "jobs" in word_label: return [1,0]
    #                             [no cat, very doggo]
    elif "zuckerberg" in word_label: return [0,1]

    # targets start from 0,

def get_data(sc,path):
    img_dir = path
    train = sc.binaryFiles(img_dir + "/train")
    test = sc.binaryFiles(img_dir+"/test")
    image_to_array = lambda rawdata: np.asarray(Image.open(StringIO(rawdata)))

    train_data = train.map(lambda x : (image_to_array(x[1]),np.array(label_img(x[0]))))
    test_data = test.map(lambda x : (image_to_array(x[1]),np.array(label_img(x[0]))))

    train_images = train_data.map(lambda x : x[0])
    test_images = test_data.map((lambda x : x[0]))
    train_labels = train_data.map(lambda x : x[1])
    test_labels = test_data.map(lambda x : x[1])

    training_mean = np.mean(train_images)
    training_std = np.std(train_images)
    rdd_train_images = sc.parallelize(train_images)
    rdd_train_labels = sc.parallelize(train_labels)
    rdd_test_images = sc.parallelize(test_images)
    rdd_test_labels = sc.parallelize(test_labels)

    rdd_train_sample = rdd_train_images.zip(rdd_train_labels).map(lambda (features, label):
                                        common.Sample.from_ndarray(
                                        (features - training_mean) / training_std,
                                        label + 1))
    rdd_test_sample = rdd_test_images.zip(rdd_test_labels).map(lambda (features, label):
                                        common.Sample.from_ndarray(
                                        (features - training_mean) / training_std,
                                        label + 1))

    return (rdd_train_sample, rdd_test_sample)
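Before involving Spark, the normalization arithmetic inside get_data (subtract the training mean, divide by the training std, then wrap in a Sample) can be checked with plain NumPy. A minimal sketch using made-up 2x2 "images" rather than real decoded files:

```python
import numpy as np

# Dummy stand-ins for the decoded PIL images: four 2x2 float arrays
train_images = [np.full((2, 2), float(i)) for i in range(4)]

# On a plain Python list of arrays, np.mean/np.std work fine; the question's
# error only appears once train_images is a Spark RDD instead of a list.
training_mean = np.mean(train_images)   # 1.5
training_std = np.std(train_images)

# The same (x - mean) / std transform that get_data feeds into Sample.from_ndarray
normalized = [(img - training_mean) / training_std for img in train_images]
print(np.mean(normalized))  # ~0.0 after standardization
```

After this global standardization the pixel values have mean 0 and standard deviation 1, which is the usual preprocessing before training.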

I get the following error:

  

    TypeError                                 Traceback (most recent call last)
    <ipython-input> in <module>()
          2 # Get and store MNIST into RDD of Sample, please edit "mnist_path" accordingly
          3 path = "/home/fusemachine/Hyper/person"
    ----> 4 (train_data, test_data) = get_data(sc, path)
          5 print train_data.count()
          6 print test_data.count()

    /home/fusemachine/Downloads/dist-spark-2.1.0-scala-2.11.8-linux64-0.1.1-dist/imageUtils.py in get_data(sc, path)
         31     test_labels = test_data.map(lambda x : x[1])
    ---> 33     training_mean = np.mean(train_images)
         34     training_std = np.std(train_images)
         35     rdd_train_images = sc.parallelize(train_images)

    /opt/anaconda3/lib/python2.7/site-packages/numpy/core/fromnumeric.pyc in mean(a, axis, dtype, out, keepdims)
       2884             pass
       2885         else:
    -> 2886             return mean(axis=axis, dtype=dtype, out=out, **kwargs)
       2887
       2888     return _methods._mean(a, axis=axis, dtype=dtype,

    TypeError: mean() got an unexpected keyword argument 'dtype'

I could not find a solution for this. Also, is there any other alternative to the MNIST dataset, so that we can work directly with real images? Thank you.

1 answer:

Answer 0: (score: 0)

train_images is an RDD, so you cannot apply numpy mean to it directly. One way is to collect() first and then apply numpy mean:

 train_images = train_data.map(lambda x : x[0]).collect()
 training_mean = np.mean(train_images)

or use rdd.mean():

  training_mean = train_images.mean()
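The reason np.mean(train_images) raises this particular TypeError, rather than simply rejecting the RDD, is that for non-ndarray arguments np.mean delegates to the object's own .mean method and passes NumPy-only keywords such as dtype, which pyspark's RDD.mean() does not accept. A minimal illustration with a hypothetical stand-in class (no Spark required):

```python
import numpy as np

class FakeRDD(object):
    """Hypothetical stand-in for a Spark RDD: it exposes mean(),
    but without NumPy's axis/dtype/out keyword arguments."""
    def __init__(self, values):
        self.values = values

    def mean(self):
        return sum(self.values) / float(len(self.values))

rdd = FakeRDD([1.0, 2.0, 3.0])

# np.mean sees a non-ndarray with a .mean attribute and calls it as
# rdd.mean(axis=None, dtype=None, out=None) -- hence the TypeError.
try:
    np.mean(rdd)
except TypeError as e:
    print("np.mean failed:", e)

# Calling the object's own mean() directly works, as the answer suggests.
print(rdd.mean())  # 2.0
```

This is why either collecting to a local list first (so np.mean sees ordinary arrays) or calling the RDD's own mean() resolves the error.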