DecisionTree classifier in pyspark mllib

Time: 2016-06-02 23:21:34

Tags: apache-spark pyspark apache-spark-mllib

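I am running the DecisionTree classifier from pyspark mllib. Training works when I pass the dictionary literal directly to categoricalFeaturesInfo, but it fails when I pass the same dictionary stored in a variable. The code for both cases follows.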

Case 1)
    cat_counts = {1: 12, 2: 3, 3: 4, 4: 2, 6: 2, 7: 2, 8: 3, 10: 12, 15: 4}
    DecisionTree.trainClassifier(data, numClasses=2, categoricalFeaturesInfo=cat_counts, impurity='gini')

Case 2)
    DecisionTree.trainClassifier(data, numClasses=2, categoricalFeaturesInfo={1: 12, 2: 3, 3: 4, 4: 2, 6: 2, 7: 2, 8: 3, 10: 12, 15: 4}, impurity='gini')

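Case 2 runs without any error, but case 1 raises the error listed below. In both cases I supply the same dictionary, which maps the index of each categorical feature to its number of unique levels.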
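For readers unfamiliar with the shape of this argument, here is a minimal sketch of building such an index-to-level-count mapping with built-in Python ints only. The helper name `count_levels`, the sample `rows`, and the choice of categorical columns are all hypothetical and are not taken from the original code.

```python
def count_levels(rows, categorical_indices):
    """Map each categorical feature index to its number of distinct values."""
    seen = {int(idx): set() for idx in categorical_indices}
    for row in rows:
        for idx in seen:
            seen[idx].add(row[idx])
    # len() always returns a built-in int, so the result holds plain ints only.
    return {idx: len(values) for idx, values in seen.items()}


# Hypothetical feature rows: column 0 is continuous,
# columns 1 and 2 hold categorical codes.
rows = [
    [0.5, 0.0, 1.0],
    [1.5, 1.0, 1.0],
    [2.5, 0.0, 0.0],
    [3.5, 2.0, 0.0],
]
cat_counts = count_levels(rows, categorical_indices=[1, 2])
print(cat_counts)  # {1: 3, 2: 2}
```

A dictionary built this way has the same shape as the literal used in case 2.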

  

An error occurred while calling z:org.apache.spark.mllib.api.python.SerDe.loads.
: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
    at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
    at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
    at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
    at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
    at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
    at org.apache.spark.mllib.api.python.SerDe$.loads(PythonMLLibAPI.scala:1475)
    at org.apache.spark.mllib.api.python.SerDe.loads(PythonMLLibAPI.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)

     

Traceback (most recent call last):
  File "/var/tmp/spark/4550967606900043523", line 90, in execute
    exec code in global_dict
  File "", line 1, in
  File "/usr/hdp/current/spark-client/python/pyspark/mllib/tree.py", line 206, in trainClassifier
    impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
  File "/usr/hdp/current/spark-client/python/pyspark/mllib/tree.py", line 146, in _train
    impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
  File "/usr/hdp/current/spark-client/python/pyspark/mllib/common.py", line 130, in callMLlibFunc
    return callJavaFunc(sc, api, *args)
  File "/usr/hdp/current/spark-client/python/pyspark/mllib/common.py", line 122, in callJavaFunc
    args = [_py2java(sc, a) for a in args]
  File "/usr/hdp/current/spark-client/python/pyspark/mllib/common.py", line 88, in _py2java
    obj = sc._jvm.SerDe.loads(data)
  File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/hdp/current/spark-client/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling z:org.apache.spark.mllib.api.python.SerDe.loads.
: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
    at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
    at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
    at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
    at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
    at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
    at org.apache.spark.mllib.api.python.SerDe$.loads(PythonMLLibAPI.scala:1475)
    at org.apache.spark.mllib.api.python.SerDe.loads(PythonMLLibAPI.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)

From that error I could trace that it happens when doing the following: PickleSerializer().dumps(obj)
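Since the exception names numpy.dtype, one way to narrow this down is to list which classes a pickled payload refers to, because the unpickler on the receiving side has to know how to construct each of them. The following is only a sketch using the standard library's `pickle` and `pickletools` in place of Spark's PickleSerializer. It assumes pickle protocol 2, and `referenced_globals` is a hypothetical helper name. `Fraction` is used purely as a stand-in for a non-builtin numeric type, so the example runs without NumPy.

```python
import pickle
import pickletools
from fractions import Fraction


def referenced_globals(obj, protocol=2):
    """Return the 'module name' pairs that the pickle of obj refers to."""
    payload = pickle.dumps(obj, protocol)
    return [
        arg
        for opcode, arg, _pos in pickletools.genops(payload)
        if opcode.name == "GLOBAL"
    ]


# A dict of built-in ints needs no class lookups on the receiving side.
plain = {1: 12, 2: 3, 3: 4}
print(referenced_globals(plain))   # []

# Fraction is only a stdlib stand-in for "some non-builtin numeric type".
mixed = {1: Fraction(3, 4)}
print(referenced_globals(mixed))   # ['fractions Fraction']
```

An empty list means the payload consists of primitive opcodes only. Any entry in the list is a class the other side must be able to rebuild.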

So I checked the following: PickleSerializer().dumps(cat_counts), which gives the following output

  

'\x80\x02}q\x01(cnumpy.core.multiarray\nscalar\nq\x02cnumpy\ndtype\nq\x03U\x02i8K\x00K\x01\x87Rq\x04(K\x03U\x01

and PickleSerializer().dumps({1:12,2:3,3:4,4:2,6:2,7:2,8:3,10:12,15:4}) gives the following output

  

'\x80\x02}q\x01(K\x01K\x0cK\x02K\x03K\x03K\x04K\x04K\x02K\x06K\x02K\x07K\x02K\x08K\x03K\nK\x0cK\x0fK\x04u'
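The first dump mentions numpy.core.multiarray scalar and a numpy dtype of i8, while the second contains only small-integer opcodes. One possible reading is that cat_counts holds NumPy integer scalars rather than built-in Python ints. Under that assumption, which has not been verified against Spark here, a minimal sketch of normalizing the mapping before passing it on could look like this. `to_plain_ints` and `Int64Like` are hypothetical names. `Int64Like` merely stands in for a NumPy scalar so that the sketch runs without NumPy.

```python
import operator
import pickle


def to_plain_ints(mapping):
    """Return a copy of mapping whose keys and values are built-in ints.

    operator.index() accepts integer-like objects (anything defining
    __index__) and rejects floats and strings instead of silently rounding.
    """
    return {
        int(operator.index(key)): int(operator.index(value))
        for key, value in mapping.items()
    }


class Int64Like:
    """Hypothetical stand-in for an integer scalar from a numeric library."""

    def __init__(self, value):
        self._value = value

    def __index__(self):
        return self._value

    def __hash__(self):
        return hash(self._value)

    def __eq__(self, other):
        return isinstance(other, Int64Like) and other._value == self._value


wrapped = {Int64Like(1): Int64Like(12), Int64Like(2): Int64Like(3)}
cleaned = to_plain_ints(wrapped)

# After conversion the payload is byte-for-byte the same as for a literal dict.
assert pickle.dumps(cleaned, 2) == pickle.dumps({1: 12, 2: 3}, 2)
print(cleaned)  # {1: 12, 2: 3}
```

If this reading is right, passing to_plain_ints(cat_counts) as categoricalFeaturesInfo might make case 1 behave like case 2.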

I am struggling to see how to fix this. Any help is much appreciated. Thanks in advance! -Satish

0 answers:

No answers