I am running a DecisionTree classifier with PySpark MLlib, and I run into a problem when I pass categoricalFeaturesInfo a dictionary object stored in a variable; passing the dictionary literal directly works fine. Here is the code for both cases.
case 1)

    from pyspark.mllib.tree import DecisionTree

    cat_counts = {1: 12, 2: 3, 3: 4, 4: 2, 6: 2, 7: 2, 8: 3, 10: 12, 15: 4}
    DecisionTree.trainClassifier(data, numClasses=2, categoricalFeaturesInfo=cat_counts, impurity='gini')
case 2)

    DecisionTree.trainClassifier(data, numClasses=2,
                                 categoricalFeaturesInfo={1: 12, 2: 3, 3: 4, 4: 2, 6: 2, 7: 2, 8: 3, 10: 12, 15: 4},
                                 impurity='gini')
Case 2 executes without any error, but case 1 throws the error listed below. In both cases I am supplying the same dictionary, which maps the index of each categorical feature to the number of unique levels in that feature.
    Traceback (most recent call last):
      File "/var/tmp/spark/4550967606900043523", line 90, in execute
        exec code in global_dict
      File "", line 1, in 
      File "/usr/hdp/current/spark-client/python/pyspark/mllib/tree.py", line 206, in trainClassifier
        impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
      File "/usr/hdp/current/spark-client/python/pyspark/mllib/tree.py", line 146, in _train
        impurity, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
      File "/usr/hdp/current/spark-client/python/pyspark/mllib/common.py", line 130, in callMLlibFunc
        return callJavaFunc(sc, api, *args)
      File "/usr/hdp/current/spark-client/python/pyspark/mllib/common.py", line 122, in callJavaFunc
        args = [_py2java(sc, a) for a in args]
      File "/usr/hdp/current/spark-client/python/pyspark/mllib/common.py", line 88, in _py2java
        obj = sc._jvm.SerDe.loads(data)
      File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
        answer, self.gateway_client, self.target_id, self.name)
      File "/usr/hdp/current/spark-client/python/pyspark/sql/utils.py", line 45, in deco
        return f(*a, **kw)
      File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
        format(target_id, ".", name), value)
    Py4JJavaError: An error occurred while calling z:org.apache.spark.mllib.api.python.SerDe.loads.
    : net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
        at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
        at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
        at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
        at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
        at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
        at org.apache.spark.mllib.api.python.SerDe$.loads(PythonMLLibAPI.scala:1475)
        at org.apache.spark.mllib.api.python.SerDe.loads(PythonMLLibAPI.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)
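The reference to numpy.dtype in that exception makes me suspect that the dictionary stored in the variable carries numpy scalar keys and values rather than built-in Python ints. I have not shown how cat_counts is actually computed, so the snippet below is only a hypothetical reconstruction (the feature_indices / level_counts arrays are made up for illustration) of how a dictionary can end up with numpy.int64 elements, while a hand-written literal only contains plain ints:

    import numpy as np

    # Hypothetical reconstruction (assumption, not my actual code): the feature
    # indices and level counts come back as numpy arrays, e.g. from np.unique().
    feature_indices = np.array([1, 2, 3, 4, 6, 7, 8, 10, 15])
    level_counts = np.array([12, 3, 4, 2, 2, 2, 3, 12, 4])

    # Iterating numpy arrays yields numpy.int64 scalars, so every key and value
    # in this dict is a numpy scalar rather than a built-in int.
    cat_counts_from_numpy = dict(zip(feature_indices, level_counts))
    print(type(next(iter(cat_counts_from_numpy))))   # <class 'numpy.int64'>

    # A literal written by hand contains only plain Python ints.
    print(type(next(iter({1: 12, 2: 3}))))           # <class 'int'>

If cat_counts was built in a similar way, that would explain why two dictionaries that print identically still serialize differently, which matches what I see below.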
From that error I could trace that it occurs while executing PickleSerializer().dumps(obj).
So I checked the following: PickleSerializer().dumps(cat_counts) gives this output
    '\x80\x02}q\x01(cnumpy.core.multiarray\nscalar\nq\x02cnumpy\ndtype\nq\x03U\x02i8K\x00K\x01\x87Rq\x04(K\x03U\x01
while PickleSerializer().dumps({1: 12, 2: 3, 3: 4, 4: 2, 6: 2, 7: 2, 8: 3, 10: 12, 15: 4}) gives this output
    '\x80\x02}q\x01(K\x01K\x0cK\x02K\x03K\x03K\x04K\x04K\x02K\x06K\x02K\x07K\x02K\x08K\x03K\nK\x0cK\x0fK\x04u'
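So the stored dictionary pickles through numpy (the cnumpy.core.multiarray / ndtype opcodes), while the literal pickles plain ints. Below is a sketch of the workaround I am considering but have not verified end to end: inspect the element types and coerce every key and value to a built-in int so the variable serializes exactly like the literal. cat_counts_plain is just a name I made up; data and cat_counts are the variables from above.

    from pyspark.serializers import PickleSerializer
    from pyspark.mllib.tree import DecisionTree

    # Confirm what the stored dictionary actually contains.
    for k, v in cat_counts.items():
        print(type(k), type(v))   # numpy.int64 here would explain the numpy.dtype pickle

    # Coerce everything to built-in ints so it serializes like the literal did.
    cat_counts_plain = {int(k): int(v) for k, v in cat_counts.items()}
    print(PickleSerializer().dumps(cat_counts_plain))   # should now match the literal's dump

    model = DecisionTree.trainClassifier(data, numClasses=2,
                                         categoricalFeaturesInfo=cat_counts_plain,
                                         impurity='gini')

Is this int() coercion the right way to pass categoricalFeaturesInfo, or is something else going on?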
I am struggling to see how to resolve this. Any help is much appreciated. Thanks in advance! -Satish