How to handle categoricalFeaturesInfo?

Date: 2016-06-24 08:02:54

标签: apache-spark pyspark random-forest apache-spark-mllib

How do I handle categoricalFeaturesInfo in RandomForest?

I built a list of variables like this:

alllist = listdouble + listint + listcategorielfeatures

But when I build the LabeledPoints, this ordering is lost.

How can I keep track of which features are ints and which are the categorical features I previously encoded with StringIndexer, so that the indices in categoricalFeaturesInfo still match the feature vector?
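One way to avoid losing the ordering is to treat `alllist` as the single source of truth and derive `categoricalFeaturesInfo` from positions in that list, rather than hard-coding indices. A minimal sketch in plain Python (the column names and arities below are illustrative, not from the question; in practice the arity of each categorical column would come from the fitted `StringIndexerModel`):

```python
# Hypothetical column lists mirroring the question's alllist; names are made up.
listdouble = ["amount", "balance"]         # continuous double columns
listint = ["age", "num_children"]          # continuous int columns
listcategorical = ["country", "status"]    # StringIndexer-encoded columns

# One canonical ordering: always assemble feature vectors from this list,
# so feature index i always refers to the same column.
alllist = listdouble + listint + listcategorical

# Arity of each categorical column (number of distinct StringIndexer labels).
arity = {"country": 17, "status": 3}

# categoricalFeaturesInfo maps feature *index* -> number of categories.
categoricalFeaturesInfo = {
    alllist.index(col): arity[col] for col in listcategorical
}
# Here this yields {4: 17, 5: 3}: the categorical columns sit after the
# 2 doubles and 2 ints, at indices 4 and 5.
```

As long as every LabeledPoint's feature vector is built in `alllist` order, this dict can be passed as `categoricalFeaturesInfo` to `RandomForest.trainClassifier` and the indices will line up.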

Error:
Py4JJavaError: An error occurred while calling o20271.trainRandomForestModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 29 in stage 1898.0 failed 4 times, most recent failure: Lost task 29.3 in stage 1898.0 (TID 748788, prbigdata1s013.bigplay.bigdata.intraxa): java.lang.IllegalArgumentException: DecisionTree given invalid data: Feature 517 is categorical with values in {0,...,16, but a data point gives it value 48940.0.
  Bad data point: (1.0,(825,[0,1,2,4,8,17,19,21,27,31,32,50,52,56,57,75,77,78,79,80,83,89,96,97,98,99,101,103,104,105,108,114,121,122,123,124,126,128,129,130,132,133,134,135,136,138,139,140,141,142,156,157,160,161,163,164,165,166,167,181,182,185,186,187,190,191,202,203,204,205,206,207,208,209,210,213,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,238,245,246,247,248,249,250,251,260,262,263,264,265,266,269,270,271,272,273,275,276,277,278,279,280,281,282,283,284,293,294,295,298,308,309,312,328,350,368,371,379,384,385,388,389,390,391,392,393,394,395,396,397,398,402,403,404,405,406,407,408,409,410,411,412,416,417,418,419,420,421,422,423,424,425,426,428,429,430,431,432,433,434,435,436,437,438,439,440,447,448,449,450,451,452,453,454,455,456,457,460,464,465,466,470,473,477,481,482,483,484,485,486,487,488,489,490,491,492,493,496,497,498,499,500,501,502,503,504,505,506,507,508,511,512,513,514,515,516,517,518,519,520,521,522,523,526,527,528,529,530,531,532,533,534,535,536,537,538,541,542,550,554,556,562,564,565,566,567,568,569,570,571,572,573,574,575,576,644,646,647,648,649,651,654,655,656,657,663,664,666,667,668,669,670,671,672,673,675,677,678,679,680,681,682,683,684,685,687,688,689,690,691,692,693,694,695,696,697,698,699,700,704,709,710,711,712,713,714,715,716,717,718,729,734,735,737,738,739,740,741,742,743,744,745,747,748,749,750,751,752,753,754,755,756,758,760,761,764,765,766,767,768,769,774,776,777,779,780,781,782,783,784,786,787,788,789,790,791,793,794,796,797,798,799,800,801,802,803,804,805,808,809,810,811,814,816,817,824],[10.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,200.0,2000.0,2000.0,460.0,305.0,2000.0,2000.0,460.0,305.0,81.76,69.8,31.66,5.28,18.8,162.06,20.6,51.96,27.6,108.74,77.5,66.16,30.0,5.0,17.82,153.62,19.52,49.24,26.18,103.08,1.23456789E9,1.23456789E9,1.23456789E9,0.01,2.0,14.0,14.0,63.0,3.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,3.0,3.0,1.0,1.0,1.0,1.0,3.0,2.0,2.0,3.0,1.0,400.0,1.0,1.0,13.0,15.0,19.0,20.0,25.0,1.23456789E9,1.23456789E9,1.23456789E9,3.0,5.0,6.0,7
.0,8.0,9.0,10.0,12.0,1.0,13.0,15.0,19.0,20.0,25.0,1.23456789E9,1.23456789E9,1.23456789E9,3.0,5.0,6.0,7.0,8.0,9.0,10.0,12.0,1210.0,8.0,121112.0,130912.0,28.0,1.0,17450.0,1.0,8.0,1.0,1.0,8508.0,8508.0,10550.0,10000.0,8889.0,8426.0,8889.0,8426.0,8889.0,8426.0,8889.0,8426.0,2.0,1.0,100.0,100.0,4.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,4.0,5.0,3.0,4.0,10.0,11.0,12.0,10.0,10.0,7.0,4.0,4.0,4.0,3.0,4.0,10.0,11.0,12.0,10.0,10.0,7.0,4.0,4.0,4.0,3.0,4.0,10.0,11.0,13.0,10.0,9.0,7.0,5.0,1.0,1.0,4.0,4.0,3.0,4.0,10.0,11.0,14.0,10.0,9.0,7.0,5.0,4.0,5.0,3.0,4.0,10.0,11.0,12.0,10.0,10.0,7.0,4.0,2.0,1.0,1.0,1.0,15.0,1.0,1.0,38335.0,8815.0,78408.0,44160.0,37187.0,1079.0,51630.0,11873.0,17102.0,11839.0,10126.0,22676.0,7000.0,39303.0,9037.0,81842.0,48036.0,37187.0,1116.0,51630.0,11873.0,17102.0,11839.0,10126.0,22676.0,7000.0,40971.0,9422.0,80086.0,44257.0,37000.0,1064.0,48940.0,11255.0,16212.0,11224.0,9598.0,18600.0,7948.0,40971.0,9422.0,80086.0,44257.0,37000.0,1064.0,48940.0,11255.0,16212.0,11224.0,9598.0,18600.0,7948.0,1.2345678901234567E9,1.2345678901234567E9,1381780.0,1183365.0,1.23456789E9,1.0,1400.0,1400.0,1400.0,1400.0,1400.0,800.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,1400.0,462191.0,462191.0,677785.0,694715.0,729570.0,8.0,2.0,16.0,6.0,1.0,4.0,1.0,1.23456789E9,1.0,1.23456789E9,1.23456789E9,1.23456789E9,1.23456789E9,1.23456789E9,68.0,3304.0,24.0,54.0,34.0,2654.0,84.0,2494.0,2504.0,2534.0,44.0,6908.9,766.7,176.3,1568.16,883.2,743.74,21.58,1032.6,237.46,342.04,236.78,202.52,453.52,140.0,1.0,3.0,1.0,3.0,3.0,5.0,6.0,3.0,2.0,1.0,1.0,13.0,16.0,1.23456789E9,1.0,743.74,1595.08,342.52,21.58,1910.2,413.76,453.52,1119.98,1799.3,6804.6,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4621.91,1.0,2533.0,2940.49,2940.49,-1242.0,64.0,8832.67,1.0,2.0,1.0,101.0,1398.0,1581.0,1581.0,2281.0,1145.0,5.0,1.23456789E9,1070.0,4.0,50.0,1.2345678901234567E9,3000.0,1.23456789E9,5499.0,66240.0,66.0,1.23456789E9,1.23456789E9,1.0,1.0,1.23456789E9,3.0,1.23456789E9,1.23456789E9,1.23456789E9,3.0,3.0,1.0,6.0,1.23456789E9,1.23456789E
9]))
    at org.apache.spark.mllib.tree.impl.TreePoint$.findBin(TreePoint.scala:140)
    at org.apache.spark.mllib.tree.impl.TreePoint$.org$apache$spark$mllib$tree$impl$TreePoint$$labeledPointToTreePoint(TreePoint.scala:84)
    at org.apache.spark.mllib.tree.impl.TreePoint$$anonfun$convertToTreeRDD$2.apply(TreePoint.scala:66)
    at org.apache.spark.mllib.tree.impl.TreePoint$$anonfun$convertToTreeRDD$2.apply(TreePoint.scala:65)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1294)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1282)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1281)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1281)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1507)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1469)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$collectAsMap$1.apply(PairRDDFunctions.scala:686)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$collectAsMap$1.apply(PairRDDFunctions.scala:685)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
    at org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:685)
    at org.apache.spark.mllib.tree.DecisionTree$.findBestSplits(DecisionTree.scala:654)
    at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:235)
    at org.apache.spark.mllib.tree.RandomForest$.trainClassifier(RandomForest.scala:291)
    at org.apache.spark.mllib.api.python.PythonMLLibAPI.trainRandomForestModel(PythonMLLibAPI.scala:742)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: DecisionTree given invalid data: Feature 517 is categorical with values in {0,...,16, but a data point gives it value 48940.0.
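The exception says that feature index 517 was declared categorical with arity 17 (values 0..16), but the data point carries the value 48940.0 there, i.e. a continuous column ended up at an index declared categorical. A small sanity check over sparse (index, value) pairs can locate such mismatches before training; this is a sketch in plain Python, not a Spark API:

```python
def check_point(indices, values, categorical_info):
    """Return (feature_index, value) pairs that violate the declared arity.

    A categorical feature with arity k must hold an integer in [0, k).
    """
    bad = []
    for i, v in zip(indices, values):
        arity = categorical_info.get(i)
        if arity is not None and not (v == int(v) and 0 <= v < arity):
            bad.append((i, v))
    return bad
```

Running this over the RDD of LabeledPoints (e.g. via a `filter` on each point's sparse indices/values) before calling `trainClassifier` would flag the bad rows, or reveal that the indices in `categoricalFeaturesInfo` no longer match the assembled feature order.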

0 Answers