Error in TF-IDF implementation in Spark MLlib

Date: 2016-02-26 09:33:56

Tags: apache-spark tf-idf apache-spark-mllib

I am trying to perform a TF-IDF transformation on some tweets and then apply Naive Bayes to them. After applying stopword removal and stemming to the tweets, I have the following RDD:

[u'neutro, marc line polit ibex',
 u'neutro, ahor hac mas firm redact jef ibex dad result',
 u'neutro, temblor ml am epicentr sant santand repo endesarroll',
 u'neutro, cambi tiemp santand ciel cubie temperatur',
 u'neutro, sabi pued recobr inversion bon pr perd caus mal asesor ubs santand popul u oriental',
 u'neutro, renunci vital sal crisis',
 u'neutro, ibex sub punt',
 u'neutro, dias acampahd delant bbva manlleu dias luch i gan batall luch continu',
 u'neutro, mas natural repsol seri cobr dividend',
 u'neutro, luch ibex financi carin obedient',
 u'neutro, clav triunf basket cumpl futbol lig bbva via',
 u'neutro, colombi despleg primer red viual comercial lte pais',
 u'neutro, resistent clav bat repsol encontr eur',
 u'neutro, telefon lanz servici vide siet pais latinoamerican inclu',
 u'neutro, result empat gol calderon cierr jorn lig bbva lalig']
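
A side note on the data layout: each line above is a label (always "neutro" in this sample) followed by the stemmed tokens, with ", " in between. For the Naive Bayes step the label presumably has to be split off before hashing; a minimal sketch of that, assuming the separator is consistent (the variable names here are hypothetical):

# Split each line into (label, text) once, on the first ", ".
pairs = trainingSet_cleaned.map(lambda line: line.split(', ', 1))
labels = pairs.map(lambda p: p[0])
# Guard the index: a line without the separator would otherwise raise
# exactly the "IndexError: list index out of range" seen further down.
texts = pairs.map(lambda p: p[1].split(' ') if len(p) > 1 else [])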

But when I apply the following code to that RDD:

from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF

documents = trainingSet_cleaned.map(lambda line: line.split(' '))
hashingTF = HashingTF()
tf = hashingTF.transform(documents)
tf.cache()
idf = IDF(minDocFreq=2).fit(tf)
#tfidf = idf.transform(tf)

I get the following error:

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-291-69a2d82a8484> in <module>()
      6 tf = hashingTF.transform(documents)
      7 tf.cache()
----> 8 idf = IDF(minDocFreq=2).fit(tf)
      9 #tfidf = idf.transform(tf)
     10 

/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/mllib/feature.py in fit(self, dataset)
    414         if not isinstance(dataset, RDD):
    415             raise TypeError("dataset should be an RDD of term frequency vectors")
--> 416         jmodel = callMLlibFunc("fitIDF", self.minDocFreq, dataset.map(_convert_to_vector))
    417         return IDFModel(jmodel)
    418 

/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/mllib/common.py in callMLlibFunc(name, *args)
    128     sc = SparkContext._active_spark_context
    129     api = getattr(sc._jvm.PythonMLLibAPI(), name)
--> 130     return callJavaFunc(sc, api, *args)
    131 
    132 

/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/pyspark/mllib/common.py in callJavaFunc(sc, func, *args)
    121     """ Call Java Function """
    122     args = [_py2java(sc, a) for a in args]
--> 123     return _java2py(sc, func(*args))
    124 
    125 

/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538                 self.target_id, self.name)
    539 
    540         for temp_arg in temp_args:

/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Py4JJavaError: An error occurred while calling o1296.fitIDF.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 130.0 failed 1 times, most recent failure: Lost task 0.0 in stage 130.0 (TID 130, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "<ipython-input-249-8d34bb5694f4>", line 1, in <lambda>
IndexError: list index out of range

    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:138)
    at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:101)
    at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:97)
    at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
    at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
    at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:249)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
    at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:208)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1294)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1282)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1281)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1281)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1507)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1469)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
    at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1003)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
    at org.apache.spark.rdd.RDD.reduce(RDD.scala:985)
    at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1114)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
    at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1091)
    at org.apache.spark.mllib.feature.IDF.fit(IDF.scala:56)
    at org.apache.spark.mllib.feature.IDF.fit(IDF.scala:69)
    at org.apache.spark.mllib.api.python.PythonMLLibAPI.fitIDF(PythonMLLibAPI.scala:602)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "<ipython-input-249-8d34bb5694f4>", line 1, in <lambda>
IndexError: list index out of range

    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:138)
    at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:101)
    at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:97)
    at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
    at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
    at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:249)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
    at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:208)

I don't know what I am doing wrong, since I applied the functions literally as explained in the Spark API documentation.

Any clues?
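
One detail worth noting in the traceback: the worker-side IndexError is raised in <ipython-input-249-8d34bb5694f4>, i.e. in a lambda defined in an earlier notebook cell, not in the cell shown above (input 291). Since RDD transformations are lazy, IDF.fit() is merely the first action that forces the whole upstream lineage to execute, so the failing list indexing most likely sits in an earlier cleaning step. A quick, hedged way to narrow it down would be to force each stage separately:

# Whichever of these actions fails first points at the transformation
# that actually raises the IndexError (count() triggers full evaluation).
trainingSet_cleaned.count()
documents.count()
tf.count()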

0 answers:

No answers yet.