Using a class method on an RDD

Time: 2019-07-01 10:09:06

Tags: python class apache-spark pyspark rdd

My question sounds a bit similar to this and this, but the solutions to those questions did not help me either.
I have a Tokenizer class defined as -

class Tokenizer:
    def __init__(self, preserve_case=False):
        self.preserve_case = preserve_case

    def tokenize(self, s):
        """
        Argument: s -- any string or unicode object
        Value: a tokenized list of strings; concatenating this list returns the original string if preserve_case=False
        """
        # Try to ensure unicode:
        try:
            s = str(s)
        except UnicodeDecodeError:
            s = s.encode('string_escape')
            s = str(s)
        # Fix HTML character entities:
        s = self.__html2unicode(s)
        # Tokenize (word_re is a compiled regex assumed to be defined at module level):
        words = word_re.findall(s)
        # Possibly alter the case, but avoid changing emoticons like :D into :d
        # (emoticon_re is another compiled regex assumed to be defined at module level):
        if not self.preserve_case:
            words = map((lambda x: x if emoticon_re.search(x) else x.lower()), words)
        return words

tok = Tokenizer(preserve_case=False)

My (key, value) RDD has the form (user_id, tweet). I want to apply the tokenize method of the Tokenizer class to the tweets in the RDD. What I did is -

rdd.foreach(lambda x:tok.tokenize(x[1])).take(5)  

It gives the error -

'NoneType' object has no attribute 'take'

I have also tried -

rdd1.map(lambda x:tok.tokenize(x[1])).take(5)  

It gives the error -

Py4JJavaError                             Traceback (most recent call last)
----> 1 rdd1.map(lambda x:tok.tokenize(x[1])).take(5)

~/anaconda3/lib/python3.6/site-packages/pyspark/rdd.py in take(self, num)
   1358
   1359                 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1360                 res = self.context.runJob(self, takeUpToNumLeft, p)
   1361
   1362                 items += res

~/anaconda3/lib/python3.6/site-packages/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
   1067         # SparkContext#runJob.
   1068         mappedRDD = rdd.mapPartitions(partitionFunc)
-> 1069         sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
   1070         return list(_load_from_socket(sock_info, mappedRDD._jrdd_deserializer))
   1071

~/anaconda3/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258
   1259         for temp_arg in temp_args:

~/anaconda3/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 39.0 failed 1 times, most recent failure: Lost task 0.0 in stage 39.0 (TID 101, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/kriti/Downloads/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/home/kriti/Downloads/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/kriti/Downloads/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 397, in dump_stream
    bytes = self.serializer.dumps(vs)
  File "/home/kriti/Downloads/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 576, in dumps
    return pickle.dumps(obj, protocol)
AttributeError: Can't pickle local object 'Tokenizer.tokenize..'

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
	at org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153)
	at org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:153)
	at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/kriti/Downloads/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/home/kriti/Downloads/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/kriti/Downloads/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 397, in dump_stream
    bytes = self.serializer.dumps(vs)
  File "/home/kriti/Downloads/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 576, in dumps
    return pickle.dumps(obj, protocol)
AttributeError: Can't pickle local object 'Tokenizer.tokenize..'

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
	at org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153)
	at org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more

Any help would be greatly appreciated. Thanks in advance!

1 answer:

Answer 0: (score: 0)

  

rdd.foreach(lambda x:tok.tokenize(x[1])).take(5)

Here you are trying to use the return value of rdd.foreach(). foreach is an action that runs purely for its side effects and returns None, so there is nothing to call .take(5) on, hence the 'NoneType' error.
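A minimal sketch of what happens (assuming rdd and tok are defined as in the question):

# foreach only executes the function on each element for its side effects;
# its return value is always None.
result = rdd.foreach(lambda x: tok.tokenize(x[1]))
print(result)        # prints: None
# result.take(5)     # fails: 'NoneType' object has no attribute 'take'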

  

rdd1.map(lambda x:tok.tokenize(x[1])).take(5)

Here you are using your custom object inside a lambda, which raises the next exception:

  

AttributeError: Can't pickle local object 'Tokenizer.tokenize..'

This essentially means that PySpark cannot serialize the Tokenizer.tokenize method. One possible solution is to call tok.tokenize(x[1]) from a plain function and pass a reference to that function to map, like this:

def tokenize(x):
    # x is a (user_id, tweet) pair, so the tweet text is x[1]
    return tok.tokenize(x[1])

rdd1.map(tokenize).take(5)

There is one more problem in your code: the Tokenizer class calls self.__html2unicode(s), but that method is never defined. This will lead to the following error:

AttributeError: 'Tokenizer' object has no attribute '_Tokenizer__html2unicode'
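If that method was simply dropped when the class was copied, a minimal stand-in that only decodes HTML character entities could look like the following. This is purely illustrative, using the standard-library html module; the original Tokenizer may implement it differently.

import html

class Tokenizer:
    # ... __init__ and tokenize as shown in the question ...

    def __html2unicode(self, s):
        # Decode HTML character entities such as &amp; or &#36; into
        # the corresponding unicode characters.
        return html.unescape(s)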

Related topics

PySpark: PicklingError: Could not serialize object: TypeError: can't pickle CompiledFFI objects

https://github.com/yahoo/TensorFlowOnSpark/issues/198