Spacy 2.0 Spark integration: pickle.PicklingError: Could not serialize object

Date: 2018-03-20 14:20:39

Tags: apache-spark pyspark nlp spacy

I am using Spark 2.1.0 and spaCy 2.0.9. According to the spaCy documentation, in the latest version "spaCy v2 now fully supports the Pickle protocol, making it easy to use spaCy with Apache Spark." However, I still get an error.
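As a side note, that pickling claim can be checked in isolation, outside of Spark. A minimal sketch (not part of the original job) using only the standard pickle module:

import pickle
import spacy

# A blank spaCy v2 pipeline should round-trip through plain pickle.
nlp = spacy.blank('en')
restored = pickle.loads(pickle.dumps(nlp))
print(restored(u'hello world'))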

Here is my code:

from __future__ import unicode_literals
from pyspark.sql.functions import col
from pyspark.sql.types import StringType
import spacy

nlp = spacy.blank('en')

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.appName("spark-spacy").getOrCreate()

print(spark)
docs = spark.read.option("delimiter", "\t").csv('./data.txt').toDF("id", "features")

print(docs.count())

def spacy_processed(current_sentence):
    return nlp(current_sentence)

spacy_processed_udf = udf(spacy_processed, StringType())
spacy_processed_docs = docs.withColumn(
    "spacy_processed_features", spacy_processed_udf("features"))

spacy_processed_docs.show(10)

spark.stop()
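Based on the traceback below, the failure happens at the udf(...) call itself, before any data is processed, because Spark serializes the Python function (and whatever it captures) with its bundled cloudpickle. A hypothetical way to reproduce just that step in isolation (assuming pyspark's bundled cloudpickle module, as seen in the traceback) might be:

from pyspark import cloudpickle

# udf() pickles the Python function with cloudpickle; doing the same by hand
# should exercise the same serialization path without running a Spark job.
cloudpickle.dumps(spacy_processed)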

Here is the error message:

> Traceback (most recent call last):
>   File "/path/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 107, in dump
>     return Pickler.dump(self, obj)
>   File "/path/anaconda/lib/python2.7/pickle.py", line 224, in dump
>     self.save(obj)
> .............
>   File "/path/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 206, in save_function
>     if islambda(obj) or obj.__code__.co_filename == '<stdin>' or themodule is None:
> AttributeError: 'builtin_function_or_method' object has no attribute '__code__'
>
> Traceback (most recent call last):
>   File "/path/scratch/spacy-sparktest.py", line 24, in <module>
>     spacy_processed_udf = udf(spacy_processed, StringType())
> pickle.PicklingError: Could not serialize object: AttributeError: 'builtin_function_or_method' object has no attribute '__code__'
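
For reference, a pattern commonly suggested for Spark UDFs (only a sketch, not verified against this exact error) is to avoid capturing the module-level nlp object in the UDF's closure: load the model lazily on each executor instead, and return a plain string to match the declared StringType.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

_nlp = None  # populated once per executor process

def spacy_processed(current_sentence):
    # Load spaCy lazily inside the UDF so that only this small wrapper
    # function is pickled, not the nlp object itself.
    global _nlp
    if _nlp is None:
        import spacy
        _nlp = spacy.blank('en')
    # Return a string, since the UDF is declared with StringType;
    # joining the token texts is just an illustrative choice.
    return ' '.join(token.text for token in _nlp(current_sentence))

spacy_processed_udf = udf(spacy_processed, StringType())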

0 Answers:

There are no answers yet.