I am using Spark 2.1.0 and spaCy 2.0.9. According to the spaCy documentation, the latest release means that "spaCy v2 now fully supports the Pickle protocol, making it easy to use spaCy with Apache Spark." However, I am still getting an error.
Here is my code:
from __future__ import unicode_literals
from pyspark.sql.functions import col
from pyspark.sql.types import StringType
import spacy
nlp = spacy.blank('en')
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
spark = SparkSession.builder.appName("spark-spacy").getOrCreate()
print(spark)
docs = spark.read.option("delimiter", "\t").csv('./data.txt').toDF("id", "features")
print(docs.count())
def spacy_processed(current_sentence):
    return nlp(current_sentence)
spacy_processed_udf = udf(spacy_processed, StringType())
spacy_processed_docs = docs.withColumn("spacy_processed_features",
spacy_processed_udf("features"))
spacy_processed_docs.show(10)
spark.stop()
And here is the error message:
> Traceback (most recent call last): File
> "/path/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/cloudpickle.py",
> line 107, in dump
> return Pickler.dump(self, obj)
>
> File "/path/anaconda/lib/python2.7/pickle.py", line 224, in dump
> self.save(obj)
>
> .............
>
> File
> "/path/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/cloudpickle.py",
> line 206, in save_function
> if islambda(obj) or obj.__code__.co_filename == '<stdin>' or themodule is
>
> None: AttributeError: 'builtin_function_or_method' object has no
> attribute '__code__'
>
> Traceback (most recent call last): File
> "/path/scratch/spacy-sparktest.py", line 24, in <module>
> spacy_processed_udf = udf(spacy_processed, StringType())
>
> pickle.PicklingError: Could not serialize object: AttributeError:
> 'builtin_function_or_method' object has no attribute '__code__'