Question

I have a very simple list comprehension that I would like to parallelize:

nlp = spacy.load(model)
texts = sorted(X['text'])
# TODO: Parallelize
docs = [nlp(text) for text in texts]

However, when I try to use Pool from the multiprocessing module, like this:

docs = Pool().map(nlp, texts)

it gives me the following error:

Traceback (most recent call last):
  File "main.py", line 117, in <module>
    main()
  File "main.py", line 99, in main
    docs = parse_docs(X)
  File "main.py", line 81, in parse_docs
    docs = Pool().map(nlp, texts)
  File "C:\Users\james\AppData\Local\Programs\Python\Python36-32\lib\multiprocessing\pool.py", line 260, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:\Users\james\AppData\Local\Programs\Python\Python36-32\lib\multiprocessing\pool.py", line 608, in get
    raise self._value
  File "C:\Users\james\AppData\Local\Programs\Python\Python36-32\lib\multiprocessing\pool.py", line 385, in _handle_tasks
    put(task)
  File "C:\Users\james\AppData\Local\Programs\Python\Python36-32\lib\multiprocessing\connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "C:\Users\james\AppData\Local\Programs\Python\Python36-32\lib\multiprocessing\reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'FeatureExtracter.<locals>.feature_extracter_fwd'

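Is it possible to run this in parallel without having to make the object picklable? I'm open to examples that rely on third-party libraries such as joblib.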

Edit: I also tried

docs = Pool().map(nlp.__call__, texts)

which does not work either.

Answer 1

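Most likely not. You are probably trying to share something that is not safe to share between processes, for example something that holds an open file descriptor. There's some discussion here about why it is not picklable, and they say, somewhat vaguely, that it is for reasons like this. Why not load nlp separately in each process?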

There is more here; this also seems to be a general issue with spaCy that they are working on: https://github.com/explosion/spaCy/issues/1045
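For what it's worth, the AttributeError at the bottom of the traceback can be reproduced without spaCy at all. The following is only a minimal sketch using the standard library; the function names are made up for illustration and are not spaCy's. pickle stores a function by its qualified name, so a function defined inside another function (like 'FeatureExtracter.<locals>.feature_extracter_fwd') cannot be looked up again and fails to pickle.

```python
import pickle


def make_local_function():
    # inner() is defined inside another function, much like
    # feature_extracter_fwd in the traceback (names here are illustrative).
    def inner(x):
        return x
    return inner


def top_level(x):
    # An ordinary module-level function, for comparison.
    return x


def try_pickle(obj):
    """Return None if obj pickles, otherwise the exception that was raised."""
    try:
        pickle.dumps(obj)
        return None
    except (AttributeError, pickle.PicklingError) as exc:
        return exc


if __name__ == "__main__":
    print("top-level function:", try_pickle(top_level))
    print("local function:", try_pickle(make_local_function()))
```

Pool().map has to pickle whatever it sends to the workers, which is why passing nlp (or nlp.__call__) runs into the same error.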

Answer 2

Here is a workaround: load the model inside each worker process, so that nlp itself never has to be pickled.

import multiprocessing as mp
import spacy

texts = ["Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season.",
        "The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24\u201310 to earn their third Super Bowl title.",
        "The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California.",
        "As this was the 50th Super Bowl, the league emphasized the"]

def init():
    # Runs once in every worker process, so each worker loads its own model.
    global nlp
    nlp = spacy.load('en')

def func(text):
    # Uses the worker-local nlp created by init(); only the text is sent over.
    return nlp(text)

# The guard is needed on Windows, where worker processes re-import this module.
if __name__ == '__main__':
    with mp.Pool(initializer=init) as pool:
        docs = pool.map(func, texts)

Output

for doc in docs:
    print(list(w.text for w in doc))

['Super', 'Bowl', '50', 'was', 'an', 'American', 'football', 'game', 'to', 'determine', 'the', 'champion', 'of', 'the', 'National', 'Football', 'League', '(', 'NFL', ')', 'for', 'the', '2015', 'season', '.']
['The', 'American', 'Football', 'Conference', '(', 'AFC', ')', 'champion', 'Denver', 'Broncos', 'defeated', 'the', 'National', 'Football', 'Conference', '(', 'NFC', ')', 'champion', 'Carolina', 'Panthers', '24–10', 'to', 'earn', 'their', 'third', 'Super', 'Bowl', 'title', '.']
['The', 'game', 'was', 'played', 'on', 'February', '7', ',', '2016', ',', 'at', 'Levi', "'s", 'Stadium', 'in', 'the', 'San', 'Francisco', 'Bay', 'Area', 'at', 'Santa', 'Clara', ',', 'California', '.']
['As', 'this', 'was', 'the', '50th', 'Super', 'Bowl', ',', 'the', 'league', 'emphasized', 'the']
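If only the token strings are needed, as in the output above, one variation is to return plain lists from the workers rather than whole documents. This is just a sketch under stated assumptions: load_model is a hypothetical stand-in that returns a whitespace tokenizer so the example runs without spaCy, and init_worker and tokenize are illustrative names. With spaCy, load_model would call spacy.load(...) and tokenize would return [w.text for w in nlp(text)].

```python
import multiprocessing as mp

_model = None  # worker-local; set by init_worker() in each process


def load_model():
    # Hypothetical stand-in for spacy.load(...): a plain whitespace
    # tokenizer, so this sketch runs without spaCy installed.
    return str.split


def init_worker():
    # Runs once per worker process; the model is created locally and
    # never has to be pickled.
    global _model
    _model = load_model()


def tokenize(text):
    # Return a list of strings, which is cheap to send back to the parent.
    return list(_model(text))


if __name__ == "__main__":
    texts = ["Super Bowl 50 was an American football game",
             "The game was played on February 7, 2016"]
    with mp.Pool(processes=2, initializer=init_worker) as pool:
        for tokens in pool.map(tokenize, texts):
            print(tokens)
```

The design is the same as in the workaround above: the model lives in a per-process global, and only the input text and the result cross the process boundary.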

Python parallel computation without pickling
