Question

Apparently, for doc in nlp.pipe(sequence) is much faster than running for el in sequence: doc = nlp(el).

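My problem is that my sequence is really a sequence of tuples. Each tuple contains the text for spaCy to convert into a document, plus additional information that I would like to attach to the spaCy document as document attributes (which I would register on Doc).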

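I am not sure how to modify a spaCy pipeline so that the first stage picks one item from the tuple to run the tokeniser on and get the document, and then some other function uses the remaining items from the tuple to add the features to the existing document.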

Answer 1

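It sounds like you might be looking for the as_tuples argument of nlp.pipe? If you set as_tuples=True, you can pass in a stream of (text, context) tuples and spaCy will yield (doc, context) tuples (instead of just Doc objects). You can then use the context and add it to custom attributes, etc.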

Here's an example:

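import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # assumption: blank English pipeline; any loaded pipeline works
# Register the custom "meta" attribute on Doc before assigning to it
Doc.set_extension("meta", default=None)

data = [
  ("Some text to process", {"meta": "foo"}),
  ("And more text...", {"meta": "bar"})
]

for doc, context in nlp.pipe(data, as_tuples=True):
    # Copy the context value onto the registered "meta" extension
    doc._.meta = context["meta"]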
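
For intuition, the pairing that as_tuples provides can be pictured with plain Python. This is only a sketch and not spaCy's actual implementation: pipe_with_context and fake_pipe are hypothetical names, and fake_pipe merely stands in for nlp.pipe, on the assumption that the pipe yields exactly one result per input text, in order.

```python
from itertools import tee
from typing import Any, Callable, Iterable, Iterator, Tuple


def pipe_with_context(
    pipe: Callable[[Iterable[str]], Iterable[Any]],
    pairs: Iterable[Tuple[str, Any]],
) -> Iterator[Tuple[Any, Any]]:
    """Yield (result, context) pairs, keeping the input order."""
    # Duplicate the stream so texts and contexts can be consumed separately.
    for_texts, for_contexts = tee(pairs)
    texts = (text for text, _ in for_texts)
    contexts = (context for _, context in for_contexts)
    # Assumes `pipe` yields exactly one result per text, in order.
    return zip(pipe(texts), contexts)


def fake_pipe(texts: Iterable[str]) -> Iterator[str]:
    """Hypothetical stand-in for nlp.pipe: upper-cases each text."""
    for text in texts:
        yield text.upper()


data = [
    ("Some text to process", {"meta": "foo"}),
    ("And more text...", {"meta": "bar"}),
]
results = list(pipe_with_context(fake_pipe, data))
```

The texts still go through the pipe as one stream, so batching is preserved, while each context travels alongside and is re-attached to its result afterwards.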
