Question 1: Token offsets

Question

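I have a spaCy document and an arbitrary character offset n into that document. How do I find the first token boundary following that offset, i.e. the smallest m ≥ n such that m is the start of a token?

Is there some way to do this using the spaCy interface, other than looping through all the tokens?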

Answer 1

Question 1: Token offsets

How do I find the first token boundary following that offset...

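Doc, Span and Token objects in spaCy all have a .text attribute, so both tokens and documents can be related back to the original text.

In addition, spaCy provides two attributes that give a token's position:

i: the index of the token in the document's list of tokens
idx: the character offset of the token in the document's .text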
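
To make the difference between the two attributes concrete, here is a small sketch. It does not call spaCy: it assumes a plain whitespace split as a stand-in for the tokenizer (for this sentence the two should agree), so the names `text` and `pairs` are illustrative only.

```python
import re

text = "here is a document with tokens in it"

# Stand-in for spaCy tokenization (assumption: whitespace split only).
# With a real pipeline the same triples would be
# [(t.i, t.idx, t.text) for t in nlp(text)].
pairs = [
    (i, match.start(), match.group())
    for i, match in enumerate(re.finditer(r"\S+", text))
]

for i, idx, word in pairs:
    # i counts tokens; idx counts characters from the start of the text
    print(i, idx, word)
```

The fourth token, "document", has i = 3 but idx = 10, because ten characters precede it in the text.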

So, for your example, I think you just need the following:

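>>> import spacy
>>> nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline; any English pipeline works
>>> n = 10
>>> doc = nlp("here is a document with tokens in it")
>>> for token in doc:
...     if token.idx >= n:  # >= so that a token starting exactly at n counts
...         m = token.idx
...         break
...
>>> m
10
>>> doc.text[m]
'd'
>>> token.i
3
>>> token
document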

Question 2: Lookup without looping

Is there some way to do this ... other than looping ...

Sadly, I don't believe there is any other interface at the Doc level that allows finding a token by character offset.
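
If many offsets need to be looked up in the same document, one possible workaround outside spaCy's own interface is to collect the token start offsets once and binary-search them with Python's standard `bisect` module. This is only a sketch: the helper name `first_start_at_or_after` is hypothetical, and the hard-coded `starts` list stands in for `[token.idx for token in doc]`, which still iterates over the tokens one time.

```python
from bisect import bisect_left


def first_start_at_or_after(starts, n):
    """Return the smallest token start m with m >= n, or None if there is none.

    `starts` must be sorted in ascending order, which holds for token.idx
    values collected in document order.
    """
    pos = bisect_left(starts, n)  # index of the first start that is >= n
    return starts[pos] if pos < len(starts) else None


# Hypothetical data: start offsets for "here is a document with tokens in it".
# With spaCy these would come from: starts = [token.idx for token in doc]
starts = [0, 5, 8, 10, 19, 24, 31, 34]

print(first_start_at_or_after(starts, 10))  # 10 (a token starts exactly at n)
print(first_start_at_or_after(starts, 11))  # 19
print(first_start_at_or_after(starts, 35))  # None (no token starts at or after n)
```

Because the list is built in document order, the position returned by `bisect_left` is also the token's index, so `doc[pos]` would give the token itself. Each lookup after the initial pass takes O(log k) time for k tokens.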
