Question

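I am trying to get the start/end token indices of a span given its character offsets.

I have searched some posts and solutions (#1264), but they do not work in my case. I have spent a lot of time on this, and I think an answer would help others too!

My dataset has the following format: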

Text, Pronoun_offset, Pronoun, A_offset, A, B_offset, B

A_offset is the character offset in Text at which span A starts.
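To make the offsets concrete: assuming they are 0-based character positions into Text, span A covers the half-open character range [A_offset, A_offset + len(A)). A minimal sketch with a made-up row (the helper name `char_range` and the sample row are mine, not part of the dataset):

```python
def char_range(text, span_text, offset):
    """Return the (start, end) character range of a span, end exclusive."""
    start, end = offset, offset + len(span_text)
    # Sanity check: the offset must really point at the span text.
    if text[start:end] != span_text:
        raise ValueError(f"offset {offset} does not point at {span_text!r}")
    return start, end


# Made-up row in the same column layout as the dataset.
row = {
    "Text": "Melinda met Delia- and she left.",
    "A": "Melinda", "A_offset": 0,
    "B": "Delia", "B_offset": 12,
}
print(char_range(row["Text"], row["A"], row["A_offset"]))
print(char_range(row["Text"], row["B"], row["B_offset"]))
```

Checking every row this way before tokenizing would at least rule out offset errors in the data itself.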

Now, after tokenizing Text, I want to get the start and end token indices of span A (or B, or the pronoun) in the document.
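A minimal sketch of the mapping I am after, assuming each token's start and end character offsets are known. With spaCy these would be `token.idx` and `token.idx + len(token)`; here a whitespace regex stands in for the tokenizer so that the snippet runs without any model. The function name `char_span_to_token_span` is hypothetical. Instead of requiring a token to start exactly at the offset, it returns every token that overlaps the character range:

```python
import re
from bisect import bisect_right


def char_span_to_token_span(token_starts, token_ends, start_char, end_char):
    """Map the character range [start_char, end_char) to inclusive token indices.

    token_starts / token_ends hold per-token character offsets (end exclusive)
    in document order. Every token overlapping the range is included.
    """
    if not token_starts:
        return None
    # Last token that starts at or before start_char.
    first = max(bisect_right(token_starts, start_char) - 1, 0)
    # If that token is already over before the range begins, move to the next one.
    if token_ends[first] <= start_char:
        first += 1
    # Last token that starts before end_char.
    last = bisect_right(token_starts, end_char - 1) - 1
    if first >= len(token_starts) or last < first:
        return None  # the range covers no token
    return first, last


text = "well with Delia- and she realizes"
# Stand-in tokenizer (an assumption): split on whitespace only.
bounds = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
starts = [s for s, _ in bounds]
ends = [e for _, e in bounds]

offset = text.index("Delia")
print(char_span_to_token_span(starts, ends, offset, offset + len("Delia")))  # (2, 2)
```

The token found here is still `Delia-`, so this does not fix the tokenization, but the lookup no longer fails when the span boundaries fall inside a token.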

First case:

Pronoun                                                         she
Pronoun-offset                                                  272
A                                                           Melinda
A-offset                                                        127
B                                                             Delia
B-offset                                                        261

First, suppose we want the inclusive token indices of span B (Delia). My plan is:

import spacy

text = "A woman comes to Mel's shop to sell antiques from a house she's moving from after the death of her daughter (from an illness). Melinda very quickly discovers that the house is haunted by violent spirits after Ned gets hurt there-which doesn't go down well with Delia- and she realizes that the ghost of the little girl (Cassidy) is being trapped there."

nlp = spacy.load("en_core_web_sm", disable=['vectors', 'textcat', 'tagger', 'parser', 'ner'])

def get_token_span(text, span_text, span_offset):
    doc = nlp(text)
    # Character offset at which each token starts.
    char_offsets = [token.idx for token in doc]
    # Number of tokens in the span when it is tokenized on its own.
    token_len = len(nlp(span_text))
    # Raises ValueError if no token starts exactly at span_offset.
    start_idx = char_offsets.index(span_offset)
    end_idx = start_idx + token_len - 1  # inclusive end index
    return (start_idx, end_idx)

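The returned span gives me Delia- rather than Delia. So I then did this:

import re
from spacy.tokenizer import Tokenizer

suffix_re = re.compile(r'''-+$''')
nlp.tokenizer = Tokenizer(nlp.vocab, suffix_search=suffix_re.search)

But then the tokenizer can no longer split Pauline's into Pauline and 's in this sentence: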

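Zoe Telford -- played the police officer girlfriend of Simon, Maggie. Dumped by Simon in the final episode of series 1, after he slept with Jenny, and is not seen again. Phoebe Thomas played Cheryl Cassidy, Pauline's friend and also a year 11 pupil in Simon's class. Dumped her boyfriend following Simon's advice after he wouldn't have sex with her but later realised this was due to him catching crabs off her friend Pauline.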

Pronoun                                                         her
Pronoun-offset                                                  274
A                                                    Cheryl Cassidy
A-offset                                                        191
B                                                           Pauline
B-offset                                                        207
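
To see why the custom tokenizer behaves this way, here is a quick check of what the suffix pattern matches, using only `re` (the helper `trailing_suffix` is a name I made up). As far as I understand, a Tokenizer built with only suffix_search has no prefix, infix or exception rules left, so anything this pattern does not match stays attached to the word:

```python
import re

suffix_re = re.compile(r'''-+$''')


def trailing_suffix(word):
    """Return the suffix the pattern would strip from word, or None."""
    match = suffix_re.search(word)
    return match.group() if match else None


for word in ["Delia-", "truthful--", "Pauline's", "illness)."]:
    print(repr(word), "->", repr(trailing_suffix(word)))
```

Only trailing hyphens are matched, which would explain why `Delia-` is now split while `Pauline's` is not.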

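Second case: inconsistent splitting

I then thought of another approach: insert markers into the text string before and after each span. A marker looks like @$A$@; note that there is a space at both the start and the end of the marker.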

Then I find the markers for each span and compute the span's token indices from them.
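
A rough sketch of this marker idea, assuming non-overlapping spans and using `str.split()` as a stand-in for the real tokenizer (both function names are hypothetical). Markers are inserted from the rightmost span backwards so the earlier offsets stay valid, and the returned indices refer to the token list with the markers removed:

```python
import re

MARKER_RE = re.compile(r"^@\$(\w+)\$@$")


def insert_markers(text, spans):
    """Wrap each (name, offset, span_text) span in ' @$NAME$@ ' markers."""
    # Work from the rightmost span so earlier offsets are not shifted.
    for name, offset, span_text in sorted(spans, key=lambda s: s[1], reverse=True):
        end = offset + len(span_text)
        if text[offset:end] != span_text:
            raise ValueError(f"bad offset for span {name}")
        marker = f" @${name}$@ "
        text = text[:offset] + marker + span_text + marker + text[end:]
    return text


def token_spans_from_markers(tokens):
    """Return {name: (start, end)}, inclusive, indexed in the marker-free token list."""
    spans, opened, clean_index = {}, {}, 0
    for tok in tokens:
        match = MARKER_RE.match(tok)
        if match is None:
            clean_index += 1  # a real token, not a marker
            continue
        name = match.group(1)
        if name not in opened:
            opened[name] = clean_index  # index of the first token inside the span
        else:
            spans[name] = (opened[name], clean_index - 1)
    return spans


text = "Streep isn't playing Julia Child here"
marked = insert_markers(text, [("A", 0, "Streep"), ("B", 21, "Julia Child")])
tokens = marked.split()  # stand-in tokenizer (an assumption)
print(token_spans_from_markers(tokens))  # {'A': (0, 0), 'B': (3, 4)}
```

This only works if the tokenizer splits the surrounding text the same way with and without the markers, which is exactly what goes wrong below.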

For a sentence like this:

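Reviewer Peter Travers wrote in Rolling Stone that ``Meryl Streep--at her brilliant, beguiling best--is the spice that does the trick for the yummy Julie & Julia.'' Streep isn't playing Julia Child here, but something both more elusive and more truthful--she's playing our idea of Julia Child.''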

Pronoun                                                         she
Pronoun-offset                                                  305
A                                                            Streep
A-offset                                                        215
B                                                       Julia Child
B-offset                                                        236

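When I do not insert the markers, truthful-- is tokenized as truthful and --, which is exactly what I want. After inserting the markers, truthful-- is tokenized as truthful instead. Since I rely on the token indices from the tokenization result, this keeps me from getting the correct token indices.

What do you think about this? Any help is much appreciated!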

Environment

spaCy version      2.0.12
Platform           Linux-4.4.0-31-generic-x86_64-with-debian-stretch-sid
Python version     3.7.2
Models             en

How can I use spaCy to get the start/end token indices of a span given its character offsets?
