Question

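I am writing code to extract certain information from text, and I am using spaCy.

The goal is this: if a particular token in the text contains the string "refstart", I want to get the noun chunk that precedes that token. FYI: the tokens containing "refstart" and "refend" are generated with a regex before the nlp object is created with spaCy.

So far I am using the following code: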

import spacy

nlp = spacy.load('en_core_web_sm')
raw_text = (
    'Figure 1 shows a cross-sectional view refstart10,20,30refend of a '
    'refrigerator refstart41,43refend that uses a new cooling technology '
    'refstart10,23a,45refend including a retrofitting pump including '
    'high density fluid refstart10refend.'
)

doc3 = nlp(raw_text)

list_of_references = []
for token in doc3:
    # check whether the token is a reference sign
    # to follow what the loops do, uncomment the prints
    # print('looking for:', token.text)
    if 'refstart' in token.text:
        # print('yes it is in')
        ref_token_text = token.text
        ref_token_position = token.i
        # print('token text:', ref_token_text)
        for chunk in doc3.noun_chunks:
            if chunk.end == ref_token_position:
                # we have a chunk and a reference sign
                list_of_references.append((chunk.text, chunk.start, chunk.end, ref_token_text))
                break

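This works: I get a list of tuples, each containing the noun chunk's text, its start and end, and the text of the token that follows the chunk and contains the string refstart.

The result of this code should be:

a cross-sectional view, refstart10,20,30refend
a refrigerator, refstart41,43refend
a new cooling technology, refstart10,23a,45refend
high density fluid, refstart10refend

Note that "a retrofitting pump" is not part of the list, because it is not followed by a token containing "refstart".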

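But this is very inefficient: the inner loop goes over all noun chunks again for every reference token, which can slow down the data pipeline considerably on large texts.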
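One possible way around the nested loop, as a minimal sketch: index the noun chunks by their `end` position once, then make a single pass over the tokens and look up each reference token's index. This relies on the same assumption as the comparison `chunk.end == ref_token_position` above, namely that a chunk's `end` is the index of the token right after it. The function name `chunks_before_refs` and the `marker` parameter are made up for illustration; the only spaCy attributes used are the ones already in the code above (`doc.noun_chunks`, `chunk.text`, `chunk.start`, `chunk.end`, `token.i`, `token.text`).

```python
def chunks_before_refs(doc, marker="refstart"):
    """Return (chunk_text, chunk_start, chunk_end, ref_text) tuples.

    `doc` is expected to behave like a spaCy Doc: iterable over tokens
    with .i and .text, and exposing .noun_chunks whose items have
    .text, .start and .end. (Hypothetical helper, not a spaCy API.)
    """
    # One pass over the chunks: map "index just past the chunk" -> chunk.
    chunk_by_end = {chunk.end: chunk for chunk in doc.noun_chunks}

    results = []
    # One pass over the tokens: a dictionary lookup per reference token
    # replaces the inner loop over all noun chunks.
    for token in doc:
        if marker in token.text:
            chunk = chunk_by_end.get(token.i)
            if chunk is not None:
                results.append((chunk.text, chunk.start, chunk.end, token.text))
    return results
```

With the document from above this would be called as `list_of_references = chunks_before_refs(doc3)`. Reference tokens that are not directly preceded by a noun chunk are simply skipped.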

Solution 2: I thought about building a list of the reference tokens with their positions and a list of the noun chunks:

# build the list with all the noun chunks and their start and end in the text
list_chunks = []
print("chunks")
for chunk in doc3.noun_chunks:
    list_chunks.append((chunk.text, chunk.start, chunk.end))
    try:
        # chunk.end is exclusive, so doc3[chunk.end] is the token right after the chunk
        print(f'start:{chunk.start},end:{chunk.end} \t \t {chunk.text} \t following text:{doc3[chunk.end]}')
    except IndexError:
        # only here to avoid an error when the last chunk ends the document
        print(f'start:{chunk.start},end:{chunk.end} \t \t {chunk.text} \t following text:last one')

print("refs------------------")
# build the list with all the reference tokens and their positions
list_ref_tokens = []
for token in doc3:
    if 'refstart' in token.text:
        list_ref_tokens.append((token.text, token.i))
        print(token.text, token.i)

But now I have to compare the tuples inside list_chunks and list_ref_tokens, which is also tricky.

Any other suggestions?

Thanks.
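A rough sketch of how that comparison could be done, assuming both lists are in document order (they are built by iterating the document front to back) and that noun chunks do not overlap: walk the two lists in step and keep a chunk whenever its `end` equals the index of a reference token. The function name `match_chunks_to_refs` is hypothetical, and the sample indices below are illustrative only, not actual spaCy output.

```python
def match_chunks_to_refs(list_chunks, list_ref_tokens):
    """Pair each reference token with the noun chunk ending right before it.

    list_chunks:     [(chunk_text, start, end), ...] in document order
    list_ref_tokens: [(token_text, index), ...]      in document order
    """
    matches = []
    c = 0  # position of the next chunk worth looking at
    for ref_text, ref_index in list_ref_tokens:
        # Skip chunks that end before this reference token.
        while c < len(list_chunks) and list_chunks[c][2] < ref_index:
            c += 1
        # Keep the chunk only if it ends exactly at the reference token.
        if c < len(list_chunks) and list_chunks[c][2] == ref_index:
            chunk_text, start, end = list_chunks[c]
            matches.append((chunk_text, start, end, ref_text))
    return matches


# Illustrative data only: made-up indices, not real tokenizer output.
sample_chunks = [("a view", 0, 2), ("a pump", 4, 6), ("fluid", 7, 8)]
sample_refs = [("refstart10refend", 2), ("refstart11refend", 8)]
print(match_chunks_to_refs(sample_chunks, sample_refs))
```

Because neither list is ever rescanned, the work grows with the combined length of the two lists rather than with their product.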

python spaCy intersection of chunks and tokens

0 answers: