I have two datasets of texts. Each text comes with questions about it; in the first dataset every question has an answer, while the second one sometimes contains questions without answers. For each question, I try to find the root of the question using dependency parsing with SpaCy's en_nlp:
For example, 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'
After parsing:
>>> [to_nltk_tree(sent.root).pretty_print() for sent in en_nlp(predicted.loc[0, "question"]).sents]
appear
__________________|____________________________
| | | | | | in
| | | | | | |
| | | To Mary in France
| | | | ___|_____ | |
did allegedly ? whom the Virgin 1858 Lourdes
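For reference, `to_nltk_tree` is not a SpaCy built-in; a common helper for this (a sketch of the usual definition, assuming `nltk` is installed) recursively converts a SpaCy token and its dependency children into an `nltk.Tree` so it can be pretty-printed:

```python
from nltk import Tree

def to_nltk_tree(node):
    """Convert a SpaCy token (anything with .orth_, .n_lefts,
    .n_rights and .children) into an nltk.Tree, recursively."""
    if node.n_lefts + node.n_rights > 0:
        # Token has dependents: make it an inner node with its children below.
        return Tree(node.orth_, [to_nltk_tree(child) for child in node.children])
    # Token has no dependents: it becomes a leaf.
    return node.orth_
```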
Then I try to get the roots from the text:
for sent in doc.sents:
    # Stem the head of each noun chunk's root with the stemmer st
    roots = [st.stem(chunk.root.head.text.lower()) for chunk in sent.noun_chunks]
    print(roots)
['has', 'has']
['atop', 'is', 'of']
['in', 'of', 'fac', 'is', 'of', 'with', 'with', 'legend']
['to', 'is', 'of']
['behind', 'is', 'grotto', 'of', 'pray']
['is', 'is', 'of', 'at', 'lourd', 'appear', 'to']
['at', 'of', 'in', 'through', 'statu', 'is', 'of']
Finally, I try to find an answer by matching the roots of one against the other.
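The matching step itself is not shown; a minimal sketch of the idea (pure Python, hypothetical names; the real code would presumably stem the question root with the same stemmer) is to pick the sentence whose stemmed root list contains the stemmed root of the question:

```python
def best_sentence(question_root, sentence_roots):
    """Return the index of the first sentence whose stemmed roots
    contain the question's stemmed root, or None if none matches."""
    for i, roots in enumerate(sentence_roots):
        if question_root in roots:
            return i
    return None

# Roots per sentence, as printed above (abbreviated).
sentence_roots = [
    ['has', 'has'],
    ['atop', 'is', 'of'],
    ['is', 'is', 'of', 'at', 'lourd', 'appear', 'to'],
]

best_sentence('appear', sentence_roots)  # -> 2, the Lourdes sentence
```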
As you can see from this attempt, the idea reaches about 40% accuracy on the first dataset, but accuracy drops to 29% on the new dataset.
Why is the accuracy of root matching based on dependency parsing so low?