Sentence tokenization in spaCy is bad (?)

Date: 2017-12-13 09:03:26

Tags: python-2.7 nltk spacy

Why does spaCy's sentence splitter/tokenizer work so poorly? nltk seems to work fine. Here is my small experiment:

import spacy
nlp = spacy.load('fr')
import nltk

text_fr = u"Je suis parti a la boulangerie. J'ai achete trois croissants. C'etait super bon."


nltk.sent_tokenize(text_fr)
# [u'Je suis parti a la boulangerie.',
# u"J'ai achete trois croissants.",
# u"C'etait super bon."


doc = nlp(text_fr)
for s in doc.sents: print s
# Je suis parti
# a la boulangerie. J'ai
# achete trois croissants. C'
# etait super bon.
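
For reference, nltk.sent_tokenize relies on NLTK's pre-trained Punkt models, which may need to be downloaded once, and it also accepts an optional language argument; a minimal sketch of that setup (not shown in the original question):

import nltk

# Punkt is the pre-trained sentence tokenizer that sent_tokenize uses;
# download it once if it is not already available locally.
nltk.download('punkt')

# The language argument selects the matching Punkt model, e.g. French here.
nltk.sent_tokenize(u"Je suis parti a la boulangerie. J'ai achete trois croissants.",
                   language='french')
# [u'Je suis parti a la boulangerie.', u"J'ai achete trois croissants."]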

I noticed the same behavior with English. For this text:

text = u"I went to the library. I did not know what book to buy, but then the lady working there helped me. It was cool. I discovered a lot of new things."

with spaCy (after nlp = spacy.load('en')) I get:

I
went to the library. I
did not know what book to buy, but
then the lady working there helped me. It was cool. I discovered a
lot of new things.

With nltk it looks fine:

[u'I went to the library.',
 u'I did not know what book to buy, but then the lady working there helped me.',
 u'It was cool.',
 u'I discovered a lot of new things.']

1 Answer:

Answer 0 (score: 1)

I don't know how this happened, but it turns out I was using an old version of spaCy (v0.100). I reinstalled the latest spaCy (v2.0.4), and now the sentence splitting is much more coherent.
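
To confirm which version is actually installed and re-run the French example on spaCy v2, something like the following should work (a minimal sketch, assuming spaCy was upgraded with pip install -U spacy and the v2 French model was installed with python -m spacy download fr):

import spacy

# Check the installed version; old 0.x releases segmented sentences
# very differently from the v2 dependency-parse-based segmenter.
print(spacy.__version__)  # e.g. '2.0.4'

nlp = spacy.load('fr')  # assumes the French model shortcut is installed
doc = nlp(u"Je suis parti a la boulangerie. J'ai achete trois croissants. C'etait super bon.")
for s in doc.sents:
    print(s.text)
# Expected on v2: one sentence per line
# Je suis parti a la boulangerie.
# J'ai achete trois croissants.
# C'etait super bon.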