Question

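I am using the sentencizer from spaCy to split a document into sentences. The default delimiters in the sentencizer are ('.', '!', '?'). However, if I give it a sentence like this:

"A fawn was racing in the forest!He was ahead of the rabbit?He was ahead of the elephant."

it is not split into 3 sentences.

I tried this: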

import spacy

sen = "A fawn was racing in the forest!He was ahead of the rabbit?He was ahead of the elephant."
nlp = spacy.load('en')
nlp.add_pipe(nlp.create_pipe('sentencizer'), first=True)
doc = nlp(sen)
sentences = [sent.string.strip() for sent in doc.sents]

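But it does not split at the ! and ? characters.

Expected output for the input:

"A fawn was racing in the forest!He was ahead of the rabbit?He was ahead of the elephant."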

"A fawn was racing in the forest!"

"He was ahead of the rabbit?"

"He was ahead of the elephant."

Can anyone help?

Thanks.

Answer 1

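I ran into a similar problem with sentences that have no space after a period that follows a quotation, for example:

He told me "go somewhere else".But I didn't want to go.

The solution is in this documentation: Customizing spaCy's Tokenizer class

I got further inspiration from Adding Custom Tokenization Rules to spaCy

Here is a rule that works, basically copied from Adding Custom Tokenization Rules to spaCy: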

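As a rough illustration of what an infix rule of this kind matches, here is a minimal sketch that uses only Python's `re` module rather than spaCy itself. The pattern `INFIX` and the helper `split_infixes` are assumptions made for this example, not the rule from the linked post. The idea is that a chunk such as `forest!He` contains no whitespace, so the punctuation has to be split off as its own token before anything can treat it as a sentence boundary.

```python
import re

# Hypothetical infix pattern: '.', '!' or '?' squeezed directly between two letters.
INFIX = re.compile(r"(?<=[A-Za-z])([.!?])(?=[A-Za-z])")


def split_infixes(chunk):
    """Split one whitespace-delimited chunk at each infix match, keeping the punctuation."""
    return [piece for piece in INFIX.split(chunk) if piece]


text = (
    "A fawn was racing in the forest!He was ahead of the rabbit?"
    "He was ahead of the elephant."
)

# Whitespace split first, then the infix split on every chunk.
tokens = [tok for chunk in text.split() for tok in split_infixes(chunk)]
print(tokens)
```

In spaCy, a pattern like this would be added to the tokenizer's infix patterns and compiled with `spacy.util.compile_infix_regex`, so that `!` and `?` become separate tokens. Note that a pattern this broad would also split abbreviations such as `U.S.A`, so it may need narrowing for real text.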
