Question

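When it encounters a capitalized word, PunktSentenceTokenizer seems to simply ignore existing abbreviations/collocations. I wrote a small example to demonstrate the behavior; in real life I have much larger documents and have already tried training, etc.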

import nltk

# Load the pretrained English Punkt model.
punkt_tk = nltk.data.load('tokenizers/punkt/english.pickle')

# Register the abbreviations and collocations that should not cause a break.
punkt_tk._params.abbrev_types.add('p.o')
punkt_tk._params.abbrev_types.add('p. o')
punkt_tk._params.abbrev_types.add('o')
punkt_tk._params.collocations.add(('p.o.','box'))
punkt_tk._params.collocations.add(('p. o.','box'))
punkt_tk._params.collocations.add(('o.','box'))

txt = 'its registered office address at P.O. Box 111 and having its registered office address at P. O. Box: 222'

# Print Punkt's reasoning for every potential sentence boundary.
for decision in punkt_tk.debug_decisions(txt):
    print(nltk.tokenize.punkt.format_debug_decision(decision))

The result is:

Text: 'P.O. Box' (at offset 36)
Sentence break? True (abbreviation + orthographic heuristic)
Collocation? True
'p.o.':
    known abbreviation: True
    is initial: False
'box':
    known sentence starter: False
    orthographic heuristic suggests is a sentence starter? True
    orthographic contexts in training: {'UNK-LC', 'MID-LC'}

Text: 'P. O.' (at offset 91)
Sentence break? False (initial + special orthographic heuristic)
Collocation? False
'p.':
    known abbreviation: True
    is initial: True
'o.':
    known sentence starter: False
    orthographic heuristic suggests is a sentence starter? unknown
    orthographic contexts in training: set()

Text: 'O. Box:' (at offset 94)
Sentence break? None (default decision)
Collocation? True
'o.':
    known abbreviation: True
    is initial: True
'box':
    known sentence starter: False
    orthographic heuristic suggests is a sentence starter? True
    orthographic contexts in training: {'UNK-LC', 'MID-LC'}

If I change "Box" to lowercase, the unwanted sentence break does not occur.

Is there a way to make nltk.PunktSentenceTokenizer stop ignoring collocations/abbreviations in favor of the orthographic heuristic?
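As a stopgap that does not depend on Punkt's internals, the unwanted split could be merged back in a post-processing step. The following is only a minimal sketch under that assumption: `merge_false_breaks` and `PROTECTED` are hypothetical names that are not part of NLTK, and the input list is written by hand to match the break reported after 'P.O.' above.

```python
import re

# Hypothetical (abbreviation, following word) pairs that must never be
# separated by a sentence break. This is not an NLTK data structure.
PROTECTED = {("p.o.", "box"), ("o.", "box")}


def merge_false_breaks(sentences, protected=PROTECTED):
    """Rejoin adjacent sentences when the split falls inside a protected pair."""
    merged = []
    for sent in sentences:
        if merged:
            prev_words = merged[-1].split()
            next_words = sent.split()
            if prev_words and next_words:
                last = prev_words[-1].lower()
                # Drop trailing punctuation such as ':' from the next word.
                first = re.sub(r"\W+$", "", next_words[0].lower())
                if (last, first) in protected:
                    # Glue the pieces back together with a single space.
                    merged[-1] = merged[-1] + " " + sent
                    continue
        merged.append(sent)
    return merged


# Hand-written stand-in for the tokenizer output with the break after 'P.O.'.
parts = [
    "its registered office address at P.O.",
    "Box 111 and having its registered office address at P. O. Box: 222",
]
print(merge_false_breaks(parts))
```

This joins the pieces with a single space, so any original whitespace at the split point is not preserved. It also only treats the symptom and does not explain why the collocation is not honored.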
