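PunktSentenceTokenizer seems to simply ignore existing abbreviations/collocations when the next word is capitalized. I wrote a small example to demonstrate the behavior; in real life I have much larger files and have already tried training, etc.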
import docx
import nltk
import pickle

# Load the pre-trained English Punkt model.
punkt_tk = nltk.data.load('tokenizers/punkt/english.pickle')

# Register the abbreviations I want the tokenizer to know about.
punkt_tk._params.abbrev_types.add('p.o')
punkt_tk._params.abbrev_types.add('p. o')
punkt_tk._params.abbrev_types.add('o')

# Register the collocations that should not be split.
punkt_tk._params.collocations.add(('p.o.','box'))
punkt_tk._params.collocations.add(('p. o.','box'))
punkt_tk._params.collocations.add(('o.','box'))

txt = 'its registered office address at P.O. Box 111 and having its registered office address at P. O. Box: 222'

# Print the tokenizer's reasoning for every potential sentence break.
d = punkt_tk.debug_decisions(txt)
for x in d:
    print(nltk.tokenize.punkt.format_debug_decision(x))
The result is:
Text: 'P.O. Box' (at offset 36)
Sentence break? True (abbreviation + orthographic heuristic)
Collocation? True
'p.o.':
    known abbreviation: True
    is initial: False
'box':
    known sentence starter: False
    orthographic heuristic suggests is a sentence starter? True
    orthographic contexts in training: {'UNK-LC', 'MID-LC'}

Text: 'P. O.' (at offset 91)
Sentence break? False (initial + special orthographic heuristic)
Collocation? False
'p.':
    known abbreviation: True
    is initial: True
'o.':
    known sentence starter: False
    orthographic heuristic suggests is a sentence starter? unknown
    orthographic contexts in training: set()

Text: 'O. Box:' (at offset 94)
Sentence break? None (default decision)
Collocation? True
'o.':
    known abbreviation: True
    is initial: True
'box':
    known sentence starter: False
    orthographic heuristic suggests is a sentence starter? True
    orthographic contexts in training: {'UNK-LC', 'MID-LC'}
If I change "Box" to lowercase, the unwanted sentence break does not occur.
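
To make the capitalization effect concrete, here is a simplified, standalone model of the "abbreviation + orthographic heuristic" decision that the first debug entry reports. It is a sketch based on my reading of the debug output, not NLTK's actual code: the function names `ortho_heuristic` and `breaks_after_abbreviation` are hypothetical, the context labels are the ones printed above, and details such as punctuation handling are left out.

```python
# Simplified model of the "abbreviation + orthographic heuristic" decision.
# NOT NLTK's implementation; all names here are hypothetical.

def ortho_heuristic(next_tok, contexts):
    """Guess whether next_tok starts a sentence: True, False, or "unknown".

    contexts is a set of labels such as {'UNK-LC', 'MID-LC'} describing how
    the lowercased word was seen in training (as printed in the debug output).
    """
    seen_lower = any(c.endswith("-LC") for c in contexts)
    seen_upper = any(c.endswith("-UC") for c in contexts)
    # Capitalized now, but seen lowercase and never capitalized mid-sentence:
    # the capital letter is taken as evidence of a new sentence.
    if next_tok[:1].isupper() and seen_lower and "MID-UC" not in contexts:
        return True
    # Lowercase now, and either seen capitalized somewhere or never seen
    # lowercase at a sentence start: taken as evidence against a new sentence.
    if next_tok[:1].islower() and (seen_upper or "BEG-LC" not in contexts):
        return False
    return "unknown"


def breaks_after_abbreviation(next_tok, contexts):
    """Assumed rule: a known, non-initial abbreviation ends a sentence only
    when the heuristic positively says the next word starts one."""
    return ortho_heuristic(next_tok, contexts) is True


box_contexts = {"UNK-LC", "MID-LC"}  # copied from the debug output for 'box'
print(breaks_after_abbreviation("Box", box_contexts))  # capitalized "Box"
print(breaks_after_abbreviation("box", box_contexts))  # lowercase "box"
```

Under this model, "Box" triggers a break because 'box' was only ever seen in lowercase during training, whereas "box" does not, which matches what I observe. An empty context set, as for 'o.' above, yields "unknown".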