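I am trying to use NLTK to split sentences in a large number of documents.
Most of the time it works fine, but I would like numbered lists to be split more accurately. Here is an example of what I get: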
Transfer of personal data 3.
Personal Data may be disclosed by the SFC...
The names of persons who submit comments...
This will be done by publishing this...
Access to data 4.
You have the right to request access to and correction of your Personal Data in accordance with the provisions of the PDPO.
Retention 5.
Personal Data provided to...
1 Personal Data means personal data as defined in the Personal Data (Privacy) Ordinance (Cap. 486).
2 The term “relevant provisions” is defined...
3 Enquiries 6.
As you can see, many sentences end with a list number/bullet, when that number/bullet should be at the start of the next sentence.
Here is my code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import codecs
import os, sys
import pickle
from pprint import pprint

from nltk import sent_tokenize
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer, PunktLanguageVars, PunktParameters

### TRAINING
# Read the training corpus as UTF-8 text.
with codecs.open("corpus/en1.txt", "r", "utf8") as f:
    text = f.read()

# Train Punkt on the corpus, keeping all collocations.
trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True
trainer.INCLUDE_ABBREV_COLLOCS = True
trainer.train(text)
tokenizer = PunktSentenceTokenizer(trainer.get_params())

### ADD BEGINNING OF SENTENCE
# Try to register the list numbers as sentence starters.
tokenizer._params.sent_starters.add('1.')
tokenizer._params.sent_starters.add('2.')
tokenizer._params.sent_starters.add('3.')
# ...
# some_long_text holds the document to split (not shown here).
sentences = tokenizer.tokenize(some_long_text)
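I even tried an arbitrary keyword, but no new sentence was started at it. For example, the keyword below is "information", which appears in my long text.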
tokenizer._params.sent_starters.add(u'information')
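One possible workaround for the split described above is to leave the tokenizer alone and post-process its output, moving a list number that ends one sentence to the start of the next. The sketch below is only an illustration under stated assumptions: the helper name `reattach_list_numbers` and the `TRAILING_NUMBER` pattern are hypothetical, not part of NLTK, and the pattern assumes list numbers look like one to three digits followed by a period.

```python
import re

# Hypothetical pattern (an assumption, not part of NLTK): a sentence whose
# last token is a bare list number such as "3." preceded by whitespace.
TRAILING_NUMBER = re.compile(r"^(?P<body>.*\S)\s+(?P<num>\d{1,3}\.)$", re.DOTALL)


def reattach_list_numbers(sentences):
    """Move a list number left at the end of one sentence to the start of the next."""
    fixed = []
    carry = ""  # number detached from the previous sentence, if any
    for sentence in sentences:
        sentence = sentence.strip()
        if carry:
            # Prepend the number taken from the previous sentence.
            sentence = carry + " " + sentence
            carry = ""
        match = TRAILING_NUMBER.match(sentence)
        if match:
            # Strip the trailing number and hold it for the next sentence.
            sentence = match.group("body")
            carry = match.group("num")
        fixed.append(sentence)
    if carry:
        # A trailing number with no following sentence: keep it as its own item.
        fixed.append(carry)
    return fixed
```

It would be applied to the tokenizer output, for example `reattach_list_numbers(tokenizer.tokenize(some_long_text))`. Note that this heuristic also fires on ordinary sentences that end in a short number (such as "He is 42."), so the pattern would likely need tightening for real documents.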