Question

I have this passage from Wikipedia:

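An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952. Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law. It was during the tenure of Fr. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action.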

I am using NLTK's nltk.sent_tokenize to get the sentences from this text, and the output is:

 [
'An ambitious campus expansion plan was proposed by Fr.', 
'Vernon F. Gallagher in 1952.', 
'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.', 
'It was during the tenure of Fr.', 
"Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."
 ]

Although NLTK handles F. Henry J. McAnulty correctly, it fails on Fr. Vernon F. Gallagher: the Vernon F. Gallagher part is split off into a new sentence.

The desired output is:

[
'An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952.', 
'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.', 
"It was during the tenure of Fr. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."
 ]

Is there a way to improve the tokenizer for this case?

Answer 1

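The nice thing about the Kiss and Strunk (2006) Punkt algorithm is that it is unsupervised. So, given a new text, you should retrain the model and apply it to your text, e.g.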

>>> from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
>>> text = "An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952. Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law. It was during the tenure of F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."

# Training a new model with the text.
>>> tokenizer = PunktSentenceTokenizer()
>>> tokenizer.train(text)
<nltk.tokenize.punkt.PunktParameters object at 0x106c5d828>

# It automatically learns the abbreviations.
>>> tokenizer._params.abbrev_types
{'f', 'fr', 'j'}

# Use the customized tokenizer.
>>> tokenizer.tokenize(text)
['An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952.', 'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.', "It was during the tenure of F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."]
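
One caveat, stated with some hedging: in recent NLTK releases, `PunktSentenceTokenizer.train()` appears to return the learned `PunktParameters` rather than store them on the tokenizer, so `tokenizer._params` may stay empty after the call. A minimal sketch that does not depend on that behavior trains a `PunktTrainer` explicitly and hands its parameters to a new tokenizer. The helper name `train_punkt` is made up for illustration, and on a text this short the learned abbreviations may differ from the ones shown above.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer


def train_punkt(training_text):
    """Train Punkt on training_text and return a tokenizer that uses the result.

    train_punkt is a hypothetical helper name, not part of NLTK.
    """
    trainer = PunktTrainer()
    # Collect statistics and derive abbreviations, collocations and
    # sentence starters from the training text.
    trainer.train(training_text, finalize=True)
    # Passing a PunktParameters object (instead of a string) makes the
    # tokenizer use those parameters directly.
    return PunktSentenceTokenizer(trainer.get_params())


text = (
    "An ambitious campus expansion plan was proposed by Fr. Vernon F. "
    "Gallagher in 1952. Assumption Hall, the first student dormitory, was "
    "opened in 1954, and Rockwell Hall was dedicated in November 1958, "
    "housing the schools of business and law. It was during the tenure of "
    "Fr. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put "
    "to action."
)

tokenizer = train_punkt(text)
learned = tokenizer._params.abbrev_types  # private attribute, as used above
sentences = tokenizer.tokenize(text)

print(sorted(learned))
for sentence in sentences:
    print(sentence)
```

In practice the training text would be a larger corpus from the same domain, with the resulting tokenizer then applied to new documents.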

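If there is not enough data to produce good statistics when retraining the model, you can also supply a predetermined list of abbreviations before training; see How to avoid NLTK's sentence tokenizer spliting on abbreviations?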

>>> from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

>>> punkt_param = PunktParameters()
>>> abbreviation = ['f', 'fr', 'j']
>>> punkt_param.abbrev_types = set(abbreviation)

>>> tokenizer = PunktSentenceTokenizer(punkt_param)
>>> tokenizer.train(text)
<nltk.tokenize.punkt.PunktParameters object at 0x106c5d828>

>>> tokenizer.tokenize(text)
['An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952.', 'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.', "It was during the tenure of F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."]
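
When only a handful of abbreviations are at stake, training can be skipped entirely. As a minimal sketch (the helper name `tokenizer_with_abbreviations` and the abbreviation list are assumptions chosen for this text, not part of NLTK), the tokenizer below starts from empty Punkt parameters seeded only with known abbreviations, stored lowercase and without the trailing period:

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer


def tokenizer_with_abbreviations(abbreviations):
    """Build an untrained Punkt tokenizer that knows the given abbreviations.

    tokenizer_with_abbreviations is a hypothetical helper, not an NLTK function.
    """
    params = PunktParameters()
    # Punkt stores abbreviations lowercase and without the final period.
    params.abbrev_types = {abbr.lower().rstrip(".") for abbr in abbreviations}
    # A PunktParameters argument is used as-is; no training takes place.
    return PunktSentenceTokenizer(params)


text = (
    "An ambitious campus expansion plan was proposed by Fr. Vernon F. "
    "Gallagher in 1952. Assumption Hall, the first student dormitory, was "
    "opened in 1954, and Rockwell Hall was dedicated in November 1958, "
    "housing the schools of business and law. It was during the tenure of "
    "Fr. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put "
    "to action."
)

tokenizer = tokenizer_with_abbreviations(["Fr.", "F.", "J."])
sentences = tokenizer.tokenize(text)

for sentence in sentences:
    print(sentence)
```

For this text the result should be the three sentences from the desired output in the question. The trade-off is that nothing is learned from the data, so any abbreviation missing from the list will still end a sentence.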
