Question

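I'm using NLTK's Punkt sentence tokenizer to split a file into a list of sentences, and I would like to preserve the empty lines within the file: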

from nltk import data
tokenizer = data.load('tokenizers/punkt/english.pickle')
s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"
sentences = tokenizer.tokenize(s)
print(sentences)

I would like this to print:

['That was a very loud beep.\n\n', "I don't even know\n if this is working.", 'Mark?\n\n', 'Mark are you there?\n\n\n']

But what actually gets printed shows that the trailing empty lines have been removed from the first and third sentences:

['That was a very loud beep.', "I don't even know\n if this is working.", 'Mark?', 'Mark are you there?\n\n\n']

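Other tokenizers in NLTK have a blanklines='keep' parameter, but I don't see any such option for the Punkt tokenizer. It's quite possible I'm missing something simple. Is there a way to retain these trailing empty lines using the Punkt sentence tokenizer? I'd be grateful for any insights others can provide!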

Answer 1

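The problem

Sadly, you can't make the tokenizer keep the blank lines, not with the way it is written.

Starting here and following the function calls through span_tokenize() and _slices_from_text(), you can see there is a condition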

if match.group('next_tok'):

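that exists to make the tokenizer skip whitespace until the next possible sentence-starting token occurs. Looking for the regex this works on, we end up at _period_context_fmt, where the next_tok named group is preceded by \s+. That \s+ sits inside a lookahead, so the blank lines are inspected but never captured as part of the sentence.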
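To make that concrete, here is a minimal sketch using only the standard re module. The pattern is a hand-simplified stand-in for _period_context_fmt: the hard-coded character classes are my assumption for illustration, not what NLTK actually expands %(SentEndChars)s and %(NonWord)s to. Because \s+ lives inside the lookahead, the match ends right after the period.

```python
import re

# Simplified stand-in for Punkt's period-context regex (an assumption for
# illustration; the real pattern is built from PunktLanguageVars attributes).
ORIGINAL_LIKE = re.compile(
    r"""
    \S*                       # some word material
    [.?!]                     # a potential sentence ending
    (?=(?P<after_tok>
        [)\]"']               # either other punctuation
        |
        \s+(?P<next_tok>\S+)  # or whitespace and some other token
    ))""",
    re.VERBOSE,
)

text = "That was a very loud beep.\n\n I don't even know"
m = ORIGINAL_LIKE.search(text)

print(repr(m.group(0)))           # the consumed text stops at the period
print(repr(m.group("next_tok")))  # the lookahead still sees the next token
print(repr(text[m.end():m.end() + 3]))  # whitespace left outside the match
```

The whitespace between the period and the next token belongs to neither match, which is consistent with the blank lines disappearing from the tokenizer's output.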

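The solution

Break it down, change the part you don't like, and reassemble your custom solution.

Now, this regex lives in the PunktLanguageVars class, which is itself used to initialize the PunktSentenceTokenizer class. We just have to derive a custom class from PunktLanguageVars and fix the regex the way we want it to be.

The fix we want is to include trailing newlines at the end of a sentence, so I suggest replacing _period_context_fmt, going from this: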

_period_context_fmt = r"""
    \S*                          # some word material
    %(SentEndChars)s             # a potential sentence ending
    (?=(?P<after_tok>
        %(NonWord)s              # either other punctuation
        |
        \s+(?P<next_tok>\S+)     # or whitespace and some other token
    ))"""

to this:

_period_context_fmt = r"""
    \S*                          # some word material
    %(SentEndChars)s             # a potential sentence ending
    \s*                       #  <-- THIS is what I changed
    (?=(?P<after_tok>
        %(NonWord)s              # either other punctuation
        |
        (?P<next_tok>\S+)     #  <-- Normally you would have \s+ here
    ))"""

Now a tokenizer using this regex instead of the old one will include zero or more \s characters after the end of a sentence.
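The effect of moving the whitespace out of the lookahead can be checked without NLTK. This is only a sketch: the pattern below is a simplified stand-in with hard-coded character classes (my assumption, not NLTK's real expansion of the format string), but it has the same shape as the modified regex.

```python
import re

# Simplified stand-in for the modified period-context regex (illustrative
# assumption; character classes are hard-coded rather than taken from NLTK).
MODIFIED_LIKE = re.compile(
    r"""
    \S*                       # some word material
    [.?!]                     # a potential sentence ending
    \s*                       # trailing whitespace is now consumed
    (?=(?P<after_tok>
        [)\]"']               # either other punctuation
        |
        (?P<next_tok>\S+)     # or some other token
    ))""",
    re.VERBOSE,
)

text = "That was a very loud beep.\n\n I don't even know"
m = MODIFIED_LIKE.search(text)

print(repr(m.group(0)))           # now includes the newlines and the space
print(repr(m.group("next_tok")))
```

Note that \s* swallows every kind of whitespace, so the single space before the next sentence ends up in the match along with the newlines.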

The whole script

import nltk.tokenize.punkt as pkt

class CustomLanguageVars(pkt.PunktLanguageVars):

    _period_context_fmt = r"""
        \S*                          # some word material
        %(SentEndChars)s             # a potential sentence ending
        \s*                       #  <-- THIS is what I changed
        (?=(?P<after_tok>
            %(NonWord)s              # either other punctuation
            |
            (?P<next_tok>\S+)     #  <-- Normally you would have \s+ here
        ))"""

custom_tknzr = pkt.PunktSentenceTokenizer(lang_vars=CustomLanguageVars())

s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"

print(custom_tknzr.tokenize(s))

Output:

['That was a very loud beep.\n\n ', "I don't even know\n if this is working. ", 'Mark?\n\n ', 'Mark are you there?\n\n\n']
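This output keeps the blank lines, but it also keeps the space that preceded the next sentence, which the list asked for in the question does not have. If that matters, one possible post-processing step is to strip trailing spaces and tabs while leaving newlines alone. The helper name strip_trailing_blanks is hypothetical, not part of NLTK.

```python
def strip_trailing_blanks(sentence):
    """Drop trailing spaces and tabs, but keep any trailing newlines."""
    # str.rstrip with an explicit character set stops at the first
    # character not in the set, so it never removes a "\n".
    return sentence.rstrip(" \t")


# The list printed by the script above
sentences = [
    'That was a very loud beep.\n\n ',
    "I don't even know\n if this is working. ",
    'Mark?\n\n ',
    'Mark are you there?\n\n\n',
]

cleaned = [strip_trailing_blanks(s) for s in sentences]
print(cleaned)
```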

Answer 2

Split the input into paragraphs, splitting on a capturing regexp (which returns the captured separator strings as well):

import re
paras = re.split(r"(\n\s*\n)", sentences)
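For anyone unsure what the capturing group buys you, here is a small standalone illustration on a made-up string (the sample text is mine). With the parentheses, re.split() returns the separators as list items, so joining the pieces gives back the original text.

```python
import re

# Made-up sample with two different blank-line separators
text = "First paragraph. Still first.\n\nSecond paragraph.\n \nThird."

# Parentheses around the pattern make re.split() keep the separators
paras = re.split(r"(\n\s*\n)", text)
print(paras)

# Nothing is lost: the pieces reassemble into the original string
print("".join(paras) == text)
```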

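You can then apply nltk.sent_tokenize() to the individual paragraphs, and either process the results paragraph by paragraph or flatten the list, whichever best suits your further use.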

sents_by_para = [ nltk.sent_tokenize(p) for p in paras ]
flat = [ sent for par in sents_by_para for sent in par ]

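(It seems sent_tokenize() doesn't mangle whitespace-only strings, so there's no need to check for them and exclude them from processing.)

If you specifically want the whitespace attached to the previous sentence, you can easily stick it back on: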

collapsed = []
for s in flat:
    if s.isspace() and len(collapsed) > 0:
        collapsed[-1] += s
    else:
        collapsed.append(s)
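As a quick check of that loop, here it is wrapped in a function and run on a hand-written list. The function name collapse_whitespace and the sample data are mine; the list only imitates what the flattening step might produce.

```python
def collapse_whitespace(items):
    """Append each whitespace-only item to the item before it."""
    collapsed = []
    for item in items:
        if item.isspace() and collapsed:
            collapsed[-1] += item
        else:
            # Also reached by a whitespace-only item at the very start,
            # since there is nothing to attach it to yet.
            collapsed.append(item)
    return collapsed


flat = [
    "That was a very loud beep.",
    "\n\n",
    "I don't even know\n if this is working.",
    "Mark?",
    "\n\n",
    "Mark are you there?",
]
print(collapse_whitespace(flat))
```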

Answer 3

I would go with itertools.groupby, see Python: How to loop through blocks of lines:

alvas@ubi:~$ echo """This is a foo bar sentence,
that is also a foo bar sentence.

But I don't like foobars.
Yes you do like bars with foos, no?


I'm not sure whether you like bar bar!
Neither do I like black sheep.""" > test.in



alvas@ubi:~$ python
>>> from nltk import sent_tokenize
>>> import itertools
>>> with open('test.in', 'r') as fin:
...     for key, group in itertools.groupby(fin, lambda x: x!='\n'):
...             if key:
...                     print(list(group))
... 
['This is a foo bar sentence,\n', 'that is also a foo bar sentence.\n']
["But I don't like foobars.\n", 'Yes you do like bars with foos, no?\n']
["I'm not sure whether you like bar bar!\n", 'Neither do I like black sheep.\n']

After that, if you want to run sent_tokenize or another Punkt model within each group:

>>> with open('test.in', 'r') as fin:
...     for key, group in itertools.groupby(fin, lambda x: x!='\n'):
...             if key:
...                     paragraph = " ".join(line.strip() for line in group)
...                     print(sent_tokenize(paragraph))
... 
['This is a foo bar sentence, that is also a foo bar sentence.']
["But I don't like foobars.", 'Yes you do like bars with foos, no?']
["I'm not sure whether you like bar bar!", 'Neither do I like black sheep.']

(Note: a more computationally efficient method would be to use mmap, see https://stackoverflow.com/a/3915398/610569. But for the size I work with (~20 million tokens), itertools.groupby is sufficient.)
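The sessions above throw the blank-line groups away (the `if key:` check), whereas the question wants them preserved. A variation that keeps both kinds of group might look like the sketch below. The function name is hypothetical, io.StringIO stands in for the opened file, and I test `line.strip() != ""` rather than `x != '\n'` so that lines containing only spaces also count as blank; that last choice is my assumption about what is wanted.

```python
import io
import itertools


def blocks_with_blank_runs(lines):
    """Yield (is_text, block) for alternating runs of text and blank lines."""
    groups = itertools.groupby(lines, lambda line: line.strip() != "")
    for is_text, group in groups:
        # Joining with "" keeps every original newline in place
        yield is_text, "".join(group)


# Stand-in for an opened file; iterating yields lines with their "\n"
sample = io.StringIO(
    "This is a foo bar sentence,\n"
    "that is also a foo bar sentence.\n"
    "\n"
    "But I don't like foobars.\n"
    "\n"
    "\n"
    "Neither do I like black sheep.\n"
)

blocks = list(blocks_with_blank_runs(sample))
for is_text, block in blocks:
    print(is_text, repr(block))
```

The text blocks can then be passed to sent_tokenize, and the blank runs appended to the last sentence of the preceding block.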

Answer 4

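In the end, I combined insights from both @alexis and @HugoMailhot so that I can preserve linebreaks in cases where a single paragraph has multiple sentences and/or linebreaks: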

import re, nltk, sys, codecs
import nltk.tokenize.punkt as pkt
from nltk import data

class CustomLanguageVars(pkt.PunktLanguageVars):

    _period_context_fmt = r"""
        \S*                          # some word material
        %(SentEndChars)s             # a potential sentence ending
        \s*                       #  <-- THIS is what I changed
        (?=(?P<after_tok>
            %(NonWord)s              # either other punctuation
            |
            (?P<next_tok>\S+)     #  <-- Normally you would have \s+ here
        ))"""

custom_tokenizer = pkt.PunktSentenceTokenizer(lang_vars=CustomLanguageVars())

def sentence_split(text):
    '''Read in a string and return a list of sentences with linebreaks intact'''
    # The capturing group makes re.split() return the blank-line separators too
    paras = re.split(r"(\n\s*\n)", text)
    sents_by_para = [custom_tokenizer.tokenize(p) for p in paras]
    flat = [sent for par in sents_by_para for sent in par]

    # Glue whitespace-only items back onto the preceding sentence
    collapsed = []
    for sent in flat:
        if sent.isspace() and len(collapsed) > 0:
            collapsed[-1] += sent
        else:
            collapsed.append(sent)

    return collapsed

if __name__ == "__main__":
    # Read the input file named on the command line as UTF-8
    with codecs.open(sys.argv[1], 'r', 'utf-8') as f:
        s = f.read()
    sentences = sentence_split(s)
