Simple tokenization issue in NLTK

Date: 2015-05-04 19:51:52

Tags: python nlp nltk

I want to tokenize the following text:

In Düsseldorf I took my hat off. But I can't put it back on.

I would like the result to be:

'In', 'Düsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I', 
'can't', 'put', 'it', 'back', 'on', '.'

But to my surprise none of the NLTK tokenizers produce this. How can I accomplish it? Is it possible to combine these tokenizers somehow to achieve the output above?

2 answers:

Answer 0 (score: 2):

You can use one of those tokenizers as a starting point and then fix the contractions (assuming that is the problem):

from nltk.tokenize.treebank import TreebankWordTokenizer

text = "In Düsseldorf I took my hat off. But I can't put it back on."
tokens = TreebankWordTokenizer().tokenize(text)

contractions = ["n't", "'ll", "'m"]

# Indices of every contraction token produced by the tokenizer.
fix = [i for i, tok in enumerate(tokens) if tok in contractions]

# Merge each contraction back onto the preceding token. Every deletion
# shifts the remaining indices left by one, hence the growing offset.
fix_offset = 0
for fix_id in fix:
    idx = fix_id - 1 - fix_offset
    tokens[idx] += tokens[idx + 1]
    del tokens[idx + 1]
    fix_offset += 1

print(tokens)

>>>['In', 'Düsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I', "can't", 'put', 'it', 'back', 'on', '.']
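The same merge can be done in a single backward pass, which avoids the offset bookkeeping entirely. A minimal sketch on a hard-coded token list (so it runs without NLTK installed); the function name is mine, not part of NLTK:

```python
def merge_contractions(tokens, contractions=("n't", "'ll", "'m")):
    # Walk from the end so each deletion can't shift an index
    # we haven't visited yet.
    merged = list(tokens)
    for i in range(len(merged) - 1, 0, -1):
        if merged[i] in contractions:
            merged[i - 1] += merged.pop(i)
    return merged

tokens = ['But', 'I', 'ca', "n't", 'put', 'it', 'back', 'on', '.']
print(merge_contractions(tokens))
# ['But', 'I', "can't", 'put', 'it', 'back', 'on', '.']
```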

Answer 1 (score: 1):

You should split the text into sentences before tokenizing it into words:

>>> from nltk import sent_tokenize, word_tokenize
>>> text = "In Düsseldorf I took my hat off. But I can't put it back on."
>>> text = [word_tokenize(s) for s in sent_tokenize(text)]
>>> text
[['In', 'D\xc3\xbcsseldorf', 'I', 'took', 'my', 'hat', 'off', '.'], ['But', 'I', 'ca', "n't", 'put', 'it', 'back', 'on', '.']]
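(Side note: `'D\xc3\xbcsseldorf'` above is just how a Python 2 repr prints the UTF-8 bytes of `'Düsseldorf'`; the token itself is correct. A quick check, runnable on Python 3:)

```python
# The escaped bytes in the Python 2 repr are the UTF-8 encoding
# of "ü"; decoding them recovers the original string.
raw = b'D\xc3\xbcsseldorf'
print(raw.decode('utf-8'))  # Düsseldorf
```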

If you want to put them back into a single flat list:

>>> from itertools import chain
>>> list(chain(*text))
['In', 'D\xc3\xbcsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I', 'ca', "n't", 'put', 'it', 'back', 'on', '.']

And if you must turn ["ca", "n't"] -> ["can't"]:

>>> from itertools import izip_longest, chain
>>> tok_text = list(chain(*[word_tokenize(s) for s in sent_tokenize(text)]))
>>> contractions = ["n't", "'ll", "'re", "'s"]

# Iterate through two words at a time and then join the contractions back.
>>> [w1+w2 if w2 in contractions else w1 for w1,w2 in izip_longest(tok_text, tok_text[1:])]
['In', 'D\xc3\xbcsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I', "can't", "n't", 'put', 'it', 'back', 'on', '.']
# Remove all contraction tokens since you've joint them to their root stem.
>>> [w for w in [w1+w2 if w2 in contractions else w1 for w1,w2 in izip_longest(tok_text, tok_text[1:])] if w not in contractions]
['In', 'D\xc3\xbcsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I', "can't", 'put', 'it', 'back', 'on', '.']
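The snippet above is Python 2; on Python 3, `izip_longest` was renamed `itertools.zip_longest`. The same pairwise trick, sketched on a hard-coded token list so it runs without NLTK:

```python
from itertools import zip_longest  # named izip_longest in Python 2

tok_text = ['But', 'I', 'ca', "n't", 'put', 'it', 'back', 'on', '.']
contractions = ["n't", "'ll", "'re", "'s"]

# Join each token with its right neighbour when that neighbour is a
# contraction, then drop the now-redundant contraction tokens.
joined = [w1 + w2 if w2 in contractions else w1
          for w1, w2 in zip_longest(tok_text, tok_text[1:], fillvalue='')]
result = [w for w in joined if w not in contractions]
print(result)
# ['But', 'I', "can't", 'put', 'it', 'back', 'on', '.']
```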