Modifying NLTK word_tokenize to prevent tokenization of parentheses

Date: 2016-05-09 05:59:54

Tags: python regex nlp nltk tokenize

I have the following main.py:

#!/usr/bin/env python
# vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:

import nltk
import string
import sys

# Tokenize everything read from stdin, then drop tokens that are a
# single punctuation character.
for token in nltk.word_tokenize(''.join(sys.stdin.readlines())):
    if len(token) > 1 or token not in string.punctuation:
        print(token)

The output is as follows:

./main.py <<< 'EGR1(-/-) mouse embryonic fibroblasts'
EGR1
-/-
mouse
embryonic
fibroblasts

I'd like to change the tokenizer slightly so that EGR1(-/-) is recognized as a single token (with no other changes to its behavior). Does anyone know of a way to make such a small modification to the tokenizer? Thanks.

1 Answer:

Answer 0 (score: 2)

The default tokenizer behind NLTK's word_tokenize() is the TreebankWordTokenizer, which is based on a sequence of regex substitutions.

More specifically, to add spaces around parentheses and brackets, the TreebankWordTokenizer applies these regex substitutions:

PARENS_BRACKETS = [
    (re.compile(r'[\]\[\(\)\{\}\<\>]'), r' \g<0> '),
    (re.compile(r'--'), r' -- '),
]

for regexp, substitution in self.PARENS_BRACKETS:
    text = regexp.sub(substitution, text)

For example:

import re

text = 'EGR1(-/-) mouse embryonic fibroblasts'

PARENS_BRACKETS = [
    (re.compile(r'[\]\[\(\)\{\}\<\>]'), r' \g<0> '),
    (re.compile(r'--'), r' -- '),
]

for regexp, substitution in PARENS_BRACKETS:
    text = regexp.sub(substitution, text)

print(text)

[OUT]:

EGR1 ( -/- )  mouse embryonic fibroblasts

So, coming back to "hacking" the NLTK word_tokenize() function, you can try something like this to cancel the effect of the PARENS_BRACKETS substitutions:

>>> from nltk.tokenize import TreebankWordTokenizer
>>> tokenizer = TreebankWordTokenizer()
>>> tokenizer.PARENS_BRACKETS = []
>>> text = 'EGR1(-/-) mouse embryonic fibroblasts'
>>> tokenizer.tokenize(text)
['EGR1(-/-)', 'mouse', 'embryonic', 'fibroblasts']