Tokenize() in nltk.TweetTokenizer returns integers by splitting them

Date: 2017-07-31 21:59:57

Tags: python nltk tokenize

Tokenize() in nltk.TweetTokenizer returns what look like 32-bit integers, splitting the numbers apart. It only happens for some specific numbers, and I can't see any reason why?

>>> from nltk.tokenize import TweetTokenizer 
>>> tw = TweetTokenizer()
>>> tw.tokenize('the 23135851162 of 3151942776...')
[u'the', u'2313585116', u'2', u'of', u'3151942776', u'...']

The input 23135851162 gets split into [u'2313585116', u'2']

Interestingly, it seems to split all numbers into chunks of 10 digits:

>>> tw.tokenize('the 231358511621231245 of 3151942776...')
[u'the', u'2313585116', u'2123124', u'5', u'of', u'3151942776', u'...']
>>> tw.tokenize('the 231123123358511621231245 of 3151942776...')
[u'the', u'2311231233', u'5851162123', u'1245', u'of', u'3151942776', u'...']

The length of the digit token affects the tokenization:

>>> s = 'the 1234567890 of'
>>> tw.tokenize(s)
[u'the', u'12345678', u'90', u'of']
>>> s = 'the 123456789 of'
>>> tw.tokenize(s)
[u'the', u'12345678', u'9', u'of']
>>> s = 'the 12345678 of'
>>> tw.tokenize(s)
[u'the', u'12345678', u'of']
>>> s = 'the 1234567 of'
>>> tw.tokenize(s)
[u'the', u'1234567', u'of']
>>> s = 'the 123456 of'
>>> tw.tokenize(s)
[u'the', u'123456', u'of']
>>> s = 'the 12345 of'
>>> tw.tokenize(s)
[u'the', u'12345', u'of']
>>> s = 'the 1234 of'
>>> tw.tokenize(s)
[u'the', u'1234', u'of']
>>> s = 'the 123 of'
>>> tw.tokenize(s)
[u'the', u'123', u'of']
>>> s = 'the 12 of'
>>> tw.tokenize(s)
[u'the', u'12', u'of']
>>> s = 'the 1 of'
>>> tw.tokenize(s)
[u'the', u'1', u'of']

And if consecutive digits plus spaces exceed a length of 10:

>>> s = 'the 123 456 78901234  of'
>>> tw.tokenize(s)
[u'the', u'123 456 7890', u'1234', u'of']

2 Answers:

Answer 0 (score: 5)

TL;DR

This appears to be a bug/feature of TweetTokenizer(), and we're not sure what motivated it.

Read on to find out where the bug/feature occurs...

In long

Looking at the tokenize() function in TweetTokenizer, the tokenizer does some preprocessing before the actual tokenization:

  • First, it removes entities from the text by converting them to their corresponding unicode characters via the _replace_html_entities() function
  • Optionally, it removes username handles using the remove_handles() function
  • Optionally, it normalizes word lengthening via the reduce_lengthening function
  • Then, it shortens problematic sequences of characters using the HANG_RE regex
  • Finally, the actual tokenization happens via the WORD_RE regex

After the WORD_RE regex, it:

  • optionally preserves the case of emoticons before lowercasing the tokenized output

Code:

def tokenize(self, text):
    """
    :param text: str
    :rtype: list(str)
    :return: a tokenized list of strings; concatenating this list returns\
    the original string if `preserve_case=False`
    """
    # Fix HTML character entities:
    text = _replace_html_entities(text)
    # Remove username handles
    if self.strip_handles:
        text = remove_handles(text)
    # Normalize word lengthening
    if self.reduce_len:
        text = reduce_lengthening(text)
    # Shorten problematic sequences of characters
    safe_text = HANG_RE.sub(r'\1\1\1', text)
    # Tokenize:
    words = WORD_RE.findall(safe_text)
    # Possibly alter the case, but avoid changing emoticons like :D into :d:
    if not self.preserve_case:
        words = list(map((lambda x : x if EMOTICON_RE.search(x) else
                          x.lower()), words))
    return words

By default, handle stripping and length reduction don't kick in unless the user specifies them.

class TweetTokenizer:
    r"""
    Tokenizer for tweets.

        >>> from nltk.tokenize import TweetTokenizer
        >>> tknzr = TweetTokenizer()
        >>> s0 = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"
        >>> tknzr.tokenize(s0)
        ['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--']

    Examples using `strip_handles` and `reduce_len parameters`:

        >>> tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
        >>> s1 = '@remy: This is waaaaayyyy too much for you!!!!!!'
        >>> tknzr.tokenize(s1)
        [':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']
    """

    def __init__(self, preserve_case=True, reduce_len=False, strip_handles=False):
        self.preserve_case = preserve_case
        self.reduce_len = reduce_len
        self.strip_handles = strip_handles

Let's walk through the steps and the regexes:

>>> from nltk.tokenize.casual import _replace_html_entities
>>> s = 'the 231358523423423421162 of 3151942776...'
>>> _replace_html_entities(s)
u'the 231358523423423421162 of 3151942776...'

Checked: _replace_html_entities() is not the culprit.

remove_handles() and reduce_lengthening() are skipped by default, but for sanity's sake, let's check them anyway:

>>> from nltk.tokenize.casual import _replace_html_entities
>>> s = 'the 231358523423423421162 of 3151942776...'
>>> _replace_html_entities(s)
u'the 231358523423423421162 of 3151942776...'
>>> from nltk.tokenize.casual import remove_handles, reduce_lengthening
>>> remove_handles(_replace_html_entities(s))
u'the 231358523423423421162 of 3151942776...'
>>> reduce_lengthening(remove_handles(_replace_html_entities(s)))
u'the 231358523423423421162 of 3151942776...'

Checked: neither of the optional functions is misbehaving either.

>>> import re
>>> s = 'the 231358523423423421162 of 3151942776...'
>>> HANG_RE = re.compile(r'([^a-zA-Z0-9])\1{3,}')
>>> HANG_RE.sub(r'\1\1\1', s)
'the 231358523423423421162 of 3151942776...'

Klar! HANG_RE is cleared of blame too.

>>> import re
>>> from nltk.tokenize.casual import REGEXPS
>>> WORD_RE = re.compile(r"""(%s)""" % "|".join(REGEXPS), re.VERBOSE | re.I | re.UNICODE)
>>> WORD_RE.findall(s)
['the', '2313585234', '2342342116', '2', 'of', '3151942776', '...']

Achso! That's where the splitting shows up!

Now let's look deeper into WORD_RE, which is built from REGEXPS, a tuple of regexes.

The first one is a massive URL-pattern regex taken from https://gist.github.com/winzig/8894715

Let's go through them one by one:

>>> from nltk.tokenize.casual import REGEXPS
>>> patt = re.compile(r"""(%s)""" % "|".join(REGEXPS), re.VERBOSE | re.I | re.UNICODE)
>>> s = 'the 231358523423423421162 of 3151942776...'
>>> patt.findall(s)
['the', '2313585234', '2342342116', '2', 'of', '3151942776', '...']
>>> patt = re.compile(r"""(%s)""" % "|".join(REGEXPS[:1]), re.VERBOSE | re.I | re.UNICODE)
>>> patt.findall(s)
[]
>>> patt = re.compile(r"""(%s)""" % "|".join(REGEXPS[:2]), re.VERBOSE | re.I | re.UNICODE)
>>> patt.findall(s)
['2313585234', '2342342116', '3151942776']
>>> patt = re.compile(r"""(%s)""" % "|".join(REGEXPS[1:2]), re.VERBOSE | re.I | re.UNICODE)
>>> patt.findall(s)
['2313585234', '2342342116', '3151942776']

Aha! It looks like the second regex from REGEXPS is the one causing the problem!!

If we look at https://github.com/alvations/nltk/blob/develop/nltk/tokenize/casual.py#L122:

# The components of the tokenizer:
REGEXPS = (
    URLS,
    # Phone numbers:
    r"""
    (?:
      (?:            # (international)
        \+?[01]
        [\-\s.]*
      )?
      (?:            # (area code)
        [\(]?
        \d{3}
        [\-\s.\)]*
      )?
      \d{3}          # exchange
      [\-\s.]*
      \d{4}          # base
    )"""
    ,
    # ASCII Emoticons
    EMOTICONS
    ,
    # HTML tags:
    r"""<[^>\s]+>"""
    ,
    # ASCII Arrows
    r"""[\-]+>|<[\-]+"""
    ,
    # Twitter username:
    r"""(?:@[\w_]+)"""
    ,
    # Twitter hashtags:
    r"""(?:\#+[\w_]+[\w\'_\-]*[\w_]+)"""
    ,
    # email addresses
    r"""[\w.+-]+@[\w-]+\.(?:[\w-]\.?)+[\w-]"""
    ,
    # Remaining word types:
    r"""
    (?:[^\W\d_](?:[^\W\d_]|['\-_])+[^\W\d_]) # Words with apostrophes or dashes.
    |
    (?:[+\-]?\d+[,/.:-]\d+[+\-]?)  # Numbers, including fractions, decimals.
    |
    (?:[\w_]+)                     # Words without apostrophes or dashes.
    |
    (?:\.(?:\s*\.){1,})            # Ellipsis dots.
    |
    (?:\S)                         # Everything else that isn't whitespace.
    """
    )

The second regex from REGEXPS tries to parse numbers as phone numbers:

# Phone numbers:
    r"""
    (?:
      (?:            # (international)
        \+?[01]
        [\-\s.]*
      )?
      (?:            # (area code)
        [\(]?
        \d{3}
        [\-\s.\)]*
      )?
      \d{3}          # exchange
      [\-\s.]*
      \d{4}          # base
    )"""

The pattern tries to recognize:

  • Optionally, a first digit matching the international code
  • The next 3 digits as the area code
  • Optionally followed by a dash
  • Then 3 more digits for the (telecom) exchange code
  • Another optional dash
  • Finally, a 4-digit base phone number

See https://regex101.com/r/BQpnsg/1 for a detailed explanation.

And that's why it tries to split runs of consecutive digits into 10-digit chunks!!
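For instance, here's a minimal check (a sketch, assuming the REGEXPS layout shown above with the phone-number pattern at index 1) that compiles just that pattern and runs it against a well-formed US number and a bare 11-digit run:

>>> import re
>>> from nltk.tokenize.casual import REGEXPS
>>> PHONE_RE = re.compile(r"""(%s)""" % REGEXPS[1], re.VERBOSE | re.I | re.UNICODE)
>>> PHONE_RE.findall('+1 555-123-4567')   # a properly formatted number is kept whole
['+1 555-123-4567']
>>> PHONE_RE.findall('23135851162')       # a bare 11-digit run loses its last digit
['2313585116']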

But note the quirks: since the phone-number regex is hardcoded, it can catch real phone numbers in the \d{3}-\d{3}-\d{4} or \d{10} patterns, but if the dashes are in any other arrangement, it won't work:

>>> from nltk.tokenize.casual import REGEXPS
>>> patt = re.compile(r"""(%s)""" % "|".join(REGEXPS[1:2]), re.VERBOSE | re.I | re.UNICODE)
>>> s = '231-358-523423423421162'
>>> patt.findall(s)
['231-358-5234', '2342342116']
>>> s = '2313-58-523423423421162'
>>> patt.findall(s)
['5234234234']

Can we fix this?

See https://github.com/nltk/nltk/issues/1799
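In the meantime, one possible workaround (just a sketch, not an officially supported API; it monkey-patches the module-level WORD_RE that tokenize() reads, as seen in the code above) is to rebuild the master regex without the phone-number pattern:

>>> import re
>>> from nltk.tokenize import casual
>>> no_phone = casual.REGEXPS[:1] + casual.REGEXPS[2:]   # drop the phone-number pattern
>>> casual.WORD_RE = re.compile(r"""(%s)""" % "|".join(no_phone), re.VERBOSE | re.I | re.UNICODE)
>>> casual.TweetTokenizer().tokenize('the 23135851162 of 3151942776...')
[u'the', u'23135851162', u'of', u'3151942776', u'...']

Without the phone-number alternative, the long digit runs fall through to the plain word pattern and stay in one piece; whether this trade-off is acceptable depends on whether you need phone numbers kept together.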

Answer 1 (score: 1)

Part of the TweetTokenizer regex recognizes phone numbers in every imaginable format (search for # Phone numbers: in this document: http://www.nltk.org/_modules/nltk/tokenize/casual.html#TweetTokenizer). Some 10-digit runs look like 10-digit phone numbers, which is why they get turned into separate tokens.
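A small check illustrates this (a sketch; expected outputs assume the default TweetTokenizer() settings): a 10-digit run matches the phone-number pattern exactly and survives as one token, while anything longer is chopped into a 10-digit "phone number" plus the remainder:

>>> from nltk.tokenize import TweetTokenizer
>>> tw = TweetTokenizer()
>>> tw.tokenize('call 3151942776 now')    # exactly 10 digits: kept whole as a "phone number"
[u'call', u'3151942776', u'now']
>>> tw.tokenize('call 31519427761 now')   # 11 digits: a 10-digit chunk plus the remainder
[u'call', u'3151942776', u'1', u'now']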