Question

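I am using the tokenizer from NLTK in Python.

There are already plenty of answers on the forum about removing punctuation. However, none of them addresses all of the following issues together: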

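1. More than one symbol in a row. For example, take the sentence: He said,"that's it." Because the comma is followed directly by a quotation mark, the tokenizer does not remove the ." at the end of the sentence. The tokenizer gives ['He', 'said', ',"', 'that', 's', 'it.'] instead of ['He', 'said', 'that', 's', 'it']. Other examples include '...', '--', '!?', ',"', and so on.
2. Symbols at the end of a sentence. For example, for the sentence: Hello World. the tokenizer gives ['Hello', 'World.'] instead of ['Hello', 'World']. Note the period at the end of the word 'World'. Other examples include '--' and ',' at the beginning, middle, or end of any token.
3. Characters with symbols in front of and after them, i.e. '*u*', '''', '""'. These should be removed.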

Is there an elegant way of solving all of these problems?

Answer 1

Solution 1: Tokenize, then strip punctuation from the tokens

>>> from nltk import word_tokenize
>>> import string
>>> punctuations = list(string.punctuation)
>>> punctuations
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
>>> punctuations.append("''")
>>> sent = '''He said,"that's it."'''
>>> word_tokenize(sent)
['He', 'said', ',', "''", 'that', "'s", 'it', '.', "''"]
>>> [i for i in word_tokenize(sent) if i not in punctuations]
['He', 'said', 'that', "'s", 'it']
>>> [i.strip("".join(punctuations)) for i in word_tokenize(sent) if i not in punctuations]
['He', 'said', 'that', 's', 'it']
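
In the session above, the punctuation list has to be extended by hand for each multi-character token (here ''). As a minimal sketch, assuming the tokens have already been produced by word_tokenize, the stripping step can also be written so that any token consisting only of punctuation is dropped. The helper name strip_punct_tokens is made up for illustration, and the token list is copied from the output above:

```python
import string

def strip_punct_tokens(tokens):
    """Strip leading/trailing punctuation from each token; drop tokens left empty."""
    cleaned = []
    for tok in tokens:
        core = tok.strip(string.punctuation)
        if core:  # tokens such as ',', "''", '...' or '--' become '' and are skipped
            cleaned.append(core)
    return cleaned

# Token list copied from the word_tokenize output shown above.
tokens = ['He', 'said', ',', "''", 'that', "'s", 'it', '.', "''"]
print(strip_punct_tokens(tokens))  # ['He', 'said', 'that', 's', 'it']
```

Because str.strip only touches the ends of a token, punctuation inside a token is left alone.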

Solution 2: Remove punctuation, then tokenize

>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> sent = '''He said,"that's it."'''
>>> " ".join("".join([" " if ch in string.punctuation else ch for ch in sent]).split())
'He said that s it'
>>> " ".join("".join([" " if ch in string.punctuation else ch for ch in sent]).split()).split()
['He', 'said', 'that', 's', 'it']
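
The same idea can be sketched with a translation table instead of a per-character list comprehension. This is only an alternative spelling of Solution 2, not part of the original answer; the names PUNCT_TO_SPACE and remove_punct_then_split are made up for illustration:

```python
import string

# Hypothetical table name: maps every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def remove_punct_then_split(text):
    """Replace punctuation with spaces, then split on whitespace."""
    return text.translate(PUNCT_TO_SPACE).split()

sent = '''He said,"that's it."'''
print(remove_punct_then_split(sent))  # ['He', 'said', 'that', 's', 'it']
```

Note that, like Solution 2 itself, this keeps the letter in '*u*' (it returns 'u' as a token), so it does not cover the third requirement in the question.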

Answer 2

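If you want to tokenize your string in one shot, I think your only choice is to use nltk.tokenize.RegexpTokenizer. The following approach lets you use punctuation as a marker to remove alphabetic characters (as noted in your third requirement) before removing the punctuation altogether. In other words, this approach removes *u* before stripping all punctuation.

One way to go about this, then, is to tokenize on gaps like so: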

>>> from nltk.tokenize import RegexpTokenizer
>>> s = '''He said,"that's it." *u* Hello, World.'''
>>> toker = RegexpTokenizer(r'((?<=[^\w\s])\w(?=[^\w\s])|(\W))+', gaps=True)
>>> toker.tokenize(s)
['He', 'said', 'that', 's', 'it', 'Hello', 'World']  # omits *u* per your third requirement

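This should meet all three of the criteria you specified above. Note, however, that this tokenizer will not return tokens such as "A". Furthermore, I only treat single letters that begin and end with punctuation as gaps. Otherwise, "Go." would not return a token. You may need to refine the regex in other ways, depending on what your data looks like and what your expectations are.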
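
To see how the gap pattern behaves without NLTK installed, here is a rough sketch using the standard-library re module. It is an approximation for experimenting with the regex, not a claim about how RegexpTokenizer is implemented; the names GAP and tokenize_on_gaps are made up for illustration, and the groups are written as non-capturing so that re.split does not return the separators:

```python
import re

# Same alternation as the pattern above: a single word character wrapped in
# punctuation, or any non-word character, repeated. Non-capturing groups are
# an assumption made here so that re.split returns only the tokens.
GAP = re.compile(r'(?:(?<=[^\w\s])\w(?=[^\w\s])|\W)+')

def tokenize_on_gaps(text):
    """Split text on gaps; discard the empty strings re.split leaves at the edges."""
    return [tok for tok in GAP.split(text) if tok]

s = '''He said,"that's it." *u* Hello, World.'''
print(tokenize_on_gaps(s))          # ['He', 'said', 'that', 's', 'it', 'Hello', 'World']
print(tokenize_on_gaps('"A" Go.'))  # ['Go']
```

The second call illustrates the caveat above: the quoted single letter "A" is swallowed as part of a gap, while "Go." still yields the token Go.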
