I'm trying to remove punctuation while tokenizing a sentence in Python, but I have several "conditions" where I want punctuation to be kept within a token. Some examples are URLs, email addresses, or certain symbols with no whitespace next to them. For example:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w']+")
tokenizer.tokenize("please help me ignore punctuation like . or , but at the same time don't ignore if it looks like a url i.e. google.com or google.co.uk. Sometimes I also want conditions where I see an equals sign between words such as myname=shecode")
Right now the output looks like
['please', 'help', 'me', 'ignore', 'punctuation', 'like', 'or', 'but', 'at', 'the', 'same', 'time', "don't", 'ignore', 'if', 'it', 'looks', 'like', 'a', 'url', 'i', 'e', 'google', 'com', 'or', 'google', 'co', 'uk', 'Sometimes', 'I', 'also', 'want', 'conditions', 'where', 'I', 'see', 'an', 'equals', 'sign', 'between', 'words', 'such', 'as', 'myname', 'shecode']
But what I really want it to look like is
['please', 'help', 'me', 'ignore', 'punctuation', 'like', 'or', 'but', 'at', 'the', 'same', 'time', "don't", 'ignore', 'if', 'it', 'looks', 'like', 'a', 'url', 'i', 'e', 'google.com', 'or', 'google.co.uk', 'Sometimes', 'I', 'also', 'want', 'conditions', 'where', 'I', 'see', 'an', 'equals', 'sign', 'between', 'words', 'such', 'as', 'myname=shecode']
Answer 0 (score: 0)
Change your regex to the following:
tokenizer = RegexpTokenizer(r"[\w+.]+")
Inside a character class, . matches a literal dot, not "any character". Your original pattern did not include ., so the tokenizer split tokens on it; adding . to the class keeps dots inside tokens and prevents that split.
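Since RegexpTokenizer essentially applies its pattern repeatedly, much like re.findall, the effect of adding . to the character class can be sketched with the standard library alone (the sample strings here are illustrative):

```python
import re

# Inside a character class, "." is a literal dot, so it stays inside tokens.
pattern = r"[\w+.]+"

print(re.findall(pattern, "a url i.e. google.com or google.co.uk"))
# → ['a', 'url', 'i.e.', 'google.com', 'or', 'google.co.uk']

# Caveat: a dot standing alone still matches the class, so stray "."
# tokens can survive while "," is dropped:
print(re.findall(pattern, "punctuation like . or ,"))
# → ['punctuation', 'like', '.', 'or']
```

Note the caveat: this keeps sentence-final dots attached (i.e. stays 'i.e.') and leaves isolated '.' tokens in the output, so some post-filtering may still be needed.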
Answer 1 (score: 0)
Try this code and see if it works for you.
from nltk.tokenize import word_tokenize
punct_list = ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
s = "please help me ignore punctuation like . or , but at the same time don't ignore if it looks like a url i.e. google.com or google.co.uk. Sometimes I also want conditions where I see an equals sign between words such as myname=shecode"
print([i.strip("".join(punct_list)) for i in word_tokenize(s) if i not in punct_list])
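The key idea in this answer is str.strip, which removes the given characters only from the ends of a string, so punctuation inside a token survives. A minimal stdlib-only sketch (the sample tokens are illustrative):

```python
import string

# strip() trims the listed characters from both ends only,
# so internal dots, apostrophes, and equals signs are preserved.
for token in ["google.co.uk.", "don't", "myname=shecode", ","]:
    print(repr(token.strip(string.punctuation)))
# 'google.co.uk'    -- trailing dot removed
# "don't"           -- internal apostrophe kept
# 'myname=shecode'  -- internal equals sign kept
# ''                -- pure punctuation strips to empty, hence the filter above
```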
Answer 2 (score: 0)
You could use a tokenizer with more sophisticated regexes, e.g. the TreebankWordTokenizer behind nltk.word_tokenize; see How do I tokenize a string sentence in NLTK?:
>>> from nltk import word_tokenize
>>> text ="please help me ignore punctuation like . or , but at the same time don't ignore if it looks like a url i.e. google.com or google.co.uk. Sometimes I also want conditions where I see an equals sign between words such as myname=shecode"
>>> word_tokenize(text)
['please', 'help', 'me', 'ignore', 'punctuation', 'like', '.', 'or', ',', 'but', 'at', 'the', 'same', 'time', 'do', "n't", 'ignore', 'if', 'it', 'looks', 'like', 'a', 'url', 'i.e', '.', 'google.com', 'or', 'google.co.uk', '.', 'Sometimes', 'I', 'also', 'want', 'conditions', 'where', 'I', 'see', 'an', 'equals', 'sign', 'between', 'words', 'such', 'as', 'myname=shecode']
If you also want to remove stopwords, see Stopword removal with NLTK:
>>> from string import punctuation
>>> from nltk.corpus import stopwords
>>> from nltk import word_tokenize
>>> stoplist = stopwords.words('english') + list(punctuation)
>>> text ="please help me ignore punctuation like . or , but at the same time don't ignore if it looks like a url i.e. google.com or google.co.uk. Sometimes I also want conditions where I see an equals sign between words such as myname=shecode"
>>> word_tokenize(text)
['please', 'help', 'me', 'ignore', 'punctuation', 'like', '.', 'or', ',', 'but', 'at', 'the', 'same', 'time', 'do', "n't", 'ignore', 'if', 'it', 'looks', 'like', 'a', 'url', 'i.e', '.', 'google.com', 'or', 'google.co.uk', '.', 'Sometimes', 'I', 'also', 'want', 'conditions', 'where', 'I', 'see', 'an', 'equals', 'sign', 'between', 'words', 'such', 'as', 'myname=shecode']
>>> [token for token in word_tokenize(text) if token not in stoplist]
['please', 'help', 'ignore', 'punctuation', 'like', 'time', "n't", 'ignore', 'looks', 'like', 'url', 'i.e', 'google.com', 'google.co.uk', 'Sometimes', 'I', 'also', 'want', 'conditions', 'I', 'see', 'equals', 'sign', 'words', 'myname=shecode']
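The stopword filter in the last step is just membership testing against a combined list; if downloading NLTK's stopwords corpus isn't an option, the same step can be sketched with a small hand-rolled stopword list (illustrative only, far smaller than NLTK's):

```python
from string import punctuation

# Tiny illustrative stopword list; stopwords.words('english') is much larger.
stoplist = {"me", "or", "but", "at", "the", "same", "a", "an", "as", "it", "if"} | set(punctuation)

# Tokens as a word_tokenize-style tokenizer might produce them.
tokens = ['please', 'help', 'me', 'ignore', 'punctuation', 'like', '.', 'or',
          'google.com', 'or', 'google.co.uk', '.', 'myname=shecode']

print([t for t in tokens if t not in stoplist])
# → ['please', 'help', 'ignore', 'punctuation', 'like', 'google.com', 'google.co.uk', 'myname=shecode']
```

Because single-character punctuation tokens are in the stoplist, the stray '.' tokens disappear, while multi-character tokens such as 'google.co.uk' and 'myname=shecode' pass through intact.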