Preserving URLs when tokenizing with nltk

Date: 2016-02-04 08:10:31

Tags: python nltk tokenize

I am using nltk to tokenize words, but I want to keep the URLs in the sentences intact. For example:

Input:

It’s word1 word2 https://www.google.com. Word3 word4 (word5). Word6 word7 http://visjs.org/#gallery word8. Word9 word10 (https://www.baidu.com). Word11-word12 word13 word14 http://visjs.org/#gallery. 

Expected output:

It s word1 word2 https://www.google.com Word3 word4 word5 Word6 word7 word8 Word9 word10 https://www.baidu.com Word11 word12 word13 word14 http://visjs.org/#gallery

I used tokens = WhitespaceTokenizer().tokenize(Input), but it cannot remove the delimiters and punctuation. Can anyone provide a solution?
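
A quick reproduction of that attempt on a shortened sample (my own sketch; WhitespaceTokenizer splits only on whitespace, so the punctuation stays attached to the tokens):

>>> from nltk.tokenize import WhitespaceTokenizer
>>> sample = "It's word1 word2 https://www.google.com. Word3 word4 (word5)."
>>> WhitespaceTokenizer().tokenize(sample)
["It's", 'word1', 'word2', 'https://www.google.com.', 'Word3', 'word4', '(word5).']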

2 Answers:

Answer 0 (score: 1)

nltk has a tokenizer, nltk.tokenize.casual_tokenize, that will do what you want, although it doesn't do the fancier things word_tokenize does, such as handling contractions.

The docs for it are here.
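
A minimal sketch of that approach (my own example, not from the answer: casual_tokenize keeps each URL as a single token but emits punctuation as separate tokens, so the filtering step at the end is an addition of mine, and the exact token boundaries may vary slightly between nltk versions):

>>> import string
>>> from nltk.tokenize import casual_tokenize
>>> sample = "It's word1 word2 https://www.google.com. Word3 word4 (word5)."
>>> tokens = casual_tokenize(sample)
>>> tokens
["It's", 'word1', 'word2', 'https://www.google.com', '.', 'Word3', 'word4', '(', 'word5', ')', '.']
>>> # Drop tokens that are pure punctuation; URLs pass through intact.
>>> [t for t in tokens if not all(c in string.punctuation for c in t)]
["It's", 'word1', 'word2', 'https://www.google.com', 'Word3', 'word4', 'word5']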

Answer 1 (score: -1)

>>> text = "It’s word1 word2 https://www.google.com. Word3 word4 (word5). Word6 word7 http://visjs.org/#gallery word8. Word9 word10 (https://www.baidu.com). Word11-word12 word13 word14 http://visjs.org/#gallery."
>>> _text = " ".join([w.strip('.,()') for w in text.split()])
>>> print _text.replace("’s", " 's")
It 's word1 word2 https://www.google.com Word3 word4 word5 Word6 word7 http://visjs.org/#gallery word8 Word9 word10 https://www.baidu.com Word11-word12 word13 word14 http://visjs.org/#gallery

>>> text = "It’s word1 word2 https://www.google.com. Word3 word4 (word5). Word6 word7 http://visjs.org/#gallery word8. Word9 word10 (https://www.baidu.com). Word11-word12 word13 word14 http://visjs.org/#gallery."
>>> _text = text.replace("’s", " 's")
>>> _text = " ".join([w.strip(".,()'") for w in _text.split()])
>>> print _text
It s word1 word2 https://www.google.com Word3 word4 word5 Word6 word7 http://visjs.org/#gallery word8 Word9 word10 https://www.baidu.com Word11-word12 word13 word14 http://visjs.org/#gallery