Split a string on punctuation (except hashtags)

Time: 2018-07-09 10:33:26

Tags: python regex string split

How can I split a string on whitespace and on any punctuation character except #?

tweet="I went on #Russia to see the world cup. We lost!"

I would like to split the string like this:

["I", "went", "on", "#Russia", "to", "see", "the", "world", "cup", "We", "lost"]

My attempt

p = re.compile(r"\w+|[^\w\s]", re.UNICODE)

does not work, because it produces "Russia" instead of "#Russia": the "#" is matched as a separate token.
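To make the failure concrete, here is a minimal sketch that runs the attempted pattern through findall. The variable name tokens is only illustrative; everything else comes from the question.

```python
import re

tweet = "I went on #Russia to see the world cup. We lost!"

# The attempted pattern: a run of word characters, OR one single
# character that is neither a word character nor whitespace.
p = re.compile(r"\w+|[^\w\s]", re.UNICODE)

tokens = p.findall(tweet)
print(tokens)
# ['I', 'went', 'on', '#', 'Russia', 'to', 'see', 'the', 'world', 'cup', '.', 'We', 'lost', '!']
```

The "#" falls into the second alternative, so it is emitted on its own, and the other punctuation marks are kept as tokens as well.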

3 answers:

Answer 0: (score: 3)

Just include "#" in the character class:

p = re.compile(r"[\w#]+", re.UNICODE)
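Since the question asks about splitting, the complement of the same character class could also be handed to re.split. This is only a sketch under the assumption that empty strings should be discarded; the names separators, parts and words are illustrative and not part of the answer.

```python
import re

tweet = "I went on #Russia to see the world cup. We lost!"

# Split on runs of characters that are neither word characters nor "#".
separators = re.compile(r"[^\w#]+")
parts = separators.split(tweet)

# The tweet ends with "!", so re.split leaves a trailing empty string.
words = [part for part in parts if part]
print(words)
# ['I', 'went', 'on', '#Russia', 'to', 'see', 'the', 'world', 'cup', 'We', 'lost']
```

Matching the tokens with findall avoids the filtering step, which is probably why the pattern above is written as a positive class.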

Answer 1: (score: 2)

With the re.findall function:

import re
tweet="I went on #Russia to see the world cup. We lost!"
words = re.findall(r'[\w#]+', tweet)
print(words)

Output:

['I', 'went', 'on', '#Russia', 'to', 'see', 'the', 'world', 'cup', 'We', 'lost']
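One caveat: [\w#]+ treats "#" as an ordinary word character, so a lone "#" or a doubled "##" also ends up in the result. If only a single leading "#" should be kept, a stricter pattern such as #?\w+ might be closer to the intent. The sample string and variable names below are made up for illustration.

```python
import re

sample = "#Russia # alone and ##double"

# "#" counts as a word character anywhere in the token.
loose = re.findall(r"[\w#]+", sample)
print(loose)   # ['#Russia', '#', 'alone', 'and', '##double']

# At most one "#", and it must be followed by word characters.
strict = re.findall(r"#?\w+", sample)
print(strict)  # ['#Russia', 'alone', 'and', '#double']
```

Which behaviour is preferable depends on how stray "#" characters should be handled, which the question does not specify.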

Answer 2: (score: 0)

Using re.sub

For example:

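import re
tweet="I went on #Russia to see the world cup. We lost!"
# Split on whitespace, then strip every character that is not a word character or "#".
# The pattern is a raw string so that \w is not treated as an invalid string escape.
res = list(map(lambda x: re.sub(r"[^\w#]", "", x), tweet.split()))
print(res)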

Output:

['I', 'went', 'on', '#Russia', 'to', 'see', 'the', 'world', 'cup', 'We', 'lost']
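Note that this approach splits on whitespace only and then deletes punctuation, so it behaves differently from the regex answers on some inputs. The sketch below wraps it in a hypothetical helper, strip_punctuation (not part of the answer), that also drops tokens left empty after the substitution.

```python
import re

def strip_punctuation(text):
    """Split on whitespace, delete non-word characters except '#', drop empties."""
    cleaned = (re.sub(r"[^\w#]", "", token) for token in text.split())
    return [token for token in cleaned if token]

# A token made only of punctuation would otherwise become an empty string.
print(strip_punctuation("We lost - badly!"))   # ['We', 'lost', 'badly']

# Words joined by punctuation without a space are merged, not split.
print(strip_punctuation("world cup.We lost"))  # ['world', 'cupWe', 'lost']
```

If input such as "cup.We" is possible, the findall-based answers are likely the safer choice.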