Question

How can I split a string on any punctuation and whitespace, except for the # character?

tweet="I went on #Russia to see the world cup. We lost!"

I want to split the string like this:

["I", "went", "on", "#Russia", "to", "see", "the", "world", "cup", "We", "lost"]

My attempt

p = re.compile(r"\w+|[^\w\s]", re.UNICODE)

It doesn't work, because it produces "Russia" instead of "#Russia".
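
A quick way to see what the attempted pattern does is to run it with findall on the sample tweet. This is only a diagnostic sketch; the variable name tokens is introduced here for illustration. The second alternative, [^\w\s], matches any single punctuation character, so # comes back as its own token, separate from Russia.

```python
import re

tweet = "I went on #Russia to see the world cup. We lost!"

# The pattern from the attempt: a run of word characters,
# OR one single character that is neither word nor whitespace.
p = re.compile(r"\w+|[^\w\s]", re.UNICODE)

tokens = p.findall(tweet)
print(tokens)
# ['I', 'went', 'on', '#', 'Russia', 'to', 'see', 'the', 'world', 'cup', '.', 'We', 'lost', '!']
```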

Answer 1

Just include "#" in the character class:

p = re.compile(r"[\w#]+", re.UNICODE)

Answer 2

Using the re.findall function:

import re
tweet="I went on #Russia to see the world cup. We lost!"
words = re.findall(r'[\w#]+', tweet)
print(words)

Output:

['I', 'went', 'on', '#Russia', 'to', 'see', 'the', 'world', 'cup', 'We', 'lost']
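
One thing to keep in mind: [\w#]+ accepts # anywhere in a token, so a lone # or a mid-word # such as a#b is kept as is. If hashtags should only be recognized when # leads the word, a stricter pattern is one option. The sketch below is an assumption about the desired behavior, not part of the answer above; the names strict and other and the second sample string are made up for illustration.

```python
import re

tweet = "I went on #Russia to see the world cup. We lost!"

# "#?\w+" allows at most one "#", and only at the start of a token.
strict = re.compile(r"#?\w+")

words = strict.findall(tweet)
print(words)
# ['I', 'went', 'on', '#Russia', 'to', 'see', 'the', 'world', 'cup', 'We', 'lost']

# Hypothetical input with stray "#" characters (not from the question).
other = strict.findall("C# rocks # a#b #1")
print(other)
# ['C', 'rocks', 'a', '#b', '#1']
```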

Answer 3

Using re.sub

For example:

import re
tweet="I went on #Russia to see the world cup. We lost!"
res = list(map(lambda x: re.sub(r"[^\w#]", "", x), tweet.split()))
print(res)

Output:

['I', 'went', 'on', '#Russia', 'to', 'see', 'the', 'world', 'cup', 'We', 'lost']
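
Note that splitting on whitespace and then stripping punctuation is not quite the same as splitting on punctuation. On the sample tweet both approaches agree, but they can differ when punctuation sits between two words or stands alone. A small comparison, using a hypothetical input that is not from the question (by_sub and by_findall are illustrative names):

```python
import re

# Hypothetical input: a hyphenated word and a free-standing "...".
text = "We won the world-cup ... in #Russia!"

# split() + re.sub: punctuation is deleted inside each whitespace-separated chunk.
by_sub = [re.sub(r"[^\w#]", "", tok) for tok in text.split()]
print(by_sub)
# ['We', 'won', 'the', 'worldcup', '', 'in', '#Russia']

# findall: punctuation acts as a separator between tokens.
by_findall = re.findall(r"[\w#]+", text)
print(by_findall)
# ['We', 'won', 'the', 'world', 'cup', 'in', '#Russia']
```

With the re.sub approach, "world-cup" is merged into one token and "..." leaves an empty string behind, so filtering out empty results may be needed depending on the data.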