How can I split a string on any punctuation and whitespace except the # character?
tweet="I went on #Russia to see the world cup. We lost!"
I want to split the string like this:
["I", "went", "on", "#Russia", "to", "see", "the", "world", "cup", "We", "lost"]
My attempt:
p = re.compile(r"\w+|[^\w\s]", re.UNICODE)
This does not work: it yields "Russia" instead of "#Russia", because the # comes out as a separate token.
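To see why the attempt falls short, here is a minimal sketch that runs the pattern through re.findall (an assumption, since the question does not show how p was used). The second alternative, [^\w\s], matches each punctuation character on its own, so # is emitted separately from Russia.

```python
import re

tweet = "I went on #Russia to see the world cup. We lost!"

# Pattern from the question: a run of word characters, OR one single
# character that is neither a word character nor whitespace.
p = re.compile(r"\w+|[^\w\s]", re.UNICODE)

tokens = p.findall(tweet)
print(tokens)
# ['I', 'went', 'on', '#', 'Russia', 'to', 'see', 'the', 'world', 'cup', '.', 'We', 'lost', '!']
```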
Answer 0 (score: 3)
Just include "#" in the character class:
p = re.compile(r"[\w#]+", re.UNICODE)
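One caveat, offered as a sketch rather than as part of the answer: [\w#]+ accepts # anywhere in a token, so a lone # or a # in the middle of a word is kept too. If # should only be allowed at the start of a token (an assumption about the intent, not something the question states), a stricter pattern such as #?\w+ may fit better. The names loose, strict, and sample below are hypothetical.

```python
import re

loose = re.compile(r"[\w#]+")   # pattern from the answer
strict = re.compile(r"#?\w+")   # assumed variant: optional leading '#', then word characters

tweet = "I went on #Russia to see the world cup. We lost!"
sample = "C# rocks # a#b"       # hypothetical edge-case input

# Both patterns agree on the tweet from the question.
tweet_tokens = strict.findall(tweet)
print(tweet_tokens == loose.findall(tweet))  # True

# They differ once '#' appears somewhere other than the start of a word.
loose_tokens = loose.findall(sample)
strict_tokens = strict.findall(sample)
print(loose_tokens)   # ['C#', 'rocks', '#', 'a#b']
print(strict_tokens)  # ['C', 'rocks', 'a', '#b']
```

Which behaviour is preferable depends on the data; for tweets, the stricter form avoids emitting a bare # as a word.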
Answer 1 (score: 2)
Using the re.findall function:
import re

tweet = "I went on #Russia to see the world cup. We lost!"
# Collect every run of word characters and '#'
words = re.findall(r'[\w#]+', tweet)
print(words)
Output:
['I', 'went', 'on', '#Russia', 'to', 'see', 'the', 'world', 'cup', 'We', 'lost']
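Since the question asks about splitting, here is a hedged sketch of the equivalent re.split form, which splits on runs of characters that are neither word characters nor #. It is not from the original answers. Note that re.split returns an empty string when the text starts or ends with a separator, so the sketch filters those out.

```python
import re

tweet = "I went on #Russia to see the world cup. We lost!"

# Split on one or more characters that are not word characters and not '#'.
raw_parts = re.split(r"[^\w#]+", tweet)
print(raw_parts[-1] == "")  # True: the trailing '!' leaves an empty last piece

# Drop the empty strings produced by leading/trailing separators.
words = [part for part in raw_parts if part]
print(words)
# ['I', 'went', 'on', '#Russia', 'to', 'see', 'the', 'world', 'cup', 'We', 'lost']
```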
Answer 2 (score: 0)
Use re.sub to strip the unwanted characters from each whitespace-separated token.
For example:
import re
tweet="I went on #Russia to see the world cup. We lost!"
res = list(map(lambda x: re.sub(r"[^\w#]", "", x), tweet.split()))
print(res)
Output:
['I', 'went', 'on', '#Russia', 'to', 'see', 'the', 'world', 'cup', 'We', 'lost']
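The re.sub approach only splits on whitespace and then deletes punctuation inside each piece, so it is not always interchangeable with re.findall. A small sketch of the difference, using a hypothetical input and hypothetical helper names:

```python
import re

def strip_per_token(text):
    # Approach from this answer: split on whitespace, then delete
    # every character that is not a word character or '#'.
    return [re.sub(r"[^\w#]", "", tok) for tok in text.split()]

def find_tokens(text):
    # findall approach: collect runs of word characters and '#'.
    return re.findall(r"[\w#]+", text)

sample = "We lost - the world-cup!"  # hypothetical input

stripped = strip_per_token(sample)
found = find_tokens(sample)
print(stripped)  # ['We', 'lost', '', 'the', 'worldcup']
print(found)     # ['We', 'lost', 'the', 'world', 'cup']
```

With punctuation inside a word, re.sub glues the halves together, and a token made only of punctuation becomes an empty string. If punctuation should act as a separator, the re.findall form is the safer choice.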