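I am trying to match every word in a string, along with its trailing punctuation, except for the pieces of a URL.
I have tried many variants, but whenever one works for the second string, it fails on the first. Here is my current attempt: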
import re
s1 = "My dog is nice! My cat not. www.test.org ?"
s2 = "I am."
regex = r"\b\w+\W* \b"
m1 = re.findall(regex, s1)
m2 = re.findall(regex, s2)
The output for m1 is correct:
['My ', 'dog ', 'is ', 'nice! ', 'My ', 'cat ', 'not. ']
The output for m2 is not what I want:
['I ']
...but I want:
['I ', 'am.']
Answer 0 (score: 0)
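You need an additional alternative in the pattern:
regex = r"\b\w+\W* \b|\b\w+\W$"
The second branch, \b\w+\W$, matches a word followed by a single non-word character at the very end of the string, which covers the case where the final period is not followed by a space.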
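To see what each branch contributes, here is a minimal sketch that runs the two alternatives separately on s2. The names first and second are only for illustration and are not part of the original code.

```python
import re

# Illustrative names, not from the original post.
first = r"\b\w+\W* \b"   # word, optional punctuation, a space, then the start of another word
second = r"\b\w+\W$"     # word plus exactly one non-word character at the end of the string

s2 = "I am."

print(re.findall(first, s2))                 # ['I ']
print(re.findall(second, s2))                # ['am.']
print(re.findall(first + "|" + second, s2))  # ['I ', 'am.']
```

The first branch cannot match "am." because nothing follows the period, so there is no space and no next word to satisfy the trailing " \b".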
Working code:
import re
s1 = "My dog is nice! My cat not. www.test.org ?"
s2 = "I am."
regex = r"\b\w+\W* \b|\b\w+\W$"
m1 = re.findall(regex, s1)
m2 = re.findall(regex, s2)
print(m1) # ['My ', 'dog ', 'is ', 'nice! ', 'My ', 'cat ', 'not. ']
print(m2) # ['I ', 'am.']
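One caveat: this pattern relies on the URL in s1 being followed by " ?". If a normal word follows the URL, the last URL fragment is picked up; for example, "Visit www.test.org today" gives ['Visit ', 'org ']. If that matters, a possible alternative is sketched below. It assumes tokens are separated by whitespace and that any token with punctuation in the middle should be skipped. The names TOKEN and match_words are hypothetical, and the pattern is not a drop-in equivalent of the one above (for instance, it also accepts a final word with no punctuation).

```python
import re

# Hypothetical alternative pattern, not the one from the answer above:
#   (?<!\S)    the token starts at the beginning of the string or after whitespace
#   \w+        the word itself
#   [^\w\s]*   optional trailing punctuation
#   (?!\S)     the token ends at whitespace or at the end of the string
#    ?         keep one trailing space, if there is one
TOKEN = re.compile(r"(?<!\S)\w+[^\w\s]*(?!\S) ?")

def match_words(text):
    """Return words with trailing punctuation, skipping tokens with inner punctuation."""
    return TOKEN.findall(text)

print(match_words("My dog is nice! My cat not. www.test.org ?"))
# ['My ', 'dog ', 'is ', 'nice! ', 'My ', 'cat ', 'not. ']
print(match_words("I am."))                     # ['I ', 'am.']
print(match_words("Visit www.test.org today"))  # ['Visit ', 'today']
```

The lookarounds check the characters on both sides of a token without consuming them, so "www.test.org" is rejected as a whole instead of being split at the dots.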