Question

I want to extract the words from a text string in Python.

import re

s = "The saddest aspect of life right now is: science gathers knowledge faster than society gathers wisdom."

result = re.sub("\b[^\w\d_]+\b", " ", s).split()
print(result)

This is what I get:

['The', 'saddest', 'aspect', 'of', 'life', 'right', 'now', 'is:', 'science', 'gathers', 'knowledge', 'faster', 'than', 'society', 'gathers', 'wisdom.']

How can I get "is" instead of "is:" for strings that happen to contain a colon? I thought using \b would be enough...

Answer 1

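You forgot to make the pattern a raw string literal (r"..."). Without the r prefix, Python reads "\b" as a backspace character rather than passing the regex word-boundary escape through, so the pattern never matches anything in your string.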

>>> import re
>>> s = "The saddest aspect of life right now is: science gathers knowledge faster than society gathers wisdom."
>>> re.sub("\b[^\w\d_]+\b", " ",  s ).split()
['The', 'saddest', 'aspect', 'of', 'life', 'right', 'now', 'is:', 'science', 'gathers', 'knowledge', 'faster', 'than', 'society', 'gathers', 'wisdom.']
>>> re.sub(r"\b[^\w\d_]+\b", " ",  s ).split()
['The', 'saddest', 'aspect', 'of', 'life', 'right', 'now', 'is', 'science', 'gathers', 'knowledge', 'faster', 'than', 'society', 'gathers', 'wisdom.']
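
To make the difference concrete, here is a minimal sketch (the variable names are only illustrative) comparing what the two literals actually contain before the regex engine sees them:

```python
import re

plain = "\b"   # ordinary literal: Python turns \b into one backspace character
raw = r"\b"    # raw literal: backslash and 'b' reach the regex engine untouched

lengths = (len(plain), len(raw))
is_backspace = plain == "\x08"

# With a raw pattern, \b is a word boundary, so whole words are matched.
words = re.findall(r"\b\w+\b", "is: science")

print(lengths, is_backspace, words)
```

The ordinary literal is a single character, which is why the original pattern was looking for literal backspaces around the punctuation and found none.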

Answer 2

I think you meant to pass a raw string to re.sub (note the r prefix).

result = re.sub(r"\b[^\w\d_]+\b", " ",  s ).split()

This returns:

['The', 'saddest', 'aspect', 'of', 'life', 'right', 'now', 'is', 'science', 'gathers', 'knowledge', 'faster', 'than', 'society', 'gathers', 'wisdom.']

Answer 3

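As the other answers point out, you need the r prefix to define a raw string literal, like this: r"..."

If you also want to strip the trailing period, I believe you can simplify the regex to: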

result = re.sub(r"[^\w' ]", " ", s ).split()

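As you probably know, \w matches a-z, A-Z, 0-9 and the underscore, so the negated class [^\w' ] replaces every character that is not a word character, an apostrophe, or a space.

Note that digits and underscores are therefore kept as part of words. If you can expect your sentences not to contain numbers, this should do the trick.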
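
As a quick check of the simplified pattern, the sketch below runs it on the sentence from the question and on a made-up sample containing digits, an underscore, and an apostrophe. The re.findall variant is an alternative not mentioned in the answers; it collects the wanted characters directly instead of replacing the unwanted ones.

```python
import re

s = "The saddest aspect of life right now is: science gathers knowledge faster than society gathers wisdom."

# Replace everything that is not a word character, apostrophe, or space.
stripped = re.sub(r"[^\w' ]", " ", s).split()

# Alternative (an assumption, not from the answers): match the words directly.
found = re.findall(r"[\w']+", s)

# Hypothetical sample showing that digits, underscores, and apostrophes survive.
sample = re.sub(r"[^\w' ]", " ", "room 101_a: don't panic.").split()

print(stripped)
print(sample)
```

On this input both approaches give the same list, with "is" and "wisdom" free of punctuation, while "101_a" and "don't" stay intact in the sample.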
