How do I split on word boundaries with a regular expression?

Asked: 2016-05-15 11:17:54

Tags: python regex nlp

I tried this:

import re
sentence = "How are you?"
print(re.split(r'\b', sentence))

The result is

[u'How are you?']

I want something like [u'How', u'are', u'you', u'?']. How can I achieve this?

2 answers:

Answer 0: (score: 9)

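Unfortunately, before Python 3.7, re.split could not split on an empty (zero-width) match such as \b.

To work around this, use findall instead of split.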
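For readers on newer interpreters: since Python 3.7, re.split does split on zero-width matches, so the original attempt can be made to work by filtering out the whitespace-only pieces. A minimal sketch, assuming Python 3.7 or later (the filtering step is an illustration, not part of the original answer):

```python
import re

sentence = "How are you?"

# On Python 3.7+, re.split accepts patterns that match the empty string,
# so the string is cut at every \b position.
parts = re.split(r'\b', sentence)
print(parts)   # ['', 'How', ' ', 'are', ' ', 'you', '?']

# Drop empty and whitespace-only pieces to keep words and punctuation.
tokens = [p for p in parts if p.strip()]
print(tokens)  # ['How', 'are', 'you', '?']
```

On older versions this call does not split the string, which is why findall is the more portable choice.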

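\b itself only denotes a word boundary: a zero-width position, not a character.

It is roughly equivalent to (?<=\w)(?=\W)|(?<=\W)(?=\w), except that \b also matches at the start or end of the string when the adjacent character is a word character.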
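To see what "word boundary" means in practice, the sketch below lists the positions where \b matches and compares them with the lookaround form quoted above. The variable names are illustrative only. Note that the lookaround form as written needs a character on both sides, so it does not report position 0.

```python
import re

sentence = "How are you?"

# Positions where \b matches (zero-width, so start == end).
boundary = [m.start() for m in re.finditer(r'\b', sentence)]
print(boundary)    # [0, 3, 4, 7, 8, 11]

# Positions matched by the lookaround form.
lookaround = [
    m.start()
    for m in re.finditer(r'(?<=\w)(?=\W)|(?<=\W)(?=\w)', sentence)
]
print(lookaround)  # [3, 4, 7, 8, 11]
```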

This means the following code will work:

import re
sentence = "How are you?"
print(re.findall(r'\w+|\W+', sentence))
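For reference, this pattern also returns the runs of whitespace between words, which is slightly different from the output the question asks for. The sketch below shows the result and one possible variant that skips whitespace; the variant pattern is an assumption added for illustration, not part of the original answer.

```python
import re

sentence = "How are you?"

# Pattern from the answer: runs of word characters or runs of non-word
# characters. Spaces are non-word characters, so they are kept.
tokens = re.findall(r'\w+|\W+', sentence)
print(tokens)           # ['How', ' ', 'are', ' ', 'you', '?']

# Illustrative variant: the second alternative excludes whitespace,
# so spaces match neither branch and are skipped.
tokens_no_space = re.findall(r'\w+|[^\w\s]+', sentence)
print(tokens_no_space)  # ['How', 'are', 'you', '?']
```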

Answer 1: (score: 2)

import re
split = re.findall(r"[\w']+|[.,!?;]", "How are you?")
print(split)

Output:

['How', 'are', 'you', '?']

Regex explanation:

"[\w']+|[.,!?;]"

    1st Alternative: [\w']+
        [\w']+ match a single character present in the list below
            Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
            \w match any word character [a-zA-Z0-9_] (for str patterns in Python 3, also Unicode letters and digits unless re.ASCII is set)
            ' the literal character '
    2nd Alternative: [.,!?;]
        [.,!?;] match a single character present in the list below
            .,!?; a single character in the list .,!?; literally
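The breakdown above can be written directly into the pattern with re.VERBOSE. The sketch below does that and runs it on two sample sentences chosen for illustration (they are not from the original answer). The second sample shows a limitation: characters that appear in neither alternative, such as a colon, are silently dropped.

```python
import re

# Same pattern as in the answer, annotated using re.VERBOSE.
TOKEN = re.compile(r"""
    [\w']+      # 1st alternative: one or more word characters or apostrophes
  | [.,!?;]     # 2nd alternative: a single punctuation mark from this set
""", re.VERBOSE)

# Apostrophes stay inside the word because ' is part of the first class.
contractions = TOKEN.findall("Don't panic, it's fine!")
print(contractions)  # ["Don't", 'panic', ',', "it's", 'fine', '!']

# The colon is in neither class, so it does not appear in the result.
dropped = TOKEN.findall("Wait: really?")
print(dropped)       # ['Wait', 'really', '?']
```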