Question

This is what I have so far:

import re

text = "Hello world. It is a nice day today. Don't you think so?"
re.findall(r'\w{3,}\s{1,}\w{3,}', text)
# ['Hello world', 'nice day', 'you think']

The desired output would be ['Hello world', 'nice day', 'day today', "today Don't", "Don't you", 'you think']

Can this be done with a simple regular expression?

Answer 1

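list(map(lambda x: x[0] + x[1], re.findall(r'(\w{3,}(?=(\s{1,}\w{3,})))', text)))
# ['Hello world', 'nice day', 'day today', 'you think']

You can probably write the lambda more concisely (something like just '+'). Also, by the way, the apostrophe ' is matched by neither \w nor \s.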

Answer 2

Something like this, with additional checks for the list bounds, should work:

>>> text = "Hello world. It is a nice day today. Don't you think so?"
>>> k = text.split()
>>> k
['Hello', 'world.', 'It', 'is', 'a', 'nice', 'day', 'today.', "Don't", 'you', 'think', 'so?']
>>> z = [x for x in k if len(x) > 2]
>>> z
['Hello', 'world.', 'nice', 'day', 'today.', "Don't", 'you', 'think', 'so?']

>>> [z[n]+ " " + z[n+1] for n in range(0, len(z)-1, 2)]
['Hello world.', 'nice day', "today. Don't", 'you think']
>>>

Answer 3

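There are two problems with your approach:

Neither \w nor \s matches punctuation.
When you match a string against a regular expression with findall, the matched part of the string is consumed. The search for the next match starts immediately after the end of the previous match. As a result, a word cannot be included in two separate matches.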
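Both points can be seen in a short experiment. This is only an illustrative sketch using the text from the question; the variable names are made up for the example.

```python
import re

text = "Hello world. It is a nice day today. Don't you think so?"

# Point 1: the apostrophe is neither \w nor \s, so "Don't" is split in two.
apostrophe_split = re.findall(r"\w+", "Don't")
print(apostrophe_split)  # ['Don', 't']

# Point 2: findall reports non-overlapping matches. Once "nice day" has
# matched, scanning resumes after "day", so "day today" is never reported.
pairs = re.findall(r"\w{3,}\s+\w{3,}", text)
print(pairs)  # ['Hello world', 'nice day', 'you think']
```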

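To solve the first problem, you need to decide what counts as a word. Regular expressions are not well suited to this kind of parsing. You may want to look at a natural language parsing library.

But assuming you can come up with a regular expression that fits your needs, you can solve the second problem by using a lookahead assertion to check for the second word. This will not return the whole match you want, but with this approach you can at least find the first word of each word pair.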

 re.findall('\w{3,}(?=\s{1,}\w{3,})',text)
                   ^^^            ^
                  lookahead assertion
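As a possible extension of the lookahead idea, a capturing group placed inside the assertion lets findall return the second word as well without consuming it. The sketch below is one way to do that, not necessarily what the answer's author had in mind. It keeps the question's \w{3,} definition of a word, so "Don't" still never pairs up.

```python
import re

text = "Hello world. It is a nice day today. Don't you think so?"

# The lookahead alone yields only the first word of each pair.
first_words = re.findall(r"\w{3,}(?=\s+\w{3,})", text)
print(first_words)  # ['Hello', 'nice', 'day', 'you']

# A group inside the lookahead captures the second word without consuming
# it, so that word can still start the next match.
pairs = [
    f"{first} {second}"
    for first, second in re.findall(r"(\w{3,})(?=\s+(\w{3,}))", text)
]
print(pairs)  # ['Hello world', 'nice day', 'day today', 'you think']
```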

Answer 4

import itertools as it
import re 

three_pat=re.compile(r'\w{3}')
text = "Hello world. It is a nice day today. Don't you think so?"
for key,group in it.groupby(text.split(),lambda x: bool(three_pat.match(x))):
    if key:
        group=list(group)       
        for i in range(0,len(group)-1):
            print(' '.join(group[i:i+2]))

# Hello world.
# nice day
# day today.
# today. Don't
# Don't you
# you think

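It's not clear to me what you want to do with punctuation. On the one hand, it looks like you want the periods removed, but on the other you want to keep the single quotes. Removing the periods is easy to implement, but before that, could you clarify what you want to happen to all the punctuation?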
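If the intent is to drop punctuation at the edges of a word while keeping apostrophes inside it, a sketch along the following lines reproduces the output listed in the question. The punctuation rule here is an assumption, since the question does not spell it out, and this version uses plain string methods rather than a regular expression.

```python
import string

text = "Hello world. It is a nice day today. Don't you think so?"

# Assumption: strip punctuation from the ends of each token only, so
# "today." becomes "today" while "Don't" keeps its inner apostrophe.
words = [token.strip(string.punctuation) for token in text.split()]

# Pair each word with its immediate neighbour in the original text and
# keep the pair only when both words have at least three characters.
pairs = [
    f"{first} {second}"
    for first, second in zip(words, words[1:])
    if len(first) >= 3 and len(second) >= 3
]
print(pairs)
# ['Hello world', 'nice day', 'day today', "today Don't", "Don't you", 'you think']
```

Note that this pairs words across sentence boundaries ("today Don't"), which is what the desired output in the question shows.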

Python regex: find all word groups