Question

I need to find every combination of 2 consecutive title-case words.

This is my code so far:

import re

text='Hi my name is Moh Shai and This Is a Python Code with Regex and Needs Some Expertise'

rex=r'[A-Z][a-z]+\s+[A-Z][a-z]+'

re.findall(rex,text)

This gives me:

['Moh Shai', 'This Is', 'Python Code', 'Needs Some']

However, I need all of the combinations, like this:

['Moh Shai', 'This Is', 'Python Code', 'Needs Some','Some Expertise']

Can someone help?

Answer 1

You can combine a regex lookahead with the re.finditer function to get the desired result:

import re

text='Hi my name is Moh Shai and This Is a Python Code with Regex and Needs Some Expertise'
rex=r'(?=([A-Z][a-z]+\s+[A-Z][a-z]+))'

matches = re.finditer(rex,text)
results = [match.group(1) for match in matches]

results now contains what you need:

>>> results
['Moh Shai', 'This Is', 'Python Code', 'Needs Some', 'Some Expertise']

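Edit: For what it's worth, you don't even need the finditer function. You can replace the bottom two lines with the earlier re.findall(rex,text) call and get the same result, because findall returns the contents of the capture group when the pattern has exactly one.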
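As a minimal sketch of the findall variant: a lookahead is zero-width, so the scan moves forward one character at a time and the capture group can report pairs that overlap. The `\b` anchors are an addition not present in the answer above; they are assumed here so that a pair cannot start in the middle of a mixed-case word.

```python
import re

text = 'Hi my name is Moh Shai and This Is a Python Code with Regex and Needs Some Expertise'

# The lookahead consumes no characters, so 'Some' can end one pair
# and start the next. The \b anchors are an assumption added for this sketch.
rex = r'(?=\b([A-Z][a-z]+\s+[A-Z][a-z]+)\b)'

# With exactly one capture group, findall returns the group contents.
pairs = re.findall(rex, text)
print(pairs)
# ['Moh Shai', 'This Is', 'Python Code', 'Needs Some', 'Some Expertise']
```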

Answer 2

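I came to this question because of its title, and was disappointed that the solution did not do what I expected.

The accepted answer only works for titles of exactly 2 words.

This code returns all title-case tokens without making any assumption about the number of words in the title: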

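import re, collections

def title_case_to_token(c):
    # s is: leading delimiter + title-case run + trailing delimiter + one following character
    totoken = lambda s: s[0] + "<" + s[1:-2].replace(" ", "_") + ">" + s[-2:]
    # Pad with " " and " x" so runs at the very start or end still match; the slice removes the padding
    tokenized = re.sub(r"([\s.,;]([A-Z][a-z]+[\s.,;])+[^A-Z])", lambda m: totoken(m.group(0)), " " + c + " x")[1:-2]
    # Raw strings avoid invalid-escape warnings for \s and \w
    tokens = collections.Counter(re.findall(r"<\w+>", tokenized))
    return (tokens, tokenized)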

For example:

text='Hi my name is Moh Shai and This Is a Python Code with Regex and Needs Some Expertise'
tokens, tokenized = title_case_to_token(text)

Value of tokens:

Counter({'<Hi>': 1, '<Moh_Shai>': 1, '<This_Is>': 1, '<Python_Code>': 1, '<Regex>': 1, '<Needs_Some_Expertise>': 1})

Note that `Needs_Some_Expertise` is also captured by this regex, even though it has 3 words.

Value of tokenized:

<Hi> my name is <Moh_Shai> and <This_Is> a <Python_Code> with <Regex> and <Needs_Some_Expertise>
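If only the runs themselves are needed, and not the rewritten text, a shorter variant may be enough. The sketch below is not the function above: it assumes words are separated by whitespace only (no `.`, `,` or `;` handling), and the names `run_pattern`, `runs` and `pairs` are made up for illustration. It also shows how the overlapping 2-word pairs from the question could be derived from each run.

```python
import re

text = 'Hi my name is Moh Shai and This Is a Python Code with Regex and Needs Some Expertise'

# One title-case word followed by any number of further title-case words.
# Assumption: words are separated by whitespace only.
run_pattern = r'\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b'

runs = re.findall(run_pattern, text)
print(runs)
# ['Hi', 'Moh Shai', 'This Is', 'Python Code', 'Regex', 'Needs Some Expertise']

# Adjacent word pairs inside each run; single-word runs contribute nothing.
pairs = [' '.join(p) for run in runs for p in zip(run.split(), run.split()[1:])]
print(pairs)
# ['Moh Shai', 'This Is', 'Python Code', 'Needs Some', 'Some Expertise']
```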

Answer 3

If you can install a third-party module, the simplest way is the regex module, which supports an overlapped=True flag on findall().
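A minimal sketch of that approach, assuming the third-party `regex` package is installed (for example via `pip install regex`). The fallback branch is an addition for this sketch so that it still runs without the package; it wraps the same pattern in a standard-library lookahead.

```python
import re

text = 'Hi my name is Moh Shai and This Is a Python Code with Regex and Needs Some Expertise'
pattern = r'[A-Z][a-z]+\s+[A-Z][a-z]+'

try:
    import regex  # third-party package, not part of the standard library
    # overlapped=True lets a new match begin inside the previous one
    pairs = regex.findall(pattern, text, overlapped=True)
except ImportError:
    # Fallback (assumption for this sketch): same pattern inside a lookahead
    pairs = re.findall(r'(?=(' + pattern + r'))', text)

print(pairs)
# ['Moh Shai', 'This Is', 'Python Code', 'Needs Some', 'Some Expertise']
```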

Regex for title case - Python
