Extracting possessive words and words in parentheses with Python regexp

Time: 2015-11-15 02:57:11

Tags: python regex

It is easy to extract each of them separately:

re.findall(r"\((\w+)\)", "It's Jane's cat Jack (male)") #1
re.findall(r"(?<=\()\w+(?=\))", "It's Jane's cat Jack (male)") #2
# ['male']

re.findall(r"\w+(?='s)", "It's Jane's cat Jack (male)")
# ['It', 'Jane']

re.findall(r"\S+", "It's Jane's cat Jack (male)")
# ["It's", "Jane's", 'cat', 'Jack', '(male)']

However, combining them confuses me:

re.findall(r"\((\w+)\)|\w+(?='s)|\S+", "It's Jane's cat Jack (male)") #1
re.findall(r"(?<=\()\w+(?=\))|\w+(?='s)|\S+", "It's Jane's cat Jack (male)") #2
# ['It', "'s", 'Jane', "'s", 'cat', 'Jack', '(male)']

It never produces:

# ['It', 'Jane', 'cat', 'Jack', 'male']

By the way, which is better, #1 or #2? They produce the same result.

Thanks for reading and replying.

1 answer:

Answer 0: (score: 2)

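The stray 's shows up because \S+ matches one or more non-whitespace characters, so once \w+(?='s) has consumed the word, \S+ picks up the remaining 's as a match of its own. As for your two approaches, you have to use the second one in the combined pattern: the first contains a capturing group, so re.findall returns the contents of that group, which gives you the string male plus an empty string for every match that came from one of the other alternatives. You can try either of the following: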

>>> re.findall(r"(?<=\()\w+(?=\))|\w+(?='s)|(?<!\S)\w+(?!\S)", "It's Jane's cat Jack (male)")
['It', 'Jane', 'cat', 'Jack', 'male']
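
To make the difference easier to see, here is a minimal sketch that runs the question's combined pattern (#2) next to the one above. The variable names `loose`, `strict`, `loose_result` and `strict_result` are only illustrative and do not come from the original post.

```python
import re

text = "It's Jane's cat Jack (male)"

# Third alternative is \S+: it can start in the middle of a token, so after
# "It" has been consumed it matches the leftover "'s", and it also swallows
# "(male)" whole because the lookbehind cannot succeed at the "(".
loose = r"(?<=\()\w+(?=\))|\w+(?='s)|\S+"
loose_result = re.findall(loose, text)
print(loose_result)

# Third alternative is (?<!\S)\w+(?!\S): the word must be bounded by
# whitespace or the string ends on both sides, so "'s" and "(male)" no
# longer qualify and only the first two alternatives can pick up those words.
strict = r"(?<=\()\w+(?=\))|\w+(?='s)|(?<!\S)\w+(?!\S)"
strict_result = re.findall(strict, text)
print(strict_result)
```

The first call prints the list with the stray 's entries from the question, while the second prints the five plain words.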

>>> [i for i in re.split(r"\s*(?:[()]|'s|\s)\s*", "It's Jane's cat Jack (male)") if i]
['It', 'Jane', 'cat', 'Jack', 'male']
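
The capturing-group behaviour mentioned above can be checked directly. The following is a small sketch, using the question's pattern #1 unchanged; the names `with_group`, `group_result` and `full_matches` are hypothetical and chosen only for this illustration.

```python
import re

text = "It's Jane's cat Jack (male)"

# Pattern #1 from the question: a single capturing group around the
# parenthesised word, combined with two group-free alternatives.
with_group = r"\((\w+)\)|\w+(?='s)|\S+"

# With exactly one group in the pattern, findall returns that group's
# contents for every match. The group is empty whenever another
# alternative produced the match.
group_result = re.findall(with_group, text)
print(group_result)

# finditer yields match objects, so the full matched text is still
# available through group(0) for comparison.
full_matches = [m.group(0) for m in re.finditer(with_group, text)]
print(full_matches)
```

Here `group_result` is six empty strings followed by 'male', whereas `full_matches` shows what each alternative actually matched. This is why the lookaround version (#2), which has no capturing group, is the one to build on.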