Question

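Up until now, I've never found regular expressions very difficult. I'm hoping the solution isn't obvious, because I've probably spent a few hours on this problem.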

This is my string:

<b>Carson Daly</b>: <a href="https://rads.stackoverflow.com/amzn/click/com/B009DA74O8" rel="nofollow noreferrer">Ben Schwartz</a>, Soko, Jacob Escobedo (R 2/28/14)<br>'

I want to extract 'Soko' and 'Jacob Escobedo' as individual strings. If it takes two different patterns to do the extraction, that's fine with me.

I've tried "\s([A-Za-z0-9]{1}.+?)," and other variations of that regex to get the data I want, but I've had no success. Any help is appreciated.

The names are never followed by the same tag or the same symbol. The only thing that always precedes a name is a space (\s).

Here is another string as an example:

<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>

Answer 1

An alternative approach is to parse the string with an HTML parser, such as lxml.

For example, you can use xpath to find everything between the b tag with the text Carson Daly and the br tag, by checking the preceding and following siblings:

from lxml.html import fromstring

l = [
    """<b>Carson Daly</b>: <a href="http://rads.stackoverflow.com/amzn/click/B009DA74O8">Ben Schwartz</a>, Soko, Jacob Escobedo (R 2/28/14)<br>'""",
    """<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>"""
]

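for html in l:
    tree = fromstring(html)
    results = ''
    # Select every node (element or text) that sits after <b>Carson Daly</b> and before a <br>
    for element in tree.xpath('//node()[preceding-sibling::b="Carson Daly" and following-sibling::br]'):
        if not isinstance(element, str):
            # An element such as <a>: take its inner text
            results += element.text.strip()
        else:
            # A text node: drop the colon that follows the <b> tag
            text = element.strip(':')
            if text:
                results += text.strip()

    print(results.split(', '))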

Prints:

['Ben Schwartz', 'Soko', 'Jacob Escobedo (R 2/28/14)']
['Wil Wheaton', 'the Birds of Satan', 'Courtney Kemp Agboh']

Answer 2

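If you want to do this in regex (with all the disclaimers that come with that subject), the following regex works on your strings. Note, however, that you need to retrieve the matches from capture group 1. In the online demo, make sure you look at the Group 1 captures in the lower-right pane. :)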

<[^<]*</[^>]*>|<.*?>|((?<=,\s)\w[\w ]*\w|\w[\w ]*\w(?=,))

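Basically, with the alternations on the left (separated by |) we match everything we don't want, and then the final parentheses on the right capture what we do want.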

This is an application of this question about matching a pattern except in certain situations (see it for implementation details, including links to Python code).
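As a minimal sketch of the group 1 retrieval in Python, assuming the standard re module and the two sample strings from the question: the names PATTERN and extract_names are made up for illustration. Matches from the left-hand alternations leave group 1 empty, so those are simply skipped.

```python
import re

# The regex from this answer: the first two alternations swallow tags
# (and tag pairs with their content); only the last one captures group 1.
PATTERN = re.compile(
    r'<[^<]*</[^>]*>|<.*?>|((?<=,\s)\w[\w ]*\w|\w[\w ]*\w(?=,))'
)


def extract_names(html):
    """Return the group 1 captures, ignoring matches where group 1 is unset."""
    return [m.group(1) for m in PATTERN.finditer(html) if m.group(1)]


first = ('<b>Carson Daly</b>: <a href="https://rads.stackoverflow.com/amzn/click/com/B009DA74O8" '
         'rel="nofollow noreferrer">Ben Schwartz</a>, Soko, Jacob Escobedo (R 2/28/14)<br>\'')
second = '<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>'

print(extract_names(first))   # ['Soko', 'Jacob Escobedo']
print(extract_names(second))  # ['Wil Wheaton', 'the Birds of Satan', 'Courtney Kemp Agboh']
```

Note that Ben Schwartz is not returned for the first string, because the whole a element is consumed by the first alternation; that matches what the question asks for.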

Regex: finding names in a string with Python
