Question

Here is an example:

<li><a href="link" target="_parent">1. Tips and tricks</a></li>

The regular expression:

/tips(?![^<]*>)/ig

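matches the word "tips".

What I would like to do is also match the surrounding text, perhaps in another group?

So the match could be, e.g., ["1. Tips and tricks", "Tips"].

You can test it here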

Answer 1

I think this is what you are trying to do:

>>> import re
>>> html = '<li><a href="link" target="_parent">1. Tips and tricks</a></li>'
>>> m = re.findall(r'((?<=>)\d+\.\s*(Tips)[^<]*)', html)
>>> m
[('1. Tips and tricks', 'Tips')]

Or, when the markup is spread over several lines:

>>> html = """
... <li>
... <a href="link" target="_parent">
... 1. Tips and tricks
... </a>
... </li>"""
>>> m = re.findall(r'\s*<a[^>]*>\n(\s*\S*\s*(\S*)[^\n]*)', html)
>>> m
[('1. Tips and tricks', 'Tips')]
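
Both patterns above depend on the sample's exact shape (the "1." prefix, or the line layout). A slightly more general variant is sketched below, assuming the text of interest contains no `<` or `>` characters. The helper name `match_with_context` is made up for this illustration. The outer group captures the whole text node and the nested group captures the keyword.

```python
import re

def match_with_context(html, word):
    """Return (surrounding text, matched word) pairs for text nodes containing word."""
    # Outer group: a run of non-tag characters that directly follows a '>'.
    # Inner group: the keyword, matched case-insensitively as a whole word.
    pattern = re.compile(
        r'(?<=>)([^<>]*\b(' + re.escape(word) + r')\b[^<>]*)',
        re.IGNORECASE,
    )
    return [(outer.strip(), inner) for outer, inner in pattern.findall(html)]

html = '<li><a href="link" target="_parent">1. Tips and tricks</a></li>'
print(match_with_context(html, 'tips'))
```

Because the lookbehind requires a `>` immediately before the captured text, a keyword that only appears inside an attribute value is not reported. Only the first occurrence of the keyword in each text node is returned.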

Answer 2

Based on your comment, I think it would be simpler to use BeautifulSoup and then clean up the result with re.split:

from bs4 import BeautifulSoup
import re

html = """<li class="selected ">
<a href="http://localhost:8888/translate_url" target="_parent">
          Learn the Basics: get iniciared
        </a>
<ul class="subtopics">
<li>
<a href="http://localhost:8888/translate_url" target="_parent">
                Tips and tricks
                </a>
</li>
<li>
<a href="http://localhost:8888/translate_url" target="_parent">
                Use bookmarks
                </a>
</li>"""

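# Name the parser explicitly; bs4 warns when it has to guess one.
soup = BeautifulSoup(html, "html.parser")
# Runs of two or more whitespace characters separate the text nodes.
text = re.split(r'\s{2,}', soup.get_text().strip())
print(text)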

Output:

['Learn the Basics: get iniciared', 'Tips and tricks', 'Use bookmarks']

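soup.get_text() returns all of the text in the page. strip() then removes the leading and trailing whitespace, so that re.split does not produce empty strings at either end of the list.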
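
To get back to the original goal of pairing the keyword with its surrounding text, each extracted text node can be searched on its own. The sketch below swaps BeautifulSoup for the standard library's html.parser.HTMLParser so that it runs without third-party packages. The names `TextCollector` and `find_in_text` are invented for this example, and the sample markup uses a made-up href.

```python
import re
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect the non-empty text nodes of a document, whitespace-trimmed."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.texts.append(stripped)

def find_in_text(html, word):
    """Return [surrounding text, matched word] for each text node containing word."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered by the parser
    pattern = re.compile(r'\b' + re.escape(word) + r'\b', re.IGNORECASE)
    results = []
    for text in collector.texts:
        match = pattern.search(text)
        if match:
            results.append([text, match.group(0)])
    return results

sample = '<li><a href="tips.html" target="_parent">1. Tips and tricks</a></li>'
print(find_in_text(sample, 'tips'))
```

Since the parser hands over only text nodes, attribute values such as the href never reach the regular expression, so no lookahead is needed to skip tags.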

Answer 3

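The Python documentation for the re module states:

Subgroups are numbered from left to right, from 1 upward. Groups can be nested; to determine the number, just count the opening parenthesis characters, going from left to right.

So, for example, the following (ugly) pattern captures the surrounding text, including the target word, in one group for the link in your example: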

/[^\n\s](.*basics(?![^<]*>).*)\n/ig

You can refine it for your case!
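
One possible refinement, sketched in Python rather than in the /.../ig literal syntax used above: in the pattern above the leading [^\n\s] sits outside the parentheses, so the character it matches is not part of the group. The variant below is an assumption about what a tidier version could look like, not the pattern from this answer. It nests a second group for the keyword, so the group numbers follow the left-to-right rule quoted from the documentation. The sample markup, including its href, is made up.

```python
import re

# Group 1 (first opening parenthesis): the line of text, without padding.
# Group 2 (second opening parenthesis, nested in group 1): the keyword.
# The negative lookahead rejects a keyword that sits inside a tag.
pattern = re.compile(
    r'^[ \t]*(.*?\b(basics)\b(?![^<]*>).*?)[ \t]*$',
    re.IGNORECASE | re.MULTILINE,
)

html = """<a href="basics.html" target="_parent">
          Learn the Basics: get iniciared
        </a>"""

match = pattern.search(html)
print(match.group(1), '|', match.group(2))
```

The occurrence of "basics" inside the href is skipped because the lookahead finds a closing `>` before any `<`, so the first match is the text line.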

Edit: parsing HTML with regular expressions is still a really bad idea; something like BeautifulSoup would be more robust.

Regex: match the surrounding text in a group
