Question

I need to read many pages from a website and extract all links with class "active" using a regex. These tags can have the class attribute BEFORE or AFTER the href value.

My code is:

    try:
        p = requests.get(url, timeout=4.0)
    except:
        p = None
    if p and p.content and p.status_code < 400:
        canonical_url = re.search('<a class="active" href="(.*)?"', p.content, flags=re.MULTILINE|re.IGNORECASE|re.DOTALL|re.UNICODE)

but with this regex I can catch only links with class active BEFORE the HREF and not AFTER. Thanks.

Answer 1

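Since the OP specified the following in the comments below the question, regex can be used here. Be careful, though: regex breaks easily when it is used to parse HTML.

I was using BS4, but my boss asked me to use regex because BS4 is overkill for extracting a simple link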

See regex in use here

<a\b(?=[^>]* class="[^"]*(?<=[" ])active[" ])(?=[^>]* href="([^"]*))

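<a matches literally
\b asserts the position is a word boundary
(?=[^>]* class="[^"]*(?<=[" ])active[" ]) is a positive lookahead ensuring that what follows matches:
- [^>]* matches any character except >, any number of times
-  class=" matches literally (including the leading space)
- [^"]* matches any character except ", any number of times
- (?<=[" ]) is a positive lookbehind ensuring that the preceding character is one of the characters in the set (" or a space)
- active matches literally
- [" ] matches either character in the set
(?=[^>]* href="([^"]*)) is a positive lookahead ensuring that what follows matches:
- [^>]* matches any character except >, any number of times
-  href=" matches literally (including the leading space)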
- ([^"]*) captures any character except ", any number of times, into capture group 1

Given the following samples, only the first 3 match:

    <a class="active" href="something">
    <a href="something" class="active">
    <a href="something" class="another-class active some-other-class">
    <a class="inactive" href="something">
    <a not-class="active" href="something">
    <a class="active" not-href="something">
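To try the pattern above from Python, a minimal sketch could look like the following. The names `ACTIVE_LINK` and `extract_active_hrefs` are hypothetical, and the href values of the six samples were changed from "something" to distinct words only so the output shows which tags matched.

```python
import re

# Pattern from the answer: two independent lookaheads after "<a",
# so class and href may appear in either order.
ACTIVE_LINK = re.compile(
    r'<a\b(?=[^>]* class="[^"]*(?<=[" ])active[" ])(?=[^>]* href="([^"]*))',
    re.IGNORECASE,
)


def extract_active_hrefs(html):
    """Return the href of every <a> tag whose class list contains "active"."""
    # With a single capture group, findall returns a list of group-1 strings.
    return ACTIVE_LINK.findall(html)


# The six samples, with illustrative href values (assumption, see lead-in).
samples = """
<a class="active" href="first">
<a href="second" class="active">
<a href="third" class="another-class active some-other-class">
<a class="inactive" href="fourth">
<a not-class="active" href="fifth">
<a class="active" not-href="sixth">
"""

print(extract_active_hrefs(samples))  # ['first', 'second', 'third']

# The pattern only knows double-quoted attributes, so this is not matched:
print(extract_active_hrefs("<a class='active' href='x'>"))  # []
```

When the HTML comes from requests, pass `p.text` (a str) rather than `p.content`, which is bytes in Python 3 and cannot be searched with a str pattern. The second call illustrates the fragility the answer warns about: single-quoted attributes are valid HTML but are silently skipped.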
