Question

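My question is a follow-up to: How to extract URL from HTML anchor element using Python3?

What if I don't know the exact URL and only have a keyword that should appear in it? How can I extract the URL from the page source in that case?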

Answer 1

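Use an HTML parser.

With BeautifulSoup, you can pass a function as the value of a keyword argument. The function is called with each tag's href value (None if the attribute is missing) and should return a truthy value for a match: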

from bs4 import BeautifulSoup

word = "test"
data = "your HTML here"
soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly to avoid a warning

for a in soup.find_all('a', href=lambda x: x and word in x):
    print(a['href'])

Or, a regular expression:

import re

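# re.escape treats the keyword as literal text, so characters like "." are not wildcards
for a in soup.find_all('a', href=re.compile(re.escape(word))):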
    print(a['href'])

Or, use a CSS selector:

# *= matches hrefs that contain the keyword (^= would match only hrefs that start with it)
for a in soup.select('a[href*="{word}"]'.format(word=word)):
    print(a['href'])
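
If installing BeautifulSoup is not an option, the same keyword filter can probably be approximated with the standard library's html.parser module. This is only a minimal sketch and not part of the original answer: the class name KeywordLinkParser, the helper find_links, and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser


class KeywordLinkParser(HTMLParser):
    """Collect href values of <a> tags that contain a keyword (hypothetical helper)."""

    def __init__(self, word):
        super().__init__()
        self.word = word
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this method
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        for name, value in attrs:
            if name == "href" and value and self.word in value:
                self.links.append(value)


def find_links(html, word):
    """Return every <a href> value in html that contains word."""
    parser = KeywordLinkParser(word)
    parser.feed(html)
    parser.close()
    return parser.links


# Illustrative input only; replace with the real page source
sample = (
    '<a href="http://example.com/test/page">one</a>'
    '<a href="http://example.com/other">two</a>'
    '<a name="anchor">three</a>'
)
print(find_links(sample, "test"))
```

The stdlib parser is less forgiving of badly broken markup than BeautifulSoup, so for real-world pages the BeautifulSoup versions above are likely the more robust choice.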

Answer 2

Try using a regular expression

import re
# content is the page source as a string
re.findall(r'(?i)href=["\']([^\s"\'<>]+)', content)
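
The pattern above returns every href value on the page, so the keyword from the question still has to be applied afterwards. One way to do that, as a rough sketch: the function name urls_with_keyword and the sample content string are assumptions added for illustration, not part of the original answer.

```python
import re

# Same pattern as in the answer, compiled once for reuse
HREF_RE = re.compile(r'(?i)href=["\']([^\s"\'<>]+)')


def urls_with_keyword(content, word):
    """Return the href values found by the regex that contain word."""
    return [url for url in HREF_RE.findall(content) if word in url]


# Illustrative input only; replace with the real page source
content = (
    '<a href="http://example.com/test/1">a</a> '
    "<a HREF='http://example.com/other'>b</a>"
)
print(urls_with_keyword(content, "test"))
```

Note that this regex matches href attributes on any tag (for example link elements), not only on anchors, and it skips unquoted attribute values, which is one reason a parser-based approach tends to be safer.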
