Question

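import urllib.request

from bs4 import BeautifulSoup


def parsehttp(url):
    # Download the page and parse it with the lxml parser.
    r = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(r, 'lxml')

    # Print the href of every <a> tag, relative or absolute.
    for link in soup.find_all('a'):
        href = link.attrs.get("href")
        print(href)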

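I want to extract all outgoing links from a website. However, the code I have now returns both relative links and outgoing links, and I only want the outgoing ones. The difference is that outgoing links contain the https part, while relative links do not. I would also like to get the "title" that comes with each link.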

Answer 1

You can use a regular expression:

import re

# Match only <a> tags whose href starts with http:// or https://
for link in soup.find_all('a', attrs={'href': re.compile(r"^https?://")}):
    href = link.attrs.get("href")
    print(href)
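
The question also asks for the "title" attached to each link, which the loop above does not print. Here is a minimal sketch, assuming "title" means the title attribute of the <a> tag (it could instead mean the link text). It uses the standard library's html.parser rather than BeautifulSoup so that it runs without third-party packages; LinkCollector and the sample HTML are hypothetical names made up for illustration.

```python
import re
from html.parser import HTMLParser

# Same idea as the regex above: keep hrefs starting with http:// or https://
ABSOLUTE_URL = re.compile(r"^https?://")


class LinkCollector(HTMLParser):
    """Collect (href, title) pairs for <a> tags whose href is absolute."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        # Skip anchors without an href and relative links.
        if href and ABSOLUTE_URL.match(href):
            # title is None when the tag has no title attribute.
            self.links.append((href, attributes.get("title")))


# Hypothetical sample markup, not taken from any real site.
sample_html = """
<a href="https://example.org/docs" title="Docs">docs</a>
<a href="/about">about</a>
<a href="http://example.net/">no title</a>
"""

collector = LinkCollector()
collector.feed(sample_html)
for href, title in collector.links:
    print(href, title)
```

With BeautifulSoup, link.get("title") inside the loop should read the same attribute and likewise gives None when it is missing.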

Answer 2

for link in soup.find_all('a'):
    href = link.attrs.get("href", "")
    if not href.startswith("https://"):
        continue
    
    print(href)

Answer 3

You can identify outgoing links by checking whether the first 5 characters of the href are https:

if href[0:5] == "https":
    pass  # outgoing link
else:
    pass  # relative link
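
A prefix check like this skips http:// links and counts absolute links back to the same site as outgoing. The sketch below is one possible alternative, assuming "outgoing" means "points to a different host than the page itself", which is stricter than the definition in the question. It uses only urllib.parse from the standard library; is_outgoing and the example URLs are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def is_outgoing(href, page_url):
    """Return True if href points to a different host than page_url."""
    if not href:
        return False
    # Resolve relative hrefs against the page URL.
    absolute = urlparse(urljoin(page_url, href))
    if absolute.scheme not in ("http", "https"):
        return False  # e.g. mailto: or javascript: links
    return absolute.netloc != urlparse(page_url).netloc


page = "https://example.com/blog/"
print(is_outgoing("https://other.example/post", page))   # different host
print(is_outgoing("/about", page))                       # relative link
print(is_outgoing("https://example.com/contact", page))  # same host
```

In the loops above, this check could stand in for the startswith or slice test, with the URL passed to parsehttp as page_url.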

How do I extract outgoing links from a website in Python?

3 answers: