Finding hyperlinks on a page in Python without BeautifulSoup

Time: 2015-12-12 03:46:14

Tags: python regex web-scraping

What I want to do is find all the hyperlinks on a web page. This is what I have so far, but it does not work:

from urllib.request import urlopen

def findHyperLinks(webpage):
    link = "Not found"
    encoding = "utf-8"
    for webpagesline in webpage:
        webpagesline = str(webpagesline, encoding)
        if "<a href>" in webpagesline:
            indexstart = webpagesline.find("<a href>")
            indexend = webpagesline.find("</a>")
            link = webpagesline[indexstart+7:indexend]
            return link
    return link

def main():
    address = input("Please enter the address of the webpage to find the hyperlinks: ")
    try:
        webpage = urlopen(address)
        link =  findHyperLinks(webpage)
        print("The hyperlinks are", link)

        webpage.close()
    except Exception as exceptObj:
        print("Error:" , str(exceptObj))

main()

2 Answers:

Answer 0 (score: 4)

There are multiple problems in your code. One of them is that you are searching for the literal string `<a href>`, which would only match an anchor tag with a bare, empty `href` attribute; real links look like `<a href="...">`.
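To see why that check never matches, note that a line containing a real link (the HTML below is a hypothetical example) never contains the bare string `<a href>`:

```python
# a hypothetical line as it would appear in real HTML
line = '<a href="http://example.com">Example</a>'

print("<a href>" in line)     # False: the tag carries a URL, not a bare href
print(line.find("<a href>"))  # -1, so the slicing in the question never runs
```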

In any case, things would be much easier and more reliable if you used an HTML parser (well, to parse HTML). Example using BeautifulSoup:

from bs4 import BeautifulSoup
from urllib.request import urlopen

soup = BeautifulSoup(urlopen(address), "html.parser")  # name a parser explicitly to avoid a warning
for link in soup.find_all("a", href=True):
    print(link["href"], link.get_text())
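Since the question asks for a solution without BeautifulSoup, the standard library's `html.parser` module can do the same job. A minimal sketch (the `LinkParser` class name and the sample HTML are illustrative, not from the original answer):

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


parser = LinkParser()
parser.feed('<p><a href="/about">About</a> <a href="http://example.com">Ex</a></p>')
print(parser.links)  # ['/about', 'http://example.com']
```

To scrape a live page, feed it the decoded response from `urlopen` instead of the sample string.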

Answer 1 (score: 0)

Without BeautifulSoup, you can use a regexp and a simple function.

from urllib.request import urlopen
import re

def find_link(url):
    response = urlopen(url)
    # decode the bytes; str() would keep the b'...' repr and escape sequences
    res = response.read().decode("utf-8", errors="replace")
    links = re.findall('(?<=<a href=")[^"]*', res)

    for x in links:
        # simply skip page bookmarks, like #about
        if x[0] == '#':
            continue

        # naive handling of root-relative URLs, like /about.html;
        # also be careful with redirects, and add more flexible
        # processing if needed
        if x[0] == '/':
            x = url + x

        print(x)

find_link('http://cnn.com')
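The relative-URL handling above is fragile (`url + x` breaks as soon as `url` ends with a path). The standard library's `urllib.parse.urljoin` resolves bookmarks, relative paths, and absolute URLs uniformly; the base URL below is just an example:

```python
from urllib.parse import urljoin

base = "http://cnn.com/world/index.html"

print(urljoin(base, "/about.html"))           # http://cnn.com/about.html
print(urljoin(base, "story.html"))            # http://cnn.com/world/story.html
print(urljoin(base, "http://example.com/x"))  # absolute URLs pass through unchanged
```

Replacing the `if x[0] == '/'` branch with `x = urljoin(url, x)` would handle all three cases at once.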