Finding hyperlinks on a page in Python without BeautifulSoup

Time: 2015-12-12 03:46:14

Tags: python regex web-scraping

What I want to do is find all the hyperlinks on a web page. This is what I have so far, but it does not work:

from urllib.request import urlopen

def findHyperLinks(webpage):
    link = "Not found"
    encoding = "utf-8"
    for webpagesline in webpage:
        webpagesline = str(webpagesline, encoding)
        if "<a href>" in webpagesline:
            indexstart = webpagesline.find("<a href>")
            indexend = webpagesline.find("</a>")
            link = webpagesline[indexstart+7:indexend]
            return link
    return link

def main():
    address = input("Please enter the address of the webpage to find the hyperlinks: ")
    try:
        webpage = urlopen(address)
        link =  findHyperLinks(webpage)
        print("The hyperlinks are", link)

        webpage.close()
    except Exception as exceptObj:
        print("Error:" , str(exceptObj))

main()

2 Answers:

Answer 0 (score: 4)

There are multiple problems in your code. One of them is that you are searching for the literal string `<a href>`, which would only match an anchor tag with a bare, empty `href` attribute; real links look like `<a href="...">`.
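To see why that check never matches, note that a line containing a real link (the HTML below is a hypothetical example) never contains the bare string `<a href>`:

```python
# a hypothetical line as it would appear in real HTML
line = '<a href="http://example.com">Example</a>'

print("<a href>" in line)     # False: the tag carries a URL, not a bare href
print(line.find("<a href>"))  # -1, so the slicing in the question never runs
```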

In any case, things would be much easier and more reliable if you used an HTML parser (well, to parse HTML). Example using BeautifulSoup:

from bs4 import BeautifulSoup
from urllib.request import urlopen

soup = BeautifulSoup(urlopen(address), "html.parser")  # name a parser explicitly to avoid a warning
for link in soup.find_all("a", href=True):
    print(link["href"], link.get_text())
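Since the question asks for a solution without BeautifulSoup, the standard library's `html.parser` module can do the same job. A minimal sketch (the `LinkParser` class name and the sample HTML are illustrative, not from the original answer):

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


parser = LinkParser()
parser.feed('<p><a href="/about">About</a> <a href="http://example.com">Ex</a></p>')
print(parser.links)  # ['/about', 'http://example.com']
```

To scrape a live page, feed it the decoded response from `urlopen` instead of the sample string.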

Answer 1 (score: 0)

Without BeautifulSoup, you can use a regexp and a simple function.

from urllib.request import urlopen
import re

def find_link(url):
    response = urlopen(url)
    # decode the bytes; str() would keep the b'...' repr and escape sequences
    res = response.read().decode("utf-8", errors="replace")
    links = re.findall('(?<=<a href=")[^"]*', res)

    for x in links:
        # simply skip page bookmarks, like #about
        if x[0] == '#':
            continue

        # naive handling of root-relative URLs, like /about.html;
        # also be careful with redirects, and add more flexible
        # processing if needed
        if x[0] == '/':
            x = url + x

        print(x)

find_link('http://cnn.com')
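The relative-URL handling above is fragile (`url + x` breaks as soon as `url` ends with a path). The standard library's `urllib.parse.urljoin` resolves bookmarks, relative paths, and absolute URLs uniformly; the base URL below is just an example:

```python
from urllib.parse import urljoin

base = "http://cnn.com/world/index.html"

print(urljoin(base, "/about.html"))           # http://cnn.com/about.html
print(urljoin(base, "story.html"))            # http://cnn.com/world/story.html
print(urljoin(base, "http://example.com/x"))  # absolute URLs pass through unchanged
```

Replacing the `if x[0] == '/'` branch with `x = urljoin(url, x)` would handle all three cases at once.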