What I want to do is find all the hyperlinks of a web page. This is what I have so far, but it doesn't work:
from urllib.request import urlopen

def findHyperLinks(webpage):
    link = "Not found"
    encoding = "utf-8"
    for webpagesline in webpage:
        webpagesline = str(webpagesline, encoding)
        if "<a href>" in webpagesline:
            indexstart = webpagesline.find("<a href>")
            indexend = webpagesline.find("</a>")
            link = webpagesline[indexstart+7:indexend]
            return link
    return link

def main():
    address = input("Please enter the address of webpage to find the hyperlinks")
    try:
        webpage = urlopen(address)
        link = findHyperLinks(webpage)
        print("The hyperlinks are", link)
        webpage.close()
    except Exception as exceptObj:
        print("Error:", str(exceptObj))

main()
Answer 0 (score: 4)
There are multiple problems in your code. One of them is that you are searching for the literal string `<a href>`, an anchor tag with a bare, empty `href` attribute, which a real link would never contain.
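To see why that literal search can never match, here is a minimal check against a hypothetical line of HTML (the sample markup is an assumption for illustration, not taken from the question's target page):

```python
# Hypothetical line of HTML, as findHyperLinks would read it
line = '<p>See <a href="https://example.com">the docs</a></p>'

# The question's test: a bare <a href> with no value never appears
print("<a href>" in line)    # False

# A real anchor tag carries a quoted value after href=
print('<a href="' in line)   # True
```

Even with the corrected substring, hand-slicing between `find()` results stays fragile (multiple links per line, attributes in any order), which is why the answer below recommends a parser.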
In any case, things become much easier and more reliable if you use an HTML parser to (well) parse HTML. An example using BeautifulSoup:
from bs4 import BeautifulSoup
from urllib.request import urlopen

# "address" is the URL string read from the user, as in the question
soup = BeautifulSoup(urlopen(address), "html.parser")
for link in soup.find_all("a", href=True):
    print(link["href"], link.get_text())
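A common follow-up is that the extracted `href` values may be relative (`/about.html`, `story.html`). The standard library's `urllib.parse.urljoin` resolves them against the page's URL; a short sketch with hypothetical paths:

```python
from urllib.parse import urljoin

# Hypothetical page URL and hrefs, for illustration only
base = "http://example.com/articles/index.html"
for href in ["/about.html", "story.html", "http://other.com/x"]:
    # urljoin leaves absolute URLs untouched and resolves
    # root-relative and document-relative paths correctly
    print(urljoin(base, href))
```

This is safer than string concatenation, which mishandles document-relative links and trailing slashes.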
Answer 1 (score: 0)
Without BeautifulSoup, you can use a regular expression and a simple function.
from urllib.request import urlopen
import re

def find_link(url):
    response = urlopen(url)
    res = str(response.read())
    my_dict = re.findall('(?<=<a href=")[^"]*', res)
    for x in my_dict:
        # simply skip page bookmarks, like #about
        if x[0] == '#':
            continue
        # simple handling of root-relative urls, like /about.html;
        # also be careful with redirects and add more flexible
        # processing, if needed
        if x[0] == '/':
            x = url + x
        print(x)

find_link('http://cnn.com')
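The regex above only matches tags written exactly as `<a href="..."`, so it misses single quotes, extra attributes before `href`, or unusual spacing. If installing BeautifulSoup is not an option, the standard library's `html.parser` is still more robust than a regex; a minimal sketch (the sample HTML fed to it is hypothetical):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, already unquoted
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

parser = LinkCollector()
# Hypothetical markup; note the anchor without href is skipped
parser.feed('<p><a href="/about.html">About</a> <a name="top"></a></p>')
print(parser.links)  # ['/about.html']
```

Because the parser tokenizes attributes for you, quoting style and attribute order no longer matter.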