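I'm trying to extract links and their text from a web page line by line, then insert each text and link into a dictionary, without using Beautiful Soup or regular expressions.
I keep getting this error:
Error: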
Traceback (most recent call last):
  File "F:/Homework7-2.py", line 13, in <module>
    link2 = link1.split("href=")[1]
IndexError: list index out of range
Code:
import urllib.request

url = "http://www.facebook.com"
page = urllib.request.urlopen(url)
mylinks = {}
links = page.readline().decode('utf-8')

for items in links:
    links = page.readline().decode('utf-8')
    if "a href=" in links:
        links = page.readline().decode('utf-8')
        link1 = links.split(">")[0]
        link2 = link1.split("href=")[1]
        mylinks = link2
        print(mylinks)
Answer 0 (score: 0)
import requests
from bs4 import BeautifulSoup

r = requests.get("http://stackoverflow.com/questions/29336915/python-scraping-webpages")
# find all <a> tags that have an href attribute
# (naming the parser explicitly avoids bs4's "no parser specified" warning)
for a in BeautifulSoup(r.content, "html.parser").find_all("a", href=True):
    # print each href
    print(a["href"])
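Obviously this is a very broad example, but it will get you started. If you want specific URLs, you can narrow the search to certain elements, though how to do that will differ from page to page. You won't find parsing tools much easier to use than requests and BeautifulSoup.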
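Since the question rules out Beautiful Soup and regular expressions, here is a minimal sketch that uses only the standard library's html.parser module. The names LinkCollector, extract_links and sample are made up for illustration and are not part of the original code. It also avoids the IndexError: the question's loop reads a fresh line after the "a href=" check, so the line it splits may not contain "href=" at all, and split("href=") then returns a one-element list. A real parser avoids guessing where a tag starts and ends on a line.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect {link text: href} pairs from <a href=...> tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = {}      # maps link text -> href
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # only keep text that appears inside an open <a href=...>
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            self.links[text] = self._href
            self._href = None


def extract_links(html):
    """Return a dict mapping each link's text to its href."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample markup, used here instead of a live page
sample = '<p><a href="/about">About us</a> and <a href="https://example.com">Example</a></p>'
print(extract_links(sample))
# {'About us': '/about', 'Example': 'https://example.com'}
```

To run it against a live page, the HTML could be fetched with urllib.request.urlopen(url).read().decode("utf-8", errors="replace") and passed to extract_links. Note that links sharing the same text overwrite each other in the dictionary, so keying by href instead may suit some pages better.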