Question

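I am trying to extract the links and their text from a web page, line by line, and then insert each text and link into a dictionary, without using Beautiful Soup or regular expressions.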

I keep getting this error:

Error:

 Traceback (most recent call last):
 File "F:/Homework7-2.py", line 13, in <module>
 link2 = link1.split("href=")[1]
 IndexError: list index out of range

Code:

import urllib.request
url = "http://www.facebook.com" 
page = urllib.request.urlopen(url)
mylinks = {}
links = page.readline().decode('utf-8')


for items in links:
  links = page.readline().decode('utf-8')
  if "a href=" in links:
     links = page.readline().decode('utf-8')
     link1 = links.split(">")[0]
     link2 = link1.split("href=")[1]
     mylinks = link2
     print(mylinks)

Answer 1

import requests

from bs4 import BeautifulSoup

r = requests.get("http://stackoverflow.com/questions/29336915/python-scraping-webpages")
#  find all a tags with href attributes
# name the parser explicitly so bs4 does not have to guess one (and warn about it)
for a in BeautifulSoup(r.content, "html.parser").find_all("a", href=True):
    # print each href
    print(a["href"])

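Obviously this is a very broad example, but it will get you started. If you want specific URLs, you can narrow the search to certain elements, though how you do that will differ for every web page. You will struggle to find parsing tools that are easier to use than requests and BeautifulSoup.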
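If Beautiful Soup and regular expressions really are off the table, as the question says, the standard library's html.parser module is one possible route. The IndexError in the question most likely comes from calling page.readline() again after the "a href=" check, so the line that gets split is no longer the line that was tested and may not contain "href=" at all. The following is only a minimal sketch under stated assumptions, not the asker's intended solution: the names LinkCollector, extract_links and sample are made up for illustration, and the sample HTML is invented rather than fetched from a real page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect {link text: href} pairs from <a href=...> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = {}      # maps link text -> href
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples; tag names arrive lowercased
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # only keep text that appears while an <a href=...> tag is open
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            self.links[text] = self._href
            self._href = None


def extract_links(html):
    """Return a dict of link text -> href for the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


# invented sample input, standing in for a downloaded page
sample = '<p><a href="/about">About us</a> and <a href="http://example.com">Example</a></p>'
print(extract_links(sample))
```

To run this against a live page, the HTML could be read in one go with urllib.request.urlopen(url).read().decode("utf-8", errors="replace") and passed to extract_links. Note that two links with identical text overwrite each other in the dictionary, so keying on the href instead may suit some pages better.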
