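Python scraping webpages

Time: 2015-03-30 01:05:46

Tags: python web-scraping webpage

I am trying to extract links and their text from a webpage line by line, then insert each text and link into a dictionary, without using Beautiful Soup or regular expressions.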

I keep getting this error:

Error:

 Traceback (most recent call last):
 File "F:/Homework7-2.py", line 13, in <module>
 link2 = link1.split("href=")[1]
 IndexError: list index out of range

Code:

import urllib.request
url = "http://www.facebook.com" 
page = urllib.request.urlopen(url)
mylinks = {}
links = page.readline().decode('utf-8')


for items in links:
  links = page.readline().decode('utf-8')
  if "a href=" in links:
     links = page.readline().decode('utf-8')
     link1 = links.split(">")[0]
     link2 = link1.split("href=")[1]
     mylinks = link2
     print(mylinks)

1 answer:

Answer 0 (score: 0)

import requests
from bs4 import BeautifulSoup

r = requests.get("http://stackoverflow.com/questions/29336915/python-scraping-webpages")
# parse the response; naming the parser explicitly avoids bs4's "no parser specified" warning
soup = BeautifulSoup(r.content, "html.parser")
# find all a tags with href attributes
for a in soup.find_all("a", href=True):
    # print each href
    print(a["href"])

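Obviously this is a very broad example, but it will get you started. If you want specific URLs, you can narrow the search down to certain elements, but that will be different for every webpage. You won't find many tools that are easier to use for parsing than requests and BeautifulSoup.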
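As for the IndexError itself: the question's code tests one line for "a href=" and then reads the next line before splitting. It also keeps only the part before the first ">". Either way, the string that reaches split("href=") often contains no "href=", so the result has one element and [1] fails. If Beautiful Soup and regular expressions really are off limits, a minimal sketch using the standard library's html.parser might look like the following. The names LinkCollector and extract_links and the sample HTML are made up for illustration, and the whole page is assumed to be parsed at once rather than line by line.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect {link text: href} pairs from <a href=...> tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = {}
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...> tag.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            self.links[text] = self._href
            self._href = None


def extract_links(html):
    """Return a dict mapping link text to href for the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample input; for a live page, read the whole response first, e.g.
# html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
sample = '<p><a href="/home">Home</a> and <a class="x" href="http://example.com/a">An <b>example</b></a></p>'
print(extract_links(sample))
```

Because the dictionary is keyed by link text, two links with the same text overwrite each other. Keying by href, or collecting a list of (text, href) tuples, avoids that if it matters.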