BeautifulSoup error - not "seeing" the entire webpage?

Date: 2014-11-04 17:56:02

Tags: python web-scraping beautifulsoup

I want to scrape some simple web links from this Stats Canada webpage. I want to get all the links that are `li` elements with class "indent-3". I thought the code would be as follows:

import httplib2
from bs4 import BeautifulSoup, SoupStrainer

#stats canada webpage
base_page = ("http://www12.statcan.gc.ca/census-recensement/2006/dp-pd/tbt/Geo-index-eng.cfm?TABID=5&LANG=E&APATH=3&DETAIL=0&DIM=0&FL=A&FREE=0&GC=0&GID=0&GK=0&GRP=1&PID=99015&PRID=0&PTYPE=88971,97154&S=0&SHOWALL=0&SUB=0&Temporal=2006&THEME=70&VID=0&VNAMEE=&VNAMEF=&D1=0&D2=0&D3=0&D4=0&D5=0&D6=0")

http = httplib2.Http()
status, response = http.request(base_page) 
soup = BeautifulSoup(response)

links = soup.find_all("li", class_="indent-3")

But when I run this code, links is a list of length 13, when it should have length 288. And when I do

soup.get_text()

the soup retrieves text from only a very small part of the webpage, up to "Brackley", which is entry number 428 on the page.

Why am I not getting most of the webpage?

Edit: since it looks like BeautifulSoup isn't the problem, I tried saving the site's HTML to a file, webfile.html, and reading it directly into Python:

with open("webfile.html") as f:
    page = f.read()
soup = BeautifulSoup(page)
links = soup.find_all("li", class_="indent-3")

I still only get 13 links. I don't know what I'm doing wrong...
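One way to narrow a problem like this down (not part of the original post) is to check whether the HTML handed to the parser was truncated before parsing: `find_all` can only match what is actually there. A minimal, self-contained sketch using a made-up HTML snippet to simulate a cut-off response:

```python
from bs4 import BeautifulSoup

# Build a fake page with 288 matching list items (mirroring the expected count).
full_html = "<ul>%s</ul>" % "".join(
    '<li class="indent-3"><a href="/geo/%d">entry %d</a></li>' % (i, i)
    for i in range(288)
)

# Simulate a response that was cut off early in transit.
truncated_html = full_html[: len(full_html) // 20]

full_count = len(BeautifulSoup(full_html, "html.parser").find_all("li", class_="indent-3"))
partial_count = len(BeautifulSoup(truncated_html, "html.parser").find_all("li", class_="indent-3"))

print(full_count)     # 288
print(partial_count)  # far fewer: only the items that survived truncation
```

If `len(page)` (or `len(response)`) is much smaller than the size of the page as saved from a browser, the problem is in the download, not the parsing.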

1 answer:

Answer 0: (score: 0)

It isn't BeautifulSoup; it's the request you are making.

Using requests and providing a User-Agent header worked for me:

import requests
from bs4 import BeautifulSoup

#stats canada webpage
base_page = "http://www12.statcan.gc.ca/census-recensement/2006/dp-pd/tbt/Geo-index-eng.cfm?TABID=5&LANG=E&APATH=3&DETAIL=0&DIM=0&FL=A&FREE=0&GC=0&GID=0&GK=0&GRP=1&PID=99015&PRID=0&PTYPE=88971,97154&S=0&SHOWALL=0&SUB=0&Temporal=2006&THEME=70&VID=0&VNAMEE=&VNAMEF=&D1=0&D2=0&D3=0&D4=0&D5=0&D6=0"

response = requests.get(base_page, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36'})
soup = BeautifulSoup(response.content)

links = soup.find_all("li", class_="indent-3")
print len(links)  # prints 288
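If the goal is the actual link targets rather than the `li` elements themselves, a small follow-up step pulls the `href` out of each matched item. This sketch (my addition, not part of the answer) uses a stand-in HTML snippet with the same structure, since the live page would require a network request:

```python
from bs4 import BeautifulSoup

# Stand-in HTML mirroring the page structure: <li class="indent-3"> wrapping an <a>.
sample = """
<ul>
  <li class="indent-3"><a href="/page1">Abbotsford</a></li>
  <li class="indent-3"><a href="/page2">Brackley</a></li>
  <li class="indent-1"><a href="/other">not matched</a></li>
</ul>
"""

soup = BeautifulSoup(sample, "html.parser")

# Guard with `if li.a` in case a matching item has no anchor inside it.
hrefs = [li.a["href"] for li in soup.find_all("li", class_="indent-3") if li.a]
print(hrefs)  # ['/page1', '/page2']
```

The same list comprehension applied to the 288 links from the real page would yield the URLs directly.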