I want to scrape some simple web links from this Stats Canada webpage. I want to get all the links that are `li` elements of class "indent-3". I thought the code would be:
import httplib2
from bs4 import BeautifulSoup, SoupStrainer
#stats canada webpage
base_page = ("http://www12.statcan.gc.ca/census-recensement/2006/dp-pd/tbt/Geo-index-eng.cfm?TABID=5&LANG=E&APATH=3&DETAIL=0&DIM=0&FL=A&FREE=0&GC=0&GID=0&GK=0&GRP=1&PID=99015&PRID=0&PTYPE=88971,97154&S=0&SHOWALL=0&SUB=0&Temporal=2006&THEME=70&VID=0&VNAMEE=&VNAMEF=&D1=0&D2=0&D3=0&D4=0&D5=0&D6=0")
http = httplib2.Http()
status, response = http.request(base_page)
soup = BeautifulSoup(response)
links = soup.find_all("li", class_="indent-3")
But when I run this code, `links` is a list of length 13, when it should have length 288. And when I do
soup.get_text()
the soup retrieves text from only a small portion of the webpage, stopping at Brackley, entry number 428 on the page.
Why am I not getting most of the webpage?
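One quick way to tell whether the truncation happens at download time or at parse time (a sketch using an inline snippet rather than the live page; on the real page, `response` would hold the bytes returned by httplib2) is to count raw occurrences of the class marker in the HTML and compare with what the parser returns:

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded HTML; substitute the real `response` bytes.
html = ('<ul><li class="indent-3"><a href="/a">A</a></li>'
        '<li class="indent-3"><a href="/b">B</a></li></ul>')

# Raw count of the class marker in the source text.
raw_count = html.count('class="indent-3"')

# Count after parsing; if this is smaller than raw_count,
# the parser (not the download) is dropping content.
parsed_count = len(
    BeautifulSoup(html, "html.parser").find_all("li", class_="indent-3")
)

print(raw_count, parsed_count)  # both are 2 here
```

If `raw_count` on the real page were already 13, the download itself would be incomplete; if it were 288 while `parsed_count` stayed at 13, the parser would be the culprit.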
Edit: Since it looked like BeautifulSoup wasn't the problem, I tried saving the site's HTML to a file, webfile.html, and then read it into Python directly.
with open("webfile.html") as f:  # file() is Python 2 only; open() works everywhere
    page = f.read()
soup = BeautifulSoup(page)
links = soup.find_all("li", class_="indent-3")
I still only get 13 links. I can't figure out what I'm doing wrong...
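Since the truncation survives a round-trip through a saved file, the parser is a suspect: when no parser is named, BeautifulSoup silently picks the "best" one installed, and different parsers recover differently from malformed HTML. A sketch of pinning the parser explicitly (the `snippet` here is a made-up malformed fragment for illustration; `lxml` and `html5lib` are optional installs that can be swapped in the same way):

```python
from bs4 import BeautifulSoup

# Deliberately malformed: the <li> is never closed, the kind of markup
# that lenient and strict parsers may treat differently.
snippet = '<li class="indent-3"><a href="/x">X</a>'

# Naming the parser makes results reproducible across machines.
soup = BeautifulSoup(snippet, "html.parser")
links = soup.find_all("li", class_="indent-3")
print(len(links))
```

If the counts differ between parsers on the saved file, the page's HTML is tripping up whichever parser BeautifulSoup chose by default.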
Answer 0 (score: 0)
It's not BeautifulSoup, it's the request you are making. Using `requests` and providing a `User-Agent` header worked for me:
import requests
from bs4 import BeautifulSoup
#stats canada webpage
base_page = "http://www12.statcan.gc.ca/census-recensement/2006/dp-pd/tbt/Geo-index-eng.cfm?TABID=5&LANG=E&APATH=3&DETAIL=0&DIM=0&FL=A&FREE=0&GC=0&GID=0&GK=0&GRP=1&PID=99015&PRID=0&PTYPE=88971,97154&S=0&SHOWALL=0&SUB=0&Temporal=2006&THEME=70&VID=0&VNAMEE=&VNAMEF=&D1=0&D2=0&D3=0&D4=0&D5=0&D6=0"
response = requests.get(base_page, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36'})
soup = BeautifulSoup(response.content, "html.parser")  # name the parser explicitly
links = soup.find_all("li", class_="indent-3")
print(len(links))  # prints 288
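Once the full list is back, the actual hrefs can be pulled out of the matched `li` elements. A sketch on an inline snippet (the hrefs and place names here are made up; the real code would use the `soup` built from `response.content`):

```python
from bs4 import BeautifulSoup

html = ('<ul>'
        '<li class="indent-3"><a href="/geo/1">Place 1</a></li>'
        '<li class="indent-3"><a href="/geo/2">Place 2</a></li>'
        '</ul>')
soup = BeautifulSoup(html, "html.parser")

# Each matched <li> wraps one anchor; collect its href attribute.
hrefs = [li.a["href"] for li in soup.find_all("li", class_="indent-3")]
print(hrefs)  # ['/geo/1', '/geo/2']
```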