The page I am scraping is link. I want to get all the a href links to the papers. My code so far:
import requests
from bs4 import BeautifulSoup
response = requests.get('http://ijcai.org/proceedings/2011')
soup = BeautifulSoup(response.content, 'html.parser')
page = soup.find('div', class_ ='field-item even')
tree = [child for child in page.children]
But when I try tree[-1], I get:
<a href="Erratum049.html">Erratum</a>
which is only about halfway down the page. Why can't I get the rest of the page? Any ideas? Thanks in advance!
Answer 0 (score: 1)
The HTML of this page is not well-formed, so use a different parser, e.g. html5lib (requires installing html5lib):
soup = BeautifulSoup(response.content, 'html5lib')
or lxml (requires installing lxml):
soup = BeautifulSoup(response.content, 'lxml')
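The parsers differ most on broken markup like this page's: html.parser accepts the input as-is and quietly drops stray tags, while html5lib repairs the document the way a browser would. A minimal offline sketch of that difference (the one-line fragment is an illustrative example, not taken from the IJCAI page):

```python
from bs4 import BeautifulSoup

# A deliberately malformed fragment: an <a> tag followed by a stray </p>.
fragment = "<a></p>"

# The bundled parser simply drops the stray closing tag.
lenient = BeautifulSoup(fragment, "html.parser")

# html5lib rebuilds a full, valid document (adds <html>/<body>)
# and turns the stray </p> into an empty paragraph inside the link.
repaired = BeautifulSoup(fragment, "html5lib")

print(lenient)   # no <p> element survives
print(repaired)  # full document tree with a <p> inside the <a>
```

Because html5lib normalizes the tree to what a browser would build, the truncation you saw with html.parser goes away.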
Now tree[-1] will be the last paragraph on the page:
<p><a href="http://ijcai.org/Proceedings/11/Papers/507.pdf">Index</a> / 2871</p>
I would also improve the way you extract the links:
links = [a["href"] for a in soup.select(".field-item a")]
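Putting it together on a small offline sample (the HTML below is a made-up stand-in for the proceedings page, so the hrefs are illustrative):

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for the "field-item even" content div.
sample = """
<div class="field-item even">
  <p><a href="http://ijcai.org/Proceedings/11/Papers/001.pdf">Paper 1</a></p>
  <p><a href="Erratum049.html">Erratum</a></p>
  <p><a href="http://ijcai.org/Proceedings/11/Papers/507.pdf">Index</a> / 2871</p>
</div>
"""

soup = BeautifulSoup(sample, "html.parser")

# One CSS select grabs every <a> under the content div,
# instead of iterating over .children by hand.
links = [a["href"] for a in soup.select(".field-item a")]

# If you only want the papers, filter for PDF links.
pdfs = [href for href in links if href.endswith(".pdf")]
print(pdfs)
```

The selector ".field-item a" matches the div even though its class attribute is "field-item even", because each space-separated class is matched independently.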