The page I am scraping is link. I want to get all the a href links to the papers. My code so far:
import requests
from bs4 import BeautifulSoup
response = requests.get('http://ijcai.org/proceedings/2011')
soup = BeautifulSoup(response.content, 'html.parser')
page = soup.find('div', class_ ='field-item even')
tree = [child for child in page.children]
But when I try tree[-1], I get:
<a href="Erratum049.html">Erratum</a>
which is only about halfway down the page. Why can't I get the rest of the page? Any ideas? Thanks in advance!
Answer 0 (score: 1)
The HTML of this page is not well-formed, so use a different parser, e.g. html5lib (requires installing html5lib):
soup = BeautifulSoup(response.content, 'html5lib')
or lxml (requires installing lxml):
soup = BeautifulSoup(response.content, 'lxml')
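The parsers differ most on broken markup like this page's: html.parser accepts the input as-is and quietly drops stray tags, while html5lib repairs the document the way a browser would. A minimal offline sketch of that difference (the one-line fragment is an illustrative example, not taken from the IJCAI page):

```python
from bs4 import BeautifulSoup

# A deliberately malformed fragment: an <a> tag followed by a stray </p>.
fragment = "<a></p>"

# The bundled parser simply drops the stray closing tag.
lenient = BeautifulSoup(fragment, "html.parser")

# html5lib rebuilds a full, valid document (adds <html>/<body>)
# and turns the stray </p> into an empty paragraph inside the link.
repaired = BeautifulSoup(fragment, "html5lib")

print(lenient)   # no <p> element survives
print(repaired)  # full document tree with a <p> inside the <a>
```

Because html5lib normalizes the tree to what a browser would build, the truncation you saw with html.parser goes away.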
Now tree[-1] will be the last paragraph on the page:
<p><a href="http://ijcai.org/Proceedings/11/Papers/507.pdf">Index</a> / 2871</p>
I would also improve the way you extract the links:
links = [a["href"] for a in soup.select(".field-item a")]
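Putting it together on a small offline sample (the HTML below is a made-up stand-in for the proceedings page, so the hrefs are illustrative):

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for the "field-item even" content div.
sample = """
<div class="field-item even">
  <p><a href="http://ijcai.org/Proceedings/11/Papers/001.pdf">Paper 1</a></p>
  <p><a href="Erratum049.html">Erratum</a></p>
  <p><a href="http://ijcai.org/Proceedings/11/Papers/507.pdf">Index</a> / 2871</p>
</div>
"""

soup = BeautifulSoup(sample, "html.parser")

# One CSS select grabs every <a> under the content div,
# instead of iterating over .children by hand.
links = [a["href"] for a in soup.select(".field-item a")]

# If you only want the papers, filter for PDF links.
pdfs = [href for href in links if href.endswith(".pdf")]
print(pdfs)
```

The selector ".field-item a" matches the div even though its class attribute is "field-item even", because each space-separated class is matched independently.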