Extracting <a> tags using BeautifulSoup

Time: 2015-12-26 23:32:03

Tags: python python-2.7 beautifulsoup web-crawler

I am facing an issue while extracting 'a' tags from https://www.symantec.com/index.jsp

Here is the code; it gives me an empty list of links.

from bs4 import BeautifulSoup
import urllib2

response = urllib2.urlopen('https://www.symantec.com/index.jsp').read()
soup = BeautifulSoup(response, 'html.parser')
links = soup.find_all('a')
print(links)

The code works for other URLs, but not for this one. Is it because index.jsp is dynamic? What could be the solution?

I am using Python 2.7.

1 answer:

Answer 0: (score: 4)

Change the parser to html5lib or lxml:

soup = BeautifulSoup(response, 'html5lib')
# or
soup = BeautifulSoup(response, 'lxml')

This requires html5lib or lxml to be installed:

pip install html5lib
pip install lxml
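
If you are not sure which of these parsers is installed on a given machine, one option is to try them in order and fall back to the built-in one. The sketch below is only an illustration: `make_soup` and its `preferred` parameter are names I made up, and the sample markup is a stand-in for the real page. It relies on bs4 raising `FeatureNotFound` when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed.

    make_soup is a hypothetical helper, not part of bs4.
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing.
            continue
    raise RuntimeError("no usable parser among: %s" % ", ".join(preferred))


# Stand-in markup; in the question this would be the downloaded response.
soup, parser_used = make_soup('<p><a href="/a">A</a> <a href="/b">B</a></p>')
print("%s found %d links" % (parser_used, len(soup.find_all("a"))))
```

Note that falling back to html.parser would bring back the empty result for the page in the question, so for that page you would probably want to pass only `("lxml", "html5lib")` and let the error surface.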

Demo:

>>> from bs4 import BeautifulSoup
>>> import urllib2
>>> 
>>> response = urllib2.urlopen('https://www.symantec.com/index.jsp').read()
>>> len(BeautifulSoup(response, 'html.parser').find_all("a"))
0
>>> len(BeautifulSoup(response, 'html5lib').find_all("a"))
187
>>> len(BeautifulSoup(response, 'lxml').find_all("a"))
187
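
On the "is it because index.jsp is dynamic?" part: a rough way to tell a parser problem from JavaScript-generated content is to count the `<a` tags in the raw response text and compare that with what each parser finds. This is a minimal sketch under my own assumptions: `count_anchors` is a hypothetical helper, the regex is only an approximate count, and the sample string stands in for the downloaded page.

```python
import re

from bs4 import BeautifulSoup, FeatureNotFound


def count_anchors(markup, parsers=("html.parser", "html5lib", "lxml")):
    """Return (raw_count, {parser_name: parsed_count or None}).

    count_anchors is a hypothetical helper, not part of bs4.
    """
    if isinstance(markup, bytes):
        # urlopen(...).read() returns bytes; decode so the regex can run.
        markup = markup.decode("utf-8", "replace")
    # Approximate count of opening <a> tags in the raw text, no parser involved.
    raw_count = len(re.findall(r"<a[\s>]", markup, flags=re.IGNORECASE))
    parsed = {}
    for name in parsers:
        try:
            parsed[name] = len(BeautifulSoup(markup, name).find_all("a"))
        except FeatureNotFound:
            parsed[name] = None  # this parser is not installed
    return raw_count, parsed


# Stand-in markup; replace with the response from urlopen in practice.
sample = '<div><a href="/x">x</a><A href="/y">y</A></div>'
raw_count, parsed = count_anchors(sample)
print("raw: %d, parsed: %r" % (raw_count, parsed))
```

If the raw count is well above zero while one parser reports 0, the links are present in the served HTML and switching parsers is likely to help, which is what the session above suggests for this page. If the raw count is itself 0, the links are probably added by JavaScript after the page loads, and changing the parser would not help.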

See also the relevant section of the documentation: