Question

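So I'm trying to use Beautiful Soup to get the contents of this page. I want to build a dictionary of all the CSS color names, and this seemed like a quick and easy way to get at them. So naturally I started with the quick basics: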

from bs4 import BeautifulSoup as bs
url = 'http://www.w3schools.com/cssref/css_colornames.asp'
soup = bs(url)

For some reason, all I get back is the URL itself inside a p tag in the body, and nothing else:

>>> print soup.prettify()
<html>
 <body>
  <p>
   http://www.w3schools.com/cssref/css_colornames.asp
  </p>
 </body>
</html>

Why isn't BeautifulSoup able to access the information I need?

Answer 1

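BeautifulSoup does not load URLs for you. It treats whatever string you pass in as markup, which is why your output is just the URL wrapped in a p tag.

You need to pass in the complete HTML of the page, which means you have to load it from the URL first. Here is an example that does this with the urllib2.urlopen function: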

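from urllib2 import urlopen
from bs4 import BeautifulSoup as bs

# Same URL as in the question
url = 'http://www.w3schools.com/cssref/css_colornames.asp'

# Download the page first, then hand the HTML (not the URL) to BeautifulSoup
source = urlopen(url).read()
soup = bs(source)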
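
The snippet above targets Python 2. On Python 3, urllib2 was folded into urllib.request, so the same step might look like the sketch below. The helper name load_soup is hypothetical, and choosing the built-in "html.parser" backend is an assumption; any parser BeautifulSoup supports should work.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def load_soup(url):
    """Download the page at `url` and parse it (hypothetical helper)."""
    with urlopen(url) as response:
        html = response.read()  # raw bytes; BeautifulSoup works out the encoding
    # Naming the parser explicitly avoids bs4's "no parser specified" warning.
    return BeautifulSoup(html, "html.parser")
```

Calling load_soup(url) with the URL from the question should give a soup object built from the page's HTML rather than from the URL string.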

Now you can extract the colors just fine:

css_table = soup.find('table', class_='reference')
for row in css_table.find_all('tr'):
    cells = row.find_all('td')
    if cells:
        print cells[0].a.text, cells[1].a.text
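
Since the goal in the question was a dictionary, the loop above can be adapted to collect the pairs instead of printing them. This is only a sketch in Python 3: the function name colors_from_html and the SAMPLE markup are made up for illustration, and it assumes the page still uses the layout the loop relies on (a table with class "reference" whose first two cells each contain a link).

```python
from bs4 import BeautifulSoup


def colors_from_html(html):
    """Map color name -> hex value, assuming the table layout described above."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="reference")
    colors = {}
    if table is None:  # layout changed, or this is not the expected page
        return colors
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        # Header rows use <th>, so they produce no <td> cells and are skipped.
        if len(cells) >= 2 and cells[0].a and cells[1].a:
            name = cells[0].a.get_text(strip=True)
            colors[name] = cells[1].a.get_text(strip=True)
    return colors


# Tiny stand-in for the real page, so the sketch runs without network access.
SAMPLE = """
<table class="reference">
  <tr><th>Name</th><th>HEX</th></tr>
  <tr><td><a href="#">AliceBlue</a></td><td><a href="#">#F0F8FF</a></td></tr>
  <tr><td><a href="#">Tomato</a></td><td><a href="#">#FF6347</a></td></tr>
</table>
"""

print(colors_from_html(SAMPLE))
```

If the site has changed its markup since this answer was written, the function returns an empty dictionary rather than raising, which makes the mismatch easy to spot.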
