Question

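I want to write a Python function that fetches content from a website, for example the organization shown on a page.

In the page source, the organization is University of Tokyo: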

<tr class="odd">
  <th>Organization:</th>
  <td>University of Tokyo</td>
</tr>

How can I get the website content directly, without installing anything new, from a page such as http://www.ip-adress.com/ip_tracer/157.123.22.11?

Answer 1

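I like BeautifulSoup; it makes it easy to access data in an HTML string. The actual complexity depends on how the HTML is structured. If the HTML uses 'id' and 'class' attributes, it is simple. If not, you depend on something more static, like "take the first div, the second list item...", which is terrible if the content of the HTML changes a lot.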
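To make the difference concrete, here is a minimal sketch assuming the `bs4` package (BeautifulSoup 4) on Python 3. The table wrapper and the first placeholder row are made up for illustration; only the "Organization" row comes from the question.

```python
from bs4 import BeautifulSoup

# Sample markup: the first row is a hypothetical placeholder,
# the second row is the one shown in the question.
HTML = """
<table>
  <tr class="even"><th>Example:</th><td>placeholder</td></tr>
  <tr class="odd"><th>Organization:</th><td>University of Tokyo</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Attribute-based lookup: find the row by its class, then read its data cell.
by_class = soup.find("tr", class_="odd").td.get_text(strip=True)

# Positional lookup: "second row, first data cell". This breaks as soon as
# a row is added or removed above the one we want.
by_position = soup.find_all("tr")[1].td.get_text(strip=True)

print(by_class)
print(by_position)
```

Both lookups return the same cell here, but only the class-based one is likely to keep working if the page gains or loses rows.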

To download the HTML, I quote the example from the BeautifulSoup documentation (Python 2, BeautifulSoup 3):

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.icc-ccs.org/prc/piracyreport.php")
soup = BeautifulSoup(page)
for incident in soup('td', width="90%"):
    where, linebreak, what = incident.contents[:3]
    print where.strip()
    print what.strip()
    print

Answer 2

Use BeautifulSoup:

import bs4

html = """<tr class="odd">
  <th>Organization:</th>
  <td>University of Tokyo</td>
</tr>
"""
soup = bs4.BeautifulSoup(html)
univ = soup.tr.td.getText()
assert univ == u"University of Tokyo"
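Since the question asks for a way that needs no extra installation, a rough alternative is the standard library's `html.parser`. The sketch below assumes Python 3; the class name `LabelledCellParser` is hypothetical, and the logic only handles the simple "label in `<th>`, value in the next `<td>`" layout from the question.

```python
from html.parser import HTMLParser


class LabelledCellParser(HTMLParser):
    """Collect the text of the first <td> that follows a <th> with a given label."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.value = None
        self._tag = None          # "th" or "td" while inside such a cell
        self._label_seen = False  # True if the most recent <th> text matched

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # ignore whitespace between tags
        if self._tag == "th":
            self._label_seen = (text == self.label)
        elif self._tag == "td" and self._label_seen and self.value is None:
            self.value = text


html = """<tr class="odd">
  <th>Organization:</th>
  <td>University of Tokyo</td>
</tr>
"""

parser = LabelledCellParser("Organization:")
parser.feed(html)
print(parser.value)
```

This is more code than the BeautifulSoup version and less tolerant of unusual markup, which is the trade-off for avoiding a third-party dependency.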

Edit

If you need to read the HTML first, use urllib2:

import urllib2
html = urllib2.urlopen("http://example.com/").read()
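Note that `urllib2` exists only in Python 2; in Python 3 the same functionality lives in `urllib.request`. A minimal sketch of the equivalent download step follows. The helper name `fetch_html`, the 10-second timeout, and the UTF-8 fallback are assumptions made for illustration.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Download a URL and return its body decoded as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset declared in the response headers, else assume UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example call (performs a real network request):
# html = fetch_html("http://example.com/")
```

The returned string can then be passed to a parser in place of the hard-coded `html` variable.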

Answer 3

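You will get a 403 Access Forbidden error when using urllib2.urlopen, because this website filters access by checking whether it is being accessed by a recognized user agent. So here is the full thing: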
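A minimal Python 3 sketch of that idea, not the answer's original code: send an explicit `User-Agent` header with the request. The header value and the helper names `build_request` and `fetch_with_user_agent` are hypothetical, and whether a given site accepts this particular value is an assumption.

```python
import urllib.request

# Hypothetical browser-like identifier; the site may require a different one.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"


def build_request(url, user_agent=USER_AGENT):
    """Create a request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_with_user_agent(url, timeout=10):
    """Fetch a page as raw bytes while sending the configured User-Agent."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# Example call (performs a real network request):
# page = fetch_with_user_agent("http://www.ip-adress.com/ip_tracer/157.123.22.11")
```

The bytes returned by `fetch_with_user_agent` can be handed to BeautifulSoup in the same way as in the other answers.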

Parsing HTML with Python
