Question

I'm trying to write a web scraper that parses a publication's web page and extracts the authors. The skeleton of the page looks like this:

<html>
<body>
<div id="container">
<div id="contents">
<table>
<tbody>
<tr>
<td class="author">####I want whatever is located here ###</td>
</tr>
</tbody>
</table>
</div>
</div>
</body>
</html>

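So far I have been trying to use BeautifulSoup and lxml for this task, but I'm not sure how to handle the two div tags and the td tag, because they have attributes. Beyond that, I'm not sure whether I should rely more on BeautifulSoup, on lxml, or on a combination of the two. What should I do?

At the moment, my code looks like this: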

    import re
    import urllib2,sys
    import lxml
    from lxml import etree
    from lxml.html.soupparser import fromstring
    from lxml.etree import tostring
    from lxml.cssselect import CSSSelector
    from BeautifulSoup import BeautifulSoup, NavigableString

    address='http://www.example.com/'
    html = urllib2.urlopen(address).read()
    soup = BeautifulSoup(html)
    html=soup.prettify()
    html=html.replace('&nbsp', '&#160')
    html=html.replace('&iacute','&#237')
    root=fromstring(html)

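I realize that many of the import statements are probably redundant; I simply copied what I currently have in a larger source file.

Edit: I guess I didn't make this clear, but the page contains multiple tags that I want to scrape.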

Answer 1

It isn't clear to me from your question why you need to worry about the div tags at all. What about doing just:

soup = BeautifulSoup(html)
thetd = soup.find('td', attrs={'class': 'author'})
print thetd.string

On the HTML you provided, running this prints exactly:

####I want whatever is located here ###

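That appears to be what you want. Perhaps you can specify more precisely what you need that this super-simple snippet doesn't do: multiple td tags of class author that all need to be considered (all of them? just some? which ones?), the possibility that no such tag is present (what do you want to do in that case?), and so on. It's hard to infer your exact specs from just this simple example and an overabundance of code ;-).

Edit: if, as per the OP's latest comment, there are multiple such td tags, one per author: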

thetds = soup.findAll('td', attrs={'class': 'author'})
for thetd in thetds:
    print thetd.string

...i.e., not hard at all!
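The open cases mentioned above (several author cells, or none at all) can be sketched with nothing but the standard library. This is only an illustration under assumptions: it targets Python 3 and uses html.parser in place of BeautifulSoup so that it runs without extra installs, and the names AuthorCellParser and extract_authors are made up for this example.

```python
from html.parser import HTMLParser


class AuthorCellParser(HTMLParser):
    """Hypothetical helper: collects the text of every <td class="author">.

    Assumption: author cells do not contain nested tables.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.authors = []        # one entry per matching cell
        self._in_author = False  # True while inside an author cell
        self._chunks = []        # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and "author" in classes:
            self._in_author = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_author:
            self.authors.append("".join(self._chunks).strip())
            self._in_author = False

    def handle_data(self, data):
        if self._in_author:
            self._chunks.append(data)


def extract_authors(html):
    """Return the text of all author cells; an empty list if there are none."""
    parser = AuthorCellParser()
    parser.feed(html)
    parser.close()
    return parser.authors


page = ('<table><tr><td class="author">Ann <b>Lee</b></td>'
        '<td class="title">X</td><td class="author">Bob</td></tr></table>')
print(extract_authors(page))
```

An empty list signals that no author cell was found, so the caller can decide what to do in that case instead of hitting an attribute error on a missing tag.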

Answer 2

Or you could use pyquery, since BeautifulSoup 3 is no longer actively maintained; see http://www.crummy.com/software/BeautifulSoup/3.1-problems.html

First, install pyquery with

easy_install pyquery

Then your script could be as simple as

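from pyquery import PyQuery
d = PyQuery('http://mywebpage/')
# Iterating a PyQuery result directly yields lxml elements, whose .text is a
# plain attribute; .items() yields PyQuery objects, so .text() can be called.
allauthors = [ td.text() for td in d('td.author').items() ]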

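pyquery uses the CSS selector syntax familiar from jQuery, which I find more intuitive than BeautifulSoup's. It uses lxml underneath and is much faster than BeautifulSoup. But BeautifulSoup is pure Python, and therefore also works on Google's App Engine.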

Answer 3

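The lxml library is now the standard for parsing HTML in Python. The interface can seem awkward at first, but it is very serviceable for what it does.

You should let the library handle the XML specifics, such as escaping and entities: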

import lxml.html

html = """<html><body><div id="container"><div id="contents"><table><tbody><tr>
          <td class="author">####I want whatever is located here, eh? &iacute; ###</td>
          </tr></tbody></table></div></div></body></html>"""

root = lxml.html.fromstring(html)
tds = root.cssselect("div#contents td.author")

print tds           # gives [<Element td at 84ee2cc>]
print tds[0].text   # what you want, including the 'í'
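As a small illustration of leaving entities to a library instead of rewriting them by hand, as the question's code does for `&nbsp` and `&iacute`, here is a sketch that assumes Python 3 and uses the standard library's html.unescape rather than lxml. The sample string is invented for the example.

```python
from html import unescape

# Named and numeric character references in one made-up author string.
raw = "Mart&iacute;n&nbsp;P&#233;rez"

# unescape() decodes both kinds of reference in a single call.
decoded = unescape(raw)
print(decoded)
```

Note that `&nbsp;` decodes to a non-breaking space (U+00A0), not an ordinary space, which matters if the extracted names are compared or split later.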

Answer 4

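BeautifulSoup is certainly the canonical HTML parser/processor. But if you only need to match this kind of snippet, rather than build a whole hierarchical object representing the HTML, pyparsing makes it easy to define the leading and trailing HTML tags as part of a larger search expression: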

from pyparsing import makeHTMLTags, withAttribute, SkipTo

author_td, end_td = makeHTMLTags("td")

# only interested in <td>'s where class="author"
author_td.setParseAction(withAttribute(("class","author")))

search = author_td + SkipTo(end_td)("body") + end_td

for match in search.searchString(html):
    print match.body

Pyparsing's makeHTMLTags function does more than just emit "<tag>" and "</tag>" expressions. It also handles:

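caseless matching of tags
"<tag/>" syntax
zero or more attributes in the opening tag
attributes defined in arbitrary order
attribute names with namespaces
attribute values in single, double, or no quotes
intervening whitespace between the tag and symbols, or between the attribute name, '=', and value
attributes are accessible after parsing as named results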

These are the common pitfalls when considering a regular expression for HTML scraping.
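To make those pitfalls concrete, here is a minimal sketch, assuming Python 3 and only the standard re module. The pattern NAIVE and the sample strings are invented for illustration; each variant is the same author cell written in a different but common form.

```python
import re

# A naive pattern: assumes one attribute, double quotes, lower case, no spaces.
NAIVE = re.compile(r'<td class="author">(.*?)</td>')

variants = [
    '<td class="author">A</td>',         # the only form the pattern expects
    "<td class='author'>B</td>",         # single-quoted value
    '<td class=author>C</td>',           # unquoted value
    '<td id="x" class="author">D</td>',  # another attribute comes first
    '<TD CLASS="author">E</TD>',         # upper-case tag and attribute name
    '<td class = "author" >F</td>',      # whitespace around "=" and before ">"
]

matched = []
for variant in variants:
    m = NAIVE.search(variant)
    if m:
        matched.append(m.group(1))

print(matched)  # prints ['A']
```

Only the first variant matches; covering the other five with one regular expression quickly becomes hard to read, which is the motivation for a tag-aware helper or a real parser.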

Python web scraping involving HTML tags with attributes
