I have some HTML, shown below. I want to get the text inside each <span class="zzAggregateRatingStat"> element.
For the example below, I would get 3 and 5.
I am using Python 2.7 and lxml for this.
<div class="pp-meta-review">
<span class="zrvwidget" style="">
<span g:inline="true" g:type="NumUsersFoundThisHelpful" g:hideonnoratings="true" g:entity.annotation.groups="maps" g:entity.annotation.id="http://maps.google.com/?q=Central+Kia+of+Irving++(972)+659-2204+loc:+1600+East+Airport+Freeway,+Irving,+TX+75062&gl=US&sll=32.83624,-96.92526" g:entity.annotation.author="AIe9_BH8MR-1JD_4BhwsKrGCazUyU5siqCtjchckDcg5BAl5rOLd9nvhJJDTrtjL-xFI8D42bD_7">
<span class="zzNumUsersFoundThisHelpfulActive" zzlabel="helpful">
<span>
<span class="zzAggregateRatingStat">3</span>
</span>
<span>
<span> </span>
out of
<span> </span>
</span>
<span>
<span class="zzAggregateRatingStat">5</span>
</span>
<span>
<span> </span>
people found this review helpful.
</span>
</span>
</span>
</span>
</div>
Answer 0 (score: 4)
The following code works for your input:
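import lxml.html

# Parse the saved page and get its root element
root = lxml.html.parse('text.html').getroot()
# Select every span whose class attribute is exactly "zzAggregateRatingStat"
for span in root.xpath('//span[@class="zzAggregateRatingStat"]'):
    print(span.text)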
It prints:
3
5
I prefer lxml's xpath over CSSSelectors, though either can do the job.
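One difference between the two approaches is worth a small sketch. The XPath test @class="zzAggregateRatingStat" compares the whole attribute string, while a CSS class selector matches a single class token. The question's markup uses only single-class spans, so both behave the same there. The second span below, with its extra "highlight" class, is a hypothetical addition made purely for illustration.

```python
import lxml.html

# Hypothetical markup: the second span carries an extra class ("highlight"),
# which does not occur in the question's HTML.
html = (
    '<div>'
    '<span class="zzAggregateRatingStat">3</span>'
    '<span class="zzAggregateRatingStat highlight">5</span>'
    '</div>'
)
root = lxml.html.fromstring(html)

# Exact string comparison on the attribute: misses the two-class span.
exact = [s.text for s in root.xpath('.//span[@class="zzAggregateRatingStat"]')]

# Token match, roughly what the CSS selector .zzAggregateRatingStat does.
token_xpath = ('.//span[contains(concat(" ", normalize-space(@class), " "), '
               '" zzAggregateRatingStat ")]')
tokens = [s.text for s in root.xpath(token_xpath)]

print(exact)
print(tokens)
```

If the real page ever adds a second class to these spans, the token form would likely be the safer of the two expressions.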
ChrisP's example prints 3, but if you run it on your actual input, it raises an error:
$ python chrisp.py
Traceback (most recent call last):
File "chrisp.py", line 6, in <module>
doc = fromstring(text)
File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48270)
File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:71812)
File "parser.pxi", line 1424, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:70673)
File "parser.pxi", line 938, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:67442)
File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64088)
lxml.etree.XMLSyntaxError: EntityRef: expecting ';', line 3, column 210
ChrisP's code can be changed to use lxml.html.fromstring, which is a more lenient parser, instead of lxml.etree.fromstring.
With that change, it prints 3.
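The difference between the two parsers can be reproduced with a short sketch. The markup here is a shortened, hypothetical stand-in for the question's HTML. The "title" attribute is an assumption; what matters is the bare "&" in its value, which appears to be what triggers the "EntityRef: expecting ';'" error in the traceback.

```python
import lxml.etree
import lxml.html

# Shortened, hypothetical stand-in for the question's markup; the bare "&"
# in the attribute value is what the strict XML parser rejects.
text = (
    '<div>'
    '<span title="http://maps.google.com/?q=x&gl=US">'
    '<span class="zzAggregateRatingStat">3</span>'
    ' out of '
    '<span class="zzAggregateRatingStat">5</span>'
    '</span>'
    '</div>'
)

# Strict XML parsing: expected to fail on the unescaped ampersand.
try:
    lxml.etree.fromstring(text)
    strict_ok = True
except lxml.etree.XMLSyntaxError:
    strict_ok = False

# Lenient HTML parsing: tolerates the ampersand and still finds both spans.
doc = lxml.html.fromstring(text)
values = [s.text for s in doc.xpath('.//span[@class="zzAggregateRatingStat"]')]

print(strict_ok)
print(values)
```

Iterating over all matches, rather than taking only the first, gives both 3 and 5 as the question asks.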
Answer 1 (score: 0)
This is clearly documented at the lxml website:
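from lxml.etree import fromstring
from lxml.cssselect import CSSSelector

# Compile a CSS selector that matches elements with this class
sel = CSSSelector('.zzAggregateRatingStat')
text = '<span><span class="zzAggregateRatingStat">3</span></span>'
doc = fromstring(text)
# sel(doc) returns a list of matching elements; take the first one
el = sel(doc)[0]
print(el.text)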