I'm trying to scrape values from a website.
import lxml.html

# get the raw HTML
fruitsWebsite = lxml.html.parse("http://pagetoscrape.com/data.html")
# get all divs with class fruit
fruits = fruitsWebsite.xpath('//div[@class="fruit"]')
# Print the name of this fruit (obtained from an <em> in the fruit div)
for fruit in fruits:
    print(fruit.xpath('//li[@class="fruit"]/em')[0].text)
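However, the Python interpreter complains that index 0 is out of range. That is odd, because I'm sure the element exists. What is the correct way to access the inner <em> element with lxml?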
Answer 0 (score: 2)
The following code works on my test file.
# test.py
import lxml.html

# get the raw HTML
fruitsWebsite = lxml.html.parse('test.html')
# get all divs with class fruit
fruits = fruitsWebsite.xpath('//div[@class="fruit"]')
# Print the name of this fruit (obtained from an <em> in the fruit div)
for fruit in fruits:
    # Use a relative path so we don't find ALL of the li/em elements several times. Note the .//
    for item in fruit.xpath('.//li[@class="fruit"]/em'):
        print(item.text)

# Alternatively, skip the outer loop and query the whole document once
for item in fruitsWebsite.xpath('//div[@class="fruit"]//li[@class="fruit"]/em'):
    print(item.text)
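Here is the HTML file I tested against. If this doesn't work for the HTML you are testing against, you'll need to post a sample file that fails, as I requested in the comments above.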
<html>
  <body>
    Blah blah
    <div>Ignore me</div>
    <div>Outer stuff
      <div class='fruit'>Some <em>FRUITY</em> stuff.
        <ol>
          <li class='fruit'><em>This</em> should show</li>
          <li><em>Super</em> Ignored LI</li>
          <li class='fruit'><em>Rawr</em> Hear it roar.</li>
        </ol>
      </div>
    </div>
    <div class='fruit'><em>Super</em> fruity website of awesome</div>
  </body>
</html>
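Using the code as originally posted will definitely give too many results (the inner loop searches the whole tree rather than the subtree of each "fruit"). Unless your input differs from what I understand, the error you describe doesn't make much sense.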
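
To make the difference concrete, here is a minimal sketch that runs both path styles against an inline copy of the markup. The `SAMPLE` string and the `em_texts` helper are made up for illustration and are not part of the original post. It uses `lxml.html.fromstring` instead of `parse` so that no file is needed.

```python
import lxml.html

# Hypothetical inline markup, trimmed from the test file (an assumption for this sketch).
SAMPLE = """
<html><body>
  <div class="fruit">
    <ol>
      <li class="fruit"><em>This</em> should show</li>
      <li><em>Super</em> Ignored LI</li>
      <li class="fruit"><em>Rawr</em> Hear it roar.</li>
    </ol>
  </div>
  <div class="fruit"><em>Super</em> fruity website of awesome</div>
</body></html>
"""

root = lxml.html.fromstring(SAMPLE)
fruits = root.xpath('//div[@class="fruit"]')


def em_texts(element, path):
    """Return the text of every <em> matched by `path`, evaluated from `element`."""
    return [em.text for em in element.xpath(path)]


# A path starting with // ignores the context element and searches the whole tree,
# so every fruit div reports the same matches.
absolute = [em_texts(fruit, '//li[@class="fruit"]/em') for fruit in fruits]

# A path starting with .// only searches below the context element.
relative = [em_texts(fruit, './/li[@class="fruit"]/em') for fruit in fruits]

print(absolute)  # [['This', 'Rawr'], ['This', 'Rawr']]
print(relative)  # [['This', 'Rawr'], []]
```

The second fruit div has no matching `li`, so its relative result is an empty list. Taking `[0]` of an empty result raises `IndexError`, which is why looping over the result (or checking that it is non-empty first) is safer than indexing it directly.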