I have this HTML page:
<page>
  <div class="results-list">
    <div class="item paid-featured-item"></div>
    <div class="item paid-featured-item"></div>
    <div class="item paid-featured-item"></div>
    <div class="item paid-featured-item"></div>
    <div class="item paid-featured-item"></div>
    <div class="item paid-featured-item"></div>
    <div class="item paid-featured-item"></div>
    <div class="item paid-featured-item"></div>
  </div>
</page>
Inside each "paid-featured-item" I have:
<div class="item paid-featured-item">
  <div class="something">
    <div class="title">
      This is the title
    </div>
  </div>
  <div class="anotherthing">
  </div>
</div>
I want to extract the titles using XPath.
Container = "//div[@class='results-list']"
for item in Container:
    title = "//div[@class='title']/text()"
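I get 8 titles, but every one of them is the first title.
What should I do?
I don't want to use CSS selectors, because my job doesn't allow it.
I don't want to rely on class="something", because that div is not always present in my pages.
I am using Scrapy with Python.
Thanks for your help.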
Answer 0 (score: 2)
Say your page (page.html) is:
<page>
  <div id="results-list">
    <div class="item paid-featured-item">
      <div class="something">
        <div class="title">Title 1</div>
      </div>
      <div class="anotherthing"></div>
    </div>
    <div class="item paid-featured-item">
      <div class="something">
        <div class="title">Title 2</div>
      </div>
      <div class="anotherthing"></div>
    </div>
    <div class="item paid-featured-item">
      <div class="something">
        <div class="title">Title 3</div>
      </div>
      <div class="anotherthing"></div>
    </div>
    <div class="item paid-featured-item">
      <div class="something">
        <div class="title">Title 4</div>
      </div>
      <div class="anotherthing"></div>
    </div>
  </div>
</page>
To extract each title, do the following:
from scrapy.selector import Selector

# Load the page and build a selector over its markup.
with open('page.html') as f:
    sel = Selector(text=f.read())

container = sel.xpath('//div[@id="results-list"]')
items = container.xpath('.//div[@class="item paid-featured-item"]')
for item in items:
    # The leading "." makes the query relative to the current item.
    # *extracted* is a single-item list containing the title.
    extracted = item.xpath('.//div[@class="title"]/text()').extract()
    title = extracted[0]
    print(title)
This outputs:
Title 1
Title 2
Title 3
Title 4
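The reason the original loop returned the first title every time is that an XPath starting with "//" is evaluated from the document root, even when it is called on a sub-selector, so every iteration matches the same nodes. Prefixing the expression with "." anchors it to the current item. Below is a minimal sketch of that per-item lookup. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's Selector (ElementTree supports only a subset of XPath), and the sample markup, including an item without the "something" wrapper and an item without any title, is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page: the second item has no "something" wrapper,
# and the third item has no title at all.
PAGE = """
<page>
  <div class="results-list">
    <div class="item paid-featured-item">
      <div class="something">
        <div class="title">Title 1</div>
      </div>
      <div class="anotherthing"></div>
    </div>
    <div class="item paid-featured-item">
      <div class="title">Title 2</div>
    </div>
    <div class="item paid-featured-item">
      <div class="anotherthing"></div>
    </div>
  </div>
</page>
"""

root = ET.fromstring(PAGE)

# Select every item that sits directly under the results list.
items = root.findall(
    ".//div[@class='results-list']/div[@class='item paid-featured-item']"
)

titles = []
for item in items:
    # The leading "." anchors the search at the current item, at any depth,
    # so the lookup does not depend on the "something" wrapper being present.
    node = item.find(".//div[@class='title']")
    # find() returns None when the item has no title, so guard before reading.
    if node is not None and node.text:
        titles.append(node.text.strip())
    else:
        titles.append(None)

print(titles)
```

Because the title is matched at any depth below the item, the query does not need to mention class="something". In Scrapy the same guard applies to the extracted list: indexing extracted[0] fails on an item with no title, so checking that the list is non-empty first avoids an IndexError.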