XPath only works for one item when I add // to it

Time: 2014-02-15 20:55:43

Tags: python-2.7 xpath scrapy

I have this HTML page:

<page>
<div class="results-list">

    <div class="item paid-featured-item"></div>
    <div class="item paid-featured-item"></div>
    <div class="item paid-featured-item"></div>
    <div class="item paid-featured-item"></div>
    <div class="item paid-featured-item"></div>
    <div class="item paid-featured-item"></div>
    <div class="item paid-featured-item"></div>
    <div class="item paid-featured-item"></div>

</div>
</page>

Inside each "paid-featured-item" div, I have:

<div class="item paid-featured-item">
    <div class="something">
        <div class="title">
            This is the title
        </div>
    </div>
    <div class="anotherthing">
    </div>
</div>

I want to extract the titles using XPath.

What I have tried

Container = "//div[@class='results-list']"

for item in Container:
    title = "//div[@class='title']/text()"

I get 8 titles back, but every one of them is the first title.

I am sure that is because I used //

What should I do?

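First

I don't want to use CSS selectors, because my work doesn't allow them.

Second

I don't want to rely on class="something", because that div is not always present in my pages.

Third

I am using Scrapy with Python.

Fourth

Thanks for your help.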

1 answer:

Answer 0 (score: 2)

Say your page (page.html) is:

<page>
  <div id="results-list">
    <div class="item paid-featured-item">
      <div class="something">
        <div class="title">Title 1</div>
      </div>
      <div class="anotherthing"></div>
    </div>
    <div class="item paid-featured-item">
      <div class="something">
        <div class="title">Title 2</div>
      </div>
      <div class="anotherthing"></div>
    </div>
    <div class="item paid-featured-item">
      <div class="something">
        <div class="title">Title 3</div>
      </div>
      <div class="anotherthing"></div>
    </div>
    <div class="item paid-featured-item">
      <div class="something">
        <div class="title">Title 4</div>
      </div>
      <div class="anotherthing"></div>
    </div>
  </div>
</page>

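To extract each title, select the items first and then query each one with a relative path. The leading "." in ".//" anchors the expression to the current node, whereas a bare "//" always searches from the document root, which is why every iteration returned the first title: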

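from scrapy.selector import Selector

# Load the sample page and wrap it in a Selector.
with open('page.html') as f:
    sel = Selector(text=f.read())

# '//' is fine here: this first query is meant to search the whole document.
container = sel.xpath('//div[@id="results-list"]')
items = container.xpath('.//div[@class="item paid-featured-item"]')
for item in items:
    # The leading '.' keeps the search inside the current item.
    # *extracted* is a single-item list containing the title.
    extracted = item.xpath('.//div[@class="title"]/text()').extract()
    title = extracted[0]
    print(title)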

This will output:

Title 1
Title 2
Title 3
Title 4
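
The difference between searching from the root and searching relative to each item can be seen in a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than Scrapy, so it runs without extra dependencies. ElementTree supports only a subset of XPath, so the document-wide search that "//" performs is imitated here by querying the root element inside the loop. The sample markup is hypothetical: the second item has no "something" wrapper and the third has no title, to show that ".//" finds the title at any depth and that a missing title should be guarded against instead of indexing blindly.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup (an assumption for illustration):
# item 2 lacks the "something" wrapper, item 3 has no title at all.
PAGE = """
<page>
  <div id="results-list">
    <div class="item paid-featured-item">
      <div class="something"><div class="title">Title 1</div></div>
    </div>
    <div class="item paid-featured-item">
      <div class="title">Title 2</div>
    </div>
    <div class="item paid-featured-item">
      <div class="anotherthing"></div>
    </div>
  </div>
</page>
"""

root = ET.fromstring(PAGE)
items = root.findall(".//div[@class='item paid-featured-item']")

# Searching from the document root inside the loop ignores the current
# item, so every iteration sees the same first match.
from_root = [root.find(".//div[@class='title']").text for _ in items]

# Searching relative to each item returns that item's own title,
# or None when the item has no title div.
relative = []
for item in items:
    node = item.find(".//div[@class='title']")
    relative.append(node.text.strip() if node is not None else None)

print(from_root)
print(relative)
```

In Scrapy the same guard would amount to checking that the list returned by extract() is non-empty before taking its first element.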