Scrapy: nested selectors: child selector runs on the whole page instead of on the parent selection

Date: 2018-01-23 17:48:26

Tags: python-3.x xpath scrapy

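I want to save to a .jl file all the data belonging to each item (say, a person) on a web page that lists many people. The parsing should look like this: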

for eachperson in response.xpath("//div[@class='person']"):
    person = myItem()
    person['name'] = eachperson.xpath('//h2[@class="name"]/text()').extract()
    person['date'] = eachperson.xpath('//h3[@class="date"]/text()').extract()
    person['address'] = eachperson.xpath('//div[@class="address"]/p/text()').extract()
    yield person

But I get an error. I have adapted my spider to the page http://quotes.toscrape.com/ (see below) so that you can reproduce it.

import scrapy
import requests

class TutoSpider(scrapy.Spider):
    name = "tuto"
    start_urls = [
            'file:///C:/Users/Me/Desktop/data.html'
        ]

    def parse(self, response):
        for quotechild in response.xpath("//div[@class='quote']"):
            print("\n\n", quotechild.extract())
            print("\n\n", quotechild.xpath('//span[@class="text"]/text()').extract())

The first print returns what I expect, but the second print returns every span class="text" on the whole page as a list, not just the one inside quotechild.

I followed https://doc.scrapy.org and many other tutorials, but I cannot find what I am doing wrong.

I am running on a local file because the original page I am working with renders its HTML through JavaScript. The .html file is just the source of http://quotes.toscrape.com/

Example of the first print:

<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“A woman is like a tea bag; you never know how strong it is until it's in hot water.”</span>
        <span>by <small class="author" itemprop="author">Eleanor Roosevelt</small>
        <a href="/author/Eleanor-Roosevelt">(about)</a>
        </span>
        ...
    </div>

Example of the second print (I expected only one item in the list for each print):

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", '“Try not to become a man of success. Rather become a man of value.”', '“It is better to be hated for what you are than to be loved for what you are not.”', "“I have not failed. I've just found 10,000 ways that won't work.”", "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", '“A day without sunshine is like, you know, night.”']

1 Answer:

Answer 0: (score: 2)

Starting an XPath expression with // makes it match from the document root, no matter which element you call it on.

To make the XPath relative to the element (searching only its descendants), start the expression with .// instead:

>>> len(quotechild.xpath('//span[@class="text"]/text()'))
10
>>> len(quotechild.xpath('.//span[@class="text"]/text()'))
1
>>> quotechild.xpath('.//span[@class="text"]/text()').extract_first()
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'