Validating Scrapy project code

Time: 2014-02-26 11:41:30

Tags: python scrapy

I am trying to extract job listings from this website. Here is my code:

from scrapy.spider import Spider
from scrapy.selector import Selector

from tutorial.items import DmozItem

class DmozSpider(Spider):
    name = "myspider"
    allowed_domains =["tanitjobs.com/"]
    start_urls =["http://tanitjobs.com/search-results-jobs/"]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="offre"]/div[@class="detail"]')
        items = []
        item = DmozItem()
        for site in sites:
            item['title'] = site.xpath('a/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('div[@class="descriptionjob"]/text()').extract()
            items.append(item)
        return items

But the result is not correct (the item fields come back as empty lists):

    {'desc': [],
     'link': [u'lien'],
     'title': []}

and many more blocks like this...

2 answers:

Answer 0 (score: 2)

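item = DmozItem() should be called on every loop iteration; otherwise you keep overwriting the same item and appending that same object to the items list.

It should look like this: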

from scrapy.spider import Spider
from scrapy.selector import Selector

from tutorial.items import DmozItem

class DmozSpider(Spider):
    name = "myspider"
    allowed_domains = ["tanitjobs.com"]  # domain name only, no trailing slash
    start_urls =["http://tanitjobs.com/search-results-jobs/"]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="offre"]/div[@class="detail"]')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('a/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('div[@class="descriptionjob"]/text()').extract()
            items.append(item)
        return items
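
The effect of reusing a single item can be reproduced without Scrapy. The sketch below is only an illustration: plain dicts stand in for DmozItem, and the function names and sample rows are made up for the example. It shows that appending the same object on every iteration leaves the list holding several references to one item, so every entry ends up with the values from the last iteration.

```python
# Plain dicts stand in for DmozItem; the rows are made-up sample data.
rows = ["job A", "job B", "job C"]


def collect_shared(rows):
    """Reuse one item object for every row (the pattern from the question)."""
    item = {}
    items = []
    for row in rows:
        item["title"] = row   # overwrites the field on the same object
        items.append(item)    # appends another reference to that object
    return items


def collect_fresh(rows):
    """Create a new item object for every row (the pattern from this answer)."""
    items = []
    for row in rows:
        item = {}
        item["title"] = row
        items.append(item)
    return items


shared = collect_shared(rows)
fresh = collect_fresh(rows)

print([i["title"] for i in shared])  # every entry shows the last title
print([i["title"] for i in fresh])   # one distinct title per row
print(shared[0] is shared[1])        # the shared list holds one object
```

Note that this explains duplicated items, not the empty title and desc fields, which come from the xpath expressions themselves.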

Answer 1 (score: 0)

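Your title xpath does not account for the <strong> tags around the text, and your desc xpath needs to go down through another div to reach the information you want.

I just noticed that the xpath for the job description varies. The xpath in the code below returns the job description for the first three results but not for the later ones. You will need to inspect the later results to work out how the xpath changes in order to retrieve those descriptions.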

def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath('//div[@class="offre"]/div[@class="detail"]')
    items = []
    for site in sites:
        item = DmozItem()
        item['title'] = site.xpath('normalize-space(a/strong/text())').extract()
        item['link'] = site.xpath('a/@href').extract()
        item['desc'] = site.xpath('normalize-space(./div/div[@class="descriptionjob"]/text())').extract()
        items.append(item)
    return items
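
To see why the original expressions came back empty, the two points above can be checked on a small fragment using only the standard library. This is a rough sketch under stated assumptions: the markup is a hypothetical imitation of one listing (the "wrapper" class, the href and the text are invented, and the real page may differ), and xml.etree.ElementTree is used instead of Scrapy's Selector. Since ElementTree has no normalize-space(), the whitespace cleanup is approximated with split and join.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment imitating one job listing; the real markup may differ.
FRAGMENT = """
<div class="detail">
  <a href="/job/1"><strong>  Python Developer </strong></a>
  <div class="wrapper">
    <div class="descriptionjob">  Build   web scrapers. </div>
  </div>
</div>
"""

site = ET.fromstring(FRAGMENT)
link = site.find("a")

# The <a> element has no text of its own: the title sits inside <strong>.
direct_text = link.text
title = " ".join(link.find("strong").text.split())

# The description div is not a direct child of the "detail" div,
# so the one-level path finds nothing and the two-level path is needed.
direct_desc = site.find("div[@class='descriptionjob']")
nested_desc = site.find("div/div[@class='descriptionjob']")
desc = " ".join(nested_desc.text.split())

print(direct_text)   # None
print(title)         # Python Developer
print(direct_desc)   # None
print(desc)          # Build web scrapers.
```

The same kind of check on one of the later results should show which extra or missing level the description path needs there.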