Scrapy ends after the first result

Date: 2017-10-02 17:23:49

Tags: scrapy scrapy-spider

I've been searching around and can't find the answer I'm looking for. I have my crawler (Scrapy) returning results from the right place. What I'm trying to do now is have it pull multiple results from the page; currently it pulls the first one and stops. If I remove extract_first(), it extracts all the data but groups it together. So I'm looking for either of two workable answers:

1) Keep pulling results instead of stopping after the first one
2) Split each grouped item out into its own result row

Here's my code:

    import scrapy
    from scrapy.selector import Selector
    from urlparse import urlparse
    from urlparse import urljoin
    from scrapy import Request
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.selector import HtmlXPathSelector
    #from scrapy.http import HtmlResponse

    class MySpider(CrawlSpider):
        name = "ziprecruiter"

        def start_requests(self):
            allowed_domains = ["https://www.ziprecruiter.com/"]
            urls = [
                'https://www.ziprecruiter.com/candidate/search?search=operations+manager&location=San+Francisco%2C+CA'
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            for houses in response.xpath('/html/body'):
                yield {
                    'Job_title:' : houses.xpath('.//span[@class="just_job_title"]//text()[1]').extract_first(),
                    'Company:' : houses.xpath('.//a[@class="t_org_link name"]//text()[1]').extract_first(),
                    'Location:' : houses.xpath('.//a[@class="t_location_link location"]//text()[1]').extract_first(),
                    'FT/PT:' : houses.xpath('.//span[@class="data_item"]//text()[1]').extract_first(),
                    # note: duplicate 'Link' key below; the second entry overwrites the first
                    'Link' : houses.xpath('/html/body/main/div/section/div/div[2]/div/div[2]/div[1]/article[4]/div[1]/button[1]/text()').extract_first(),
                    'Link' : houses.xpath('.//a/@href[1]').extract_first(),
                    'pay' : houses.xpath('./section[@class="perks_item"]/span[@class="data_item"]//text()[1]').extract_first()
                }

Thanks in advance!
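
For reference, the grouping behaviour described above can be reproduced in isolation. A minimal sketch (made-up HTML, just to show the difference between the two calls):

    from scrapy.selector import Selector

    html = '<div><span class="x">first</span><span class="x">second</span></div>'
    sel = Selector(text=html)

    # extract_first() returns only the first matching node's text (or None).
    print(sel.xpath('//span[@class="x"]/text()').extract_first())  # 'first'

    # extract() returns every match, grouped into a single list.
    print(sel.xpath('//span[@class="x"]/text()').extract())  # ['first', 'second']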

EDIT :: After more research, I redefined the container to crawl from, and that gives me all the right answers. Now my problem is how to grab every item on the page instead of only the first result... it just doesn't loop. Here's my code:

    import scrapy
    from scrapy.selector import Selector
    from urlparse import urlparse
    from urlparse import urljoin
    from scrapy import Request
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.selector import HtmlXPathSelector
    #from scrapy.http import HtmlResponse

    class MySpider(CrawlSpider):
        name = "ziprecruiter"

        def start_requests(self):
            allowed_domains = ["https://www.ziprecruiter.com/"]
            urls = [
                'https://www.ziprecruiter.com/candidate/search?search=operations+manager&location=San+Francisco%2C+CA'
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            for houses in response.xpath('/html/body/main/div/section/div/div[2]/div/div[2]/div[1]/article[1]/div[2]'):
                yield {
                    'Job_title:' : houses.xpath('.//span[@class="just_job_title"]//text()').extract(),
                    'Company:' : houses.xpath('.//a[@class="t_org_link name"]//text()').extract(),
                    'Location:' : houses.xpath('.//a[@class="t_location_link location"]//text()').extract(),
                    'FT/PT:' : houses.xpath('.//span[@class="data_item"]//text()').extract(),
                    'Link' : houses.xpath('.//a/@href').extract(),
                    'pay' : houses.xpath('./section[@class="perks_item"]/span[@class="data_item"]//text()').extract()
                }

1 Answer:

Answer 0 (score: 1)

I think you should use this xpath instead:

    //div[@class="job_content"]

because that's the class of the div you're actually looking for. Your current loop xpath ends in article[1]/div[2], so it matches exactly one node, which is why the for loop only ever yields one item. When I run the xpath above against this page, 20 div elements come back. You may want to add some filtering to the query, though, just in case there are other divs with the same class name that you don't want to parse.
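
Untested, but putting that together, the parse method could look something like the sketch below. It reuses the field selectors from your edited spider; the exact keys and xpaths are assumptions that may need adjusting:

    def parse(self, response):
        # One div[@class="job_content"] per job posting, so iterating here
        # yields one item per posting instead of one grouped item per page.
        for job in response.xpath('//div[@class="job_content"]'):
            yield {
                'Job_title': job.xpath('.//span[@class="just_job_title"]//text()').extract_first(),
                'Company': job.xpath('.//a[@class="t_org_link name"]//text()').extract_first(),
                'Location': job.xpath('.//a[@class="t_location_link location"]//text()').extract_first(),
                'FT/PT': job.xpath('.//span[@class="data_item"]//text()').extract_first(),
                'Link': job.xpath('.//a/@href').extract_first(),
                'pay': job.xpath('.//section[@class="perks_item"]//span[@class="data_item"]//text()').extract_first(),
            }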