
Time: 2018-04-02 18:07:38

Tags: python tree scrapy


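I want to follow and extract all links located at the XPath (//div[@class="work_area_content"]/a), then apply the same XPath on every followed page, recursively, down to the deepest level of each link. I tried the code below; however, it only processes the top-level page and does not follow each link.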


class DatabloggerSpider(CrawlSpider):
    # The name of the spider
    name = "jobs"

    # The domains that are allowed (links to other domains are skipped)
    allowed_domains = ['']

    # The URLs to start with
    start_urls = ['']

    # Method for parsing items
    def parse(self, response):
        # The list of items that are found on the particular page
        items = []
        # Only extract canonicalized and unique links (with respect to the current page)
        test_str = response.text
        # Removes string between two placeholders with regex
        regex = r"(Back to)(.|\n)*?<br><br>"
        regex_response = re.sub(regex, "", test_str)
        regex_response2 = HtmlResponse(regex_response) ##TODO: fix here!

        links = LinkExtractor(canonicalize=True, unique=True, restrict_xpaths = ('//div[@class="work_area_content"]/a')).extract_links(regex_response2)
        # #Now go through all the found links
        for link in links:
            item = DatabloggerScraperItem()
            item['url_from'] = response.url
            item['url_to'] = link.url
        yield scrapy.Request(links, callback=self.parse, dont_filter=True)

        #Return all the found items
        return items

2 answers:

Answer 0 (score: 1)

I think you should use SgmlLinkExtractor with the follow=True parameter set.


links = SgmlLinkExtractor(follow=True, restrict_xpaths=('//div[@class="work_area_content"]/a')).extract_links(regex_response2)

Since you are using CrawlSpider, you should define rules; take a look at this blog post for a complete example.

Answer 1 (score: -1)


    def parse(self, response):
        # Request the detail page behind every job title link
        for href in response.css('h2.seotitle > a::attr(href)'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url=url, callback=self.parse_details)

        # Follow the "Next" pager link, if there is one
        next_page_url = response.css('ul.pager').xpath('//a[contains(text(), "Next")]/@althref').extract_first()
        print(next_page_url)
        if next_page_url:
            # Take the text between the first pair of single quotes in the onclick attribute
            nextpage = response.css('ul.pager').xpath('//a[contains(text(), "Next")]/@onclick').extract_first()
            searchresult_num = nextpage.split("'")[1].strip()
            next_page_url = "http://jobsearch.monsterindia.com/searchresult.html?day=1&n=" + searchresult_num
            next_page_url = response.urljoin(next_page_url)
            print(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)

    def parse_details(self, response):