How to integrate several "yield" commands together in a scrapy script

Date: 2019-01-23 00:27:04

Tags: python web-scraping scrapy web-crawler

My problem is that when I added the redirect-handling code from Can't get Scrapy to parse and follow 301, 302 redirects to my script, it fixed the redirect issue (the spider now runs without errors), but my csv file no longer gets any output. The problem is that in parse_links1 the if and else branches each end with a 'yield' statement, which seems to prevent the scrapy.Request line from ever being reached. This seems clear because the previous iteration of this code (which only went two link levels deep) worked perfectly. But since the newest level has the redirect problem, I had to add that code there.

My code looks like this:

    import scrapy
    import urlparse  # Python 2 stdlib; used as urlparse.urljoin below
    from urlparse import urljoin
    from scrapy.utils.python import to_native_str


    class TurboSpider(scrapy.Spider):
        name = "fourtier"
        handle_httpstatus_list = [404]
        start_urls = [
            "https://ttlc.intuit.com/browse/cd-download-support"]

        # def parse gets first set of links to use
        def parse(self, response):
            links = response.selector.xpath(
                '//ul[contains(@class, "list-unstyled")]//@href').extract()
            for link in links:
                yield scrapy.Request(link, self.parse_links,
                                     dont_filter=True)

        def parse_links(self, response):
            tier2_text = response.selector.xpath(
                '//a[contains(@class, "dropdown-item-link")]//@href').extract()
            for link in tier2_text:
                schema = 'https://turbotax.intuit.com/'
                links_to_use = urlparse.urljoin(schema, link)
                yield scrapy.Request(links_to_use, self.parse_links1)

        def parse_links1(self, response):
            tier2A_text = response.selector.xpath('//a').extract()

            for t in tier2A_text:
                if response.status >= 300 and response.status < 400:
                    # HTTP header is ascii or latin1,
                    # redirected url will be percent-encoded utf-8
                    location = to_native_str(
                        response.headers['location'].decode('latin1'))
                    request = response.request
                    redirected_url = urljoin(request.url, location)
                    if (response.status in (301, 307)
                            or request.method == 'HEAD'):
                        redirected = request.replace(url=redirected_url)
                        yield redirected
                    else:
                        redirected = request.replace(url=redirected_url,
                                                     method='GET', body='')
                        redirected.headers.pop('Content-Type', None)
                        redirected.headers.pop('Content-Length', None)
                        yield redirected
                    yield scrapy.Request((t, self.parse_links2))

        def parse_links2(self, response):
            divs = response.selector.xpath('//div')
            for p in divs.select('.//p'):
                yield {'text': p.extract()}

What is wrong with the way I have set up the 'yield' statements in the parse_links1 function, such that I now get no output? How do I integrate several 'yield' commands together?

1 answer:

Answer 0 (score: 0):

See Debugging Spiders.

A few logging statements should let you determine where something unexpected happens (execution never reaches a certain line, or some variable contains unexpected data), which in turn can help you either understand the problem or write a more specific question that is easier to answer.
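
For example (a minimal sketch of the idea, applied to the parse_links1 callback from the question; the exact messages and their placement are only illustrative), Scrapy's built-in self.logger lets you record which branch runs and what each loop variable holds:

    def parse_links1(self, response):
        # Log every response that reaches this callback, with its status,
        # to confirm the callback is invoked at all.
        self.logger.debug('parse_links1: %s returned status %d',
                          response.url, response.status)

        tier2A_text = response.selector.xpath('//a').extract()
        self.logger.debug('parse_links1: extracted %d <a> elements',
                          len(tier2A_text))

        for t in tier2A_text:
            if 300 <= response.status < 400:
                self.logger.debug('parse_links1: redirect branch for %s',
                                  response.url)
                # ... redirect handling exactly as in the question ...

                # If this message never appears in the log, execution is
                # not reaching the request that schedules parse_links2.
                self.logger.debug('parse_links1: scheduling %r '
                                  'for parse_links2', t)
                # yield scrapy.Request(...) as in the question

Running the spider with the log level turned up (for example scrapy crawl fourtier -L DEBUG) then shows these messages interleaved with Scrapy's own request/response log, making it easy to see where the flow stops.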