Cannot get Scrapy to execute the callback of a new scrapy.Request

Date: 2019-12-29 10:14:06

Tags: python beautifulsoup scrapy scrapy-splash splash-js-render

I have been rebuilding a scraper that feeds images to an AI bot I trained to sort them. The site I pull HTML from has badly nested HTML, because most of it is rendered with JavaScript. As a result, markup in the middle of the page cannot be detected, which prevents me from extracting the links I need with CSS selectors or XPath.

After saving the HTML to a file and formatting it with a Python HTML formatter, I have to read the formatted HTML back from that file and then extract the data. I have confirmed that I can set the URL to the file using the file:/// protocol.
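For reference, a minimal sketch of building a file:/// URL from a local path with pathlib (the helper function is illustrative on my part, not part of the spider; the file name matches the one used in the code below):

```python
from pathlib import Path

def to_file_url(path: str) -> str:
    """Convert a filesystem path to a file:/// URL that Scrapy can request."""
    # resolve() makes the path absolute; as_uri() requires an absolute path
    return Path(path).resolve().as_uri()

print(to_file_url('fix_me.html'))  # e.g. file:///home/user/scraper/fix_me.html
```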

What happens:

  1. The start link is handled by SplashRequest (works)
  2. The callback specified for the start link, strip_and_save, is called (works)
  3. strip_and_save instantiates a new scrapy.Request pointing at the formatted HTML file, with a callback set for parsing (as far as I can tell, the scrapy.Request is instantiated correctly)
  4. The scrapy.Request calls self.parse and the parsing logic runs (does not happen)

The problem: parse is never called. I have tried multiple variations, but it never executes.
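One detail I have not been able to rule out (an assumption, not a confirmed diagnosis): file:// URLs carry no hostname, so with allowed_domains set, offsite filtering could be silently dropping the request before the callback ever runs. The standard library shows the missing hostname:

```python
from urllib.parse import urlparse

# file:// URLs have an empty network location, so there is no hostname
# for Scrapy's allowed_domains check to match against.
parsed = urlparse('file:///home/gavinsiver/band-faces/scraper/fix_me.html')
print(parsed.scheme)    # 'file'
print(parsed.hostname)  # None
```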


Here is my code:

import scrapy
import subprocess
from scrapy_splash import SplashRequest

class BrickSetSpider(scrapy.Spider):
    name = "mmbphotos"
    start_urls = ['https://www.umichbandphotos.com/2019-Season/Maggie-St-Clair/Celebration-of-Life-9292019/']
    allowed_domains = ['www.umichbandphotos.com',]

    # Lua script executed by Splash to render the page with JS
    script = '''
        function main(splash, args)
            assert(splash:go(args.url))
            assert(splash:wait(2.0))

            instagramScrollDown = splash:select('.sm-page-widget-social-links-instagram')
            for i = 0, 10, 1
            do
                instagramScrollDown:scrollIntoView()
                assert(splash:wait(0.5))
            end

            return {
                html = splash:html(),
            }
        end
    '''

    # Using SplashRequest because the page requires JS rendering
    # Splash is a JS render service
    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.strip_and_save, endpoint='execute',
                args={'lua_source': self.script},
            )

    # Reformats HTML to be legible by the Scrapy Engine
    def strip_and_save(self, response):
        # <Body>...</Body> is detected by Scrapy
        # Only saving <Body>...</Body> saves on space and execution time
        BODY_SELECTOR = '//body'
        bad_HTML = open('./fix_me.html', 'w')
        bad_HTML.write(response.xpath(BODY_SELECTOR).get())
        bad_HTML.close()

        # Fixing bad html returned by UMICHBandPhotos.com
        # css-html-prettify requires its input to come from a file
        # css-html-prettify formats using lxml via BeautifulSoup
        subprocess.run(['css-html-prettify.py',
                  '/home/gavinsiver/band-faces/scraper/fix_me.html'])

        file_url = 'file:///home/gavinsiver/band-faces/scraper/fix_me.html'

        # Not calling a SplashRequest because no JS render engine needed
        # Note: SplashRequest did not execute self.parse either
        cleaned_page_request = scrapy.Request(file_url, callback=self.parse)
        yield cleaned_page_request

    # Should save response.text to a file
    # Does not currently execute; reason: unknown
    def parse(self, response):
        # XPath to get image URLs
        THUMBNAIL_SELECTOR = "//div/ul/li/div/a/@href"
        print('EXECUTED')  # Debugging if executed
        file = open('./images.txt', 'w')
        file.write('EXECUTED\n')  # Debugging if executed
        file.write(response.text)  # write the response body out for inspection
        file.close()
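A side note on the debug output written in parse: str.join treats a string as a sequence of characters, so joining a separator onto response.text would interleave it between every single character; writing the text directly is what I actually want. A quick illustration:

```python
# Joining on a string iterates it character by character, inserting the
# separator between every pair of characters.
sample = "abc"
print("Processed\n".join(sample))  # -> 'aProcessed\nbProcessed\nc'
```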

0 Answers:

There are no answers yet.