I have been rebuilding a scraper that sorts images, with help from an AI bot I have trained. The site I pull HTML from nests its HTML badly, because most of the page is rendered with JavaScript. As a result the HTML in the middle of the page cannot be detected, which keeps me from extracting the links I need with CSS selectors or XPath.

After saving the HTML to a file and formatting it with a Python HTML formatter, I have to read the formatted HTML back from that file and then extract the data. I have confirmed that I can point a URL at the file using the file:/// protocol.
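That confirmation can be reproduced with the standard library alone, independent of Scrapy; the temporary file and its contents below are illustrative:

```python
# Sanity check: serve local HTML over the file:/// scheme with urllib,
# no Scrapy involved. The temp file and its contents are illustrative.
import os
import tempfile
from pathlib import Path
from urllib.request import urlopen

with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False) as f:
    f.write('<body><a href="/img1">thumb</a></body>')
    path = f.name

url = Path(path).as_uri()             # e.g. file:///tmp/tmpabc123.html
html = urlopen(url).read().decode()   # reads the file back over file:///
os.remove(path)
```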
What happens:

1. The SplashRequest is handled (works)
2. strip_and_save is called (works)
3. strip_and_save instantiates a new scrapy.Request targeting the formatted HTML file, with a callback set for parsing (the scrapy.Request is instantiated, and as far as I can tell it is instantiated correctly)
4. The scrapy.Request calls self.parse and the body of parse executes (does not happen)

The problem: parse is never called. I have checked this in several ways, but I cannot get it to execute.
Here is my code:
import scrapy
import subprocess

from scrapy_splash import SplashRequest


class BrickSetSpider(scrapy.Spider):
    name = "mmbphotos"
    start_urls = ['https://www.umichbandphotos.com/2019-Season/Maggie-St-Clair/Celebration-of-Life-9292019/']
    allowed_domains = ['www.umichbandphotos.com', ]

    # Lua that Splash executes to render the page with JS
    script = '''
    function main(splash, args)
        assert(splash:go(args.url))
        assert(splash:wait(2.0))
        instagramScrollDown = splash:select('.sm-page-widget-social-links-instagram')
        for i = 0, 10, 1
        do
            instagramScrollDown:scrollIntoView()
            assert(splash:wait(0.5))
        end
        return {
            html = splash:html(),
        }
    end
    '''

    # Using SplashRequest because the page requires JS rendering;
    # Splash is a JS rendering service
    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.strip_and_save,
                                endpoint='execute',
                                args={'lua_source': self.script})

    # Reformats the HTML so it is legible to the Scrapy engine
    def strip_and_save(self, response):
        # <body>...</body> is what Scrapy detects;
        # saving only <body>...</body> saves space and execution time
        BODY_SELECTOR = '//body'
        bad_HTML = open('./fix_me.html', 'w')
        bad_HTML.write(response.xpath(BODY_SELECTOR).get())
        bad_HTML.close()
        # Fixing the bad HTML returned by UMICHBandPhotos.com.
        # css-html-prettify requires its input to come from a file;
        # it formats using lxml via BeautifulSoup
        subprocess.run(['css-html-prettify.py',
                        '/home/gavinsiver/band-faces/scraper/fix_me.html'])
        file_url = 'file:///home/gavinsiver/band-faces/scraper/fix_me.html'
        # Not using a SplashRequest because no JS rendering is needed here.
        # Note: a SplashRequest did not execute self.parse either.
        cleaned_page_request = scrapy.Request(file_url, callback=self.parse)
        yield cleaned_page_request

    # Should save response.text to a file.
    # Does not currently execute; reason unknown.
    def parse(self, response):
        # XPath to get the image URLs
        THUMBNAIL_SELECTOR = "//div/ul/li/div/a/@href"
        print('EXECUTED')  # debugging: did this run?
        file = open('./images.txt', 'w')
        file.write('EXECUTED\n')  # debugging: did this run?
        file.write('Processed\n'.join(response.text))
        file.close()
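Since the second request exists only to re-read a local file, a fallback I could live with is skipping the Scrapy engine for that hop and pulling the href values straight out of the prettified file with the standard library's html.parser. A sketch of that idea (HrefCollector and extract_hrefs are illustrative names, not part of the spider above):

```python
# Fallback sketch: read the cleaned file directly instead of routing it
# back through the Scrapy engine, and collect every <a href=...> with
# the stdlib html.parser. Names here are illustrative.
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Record the href attribute of every anchor tag encountered
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)

def extract_hrefs(html_text):
    collector = HrefCollector()
    collector.feed(html_text)
    return collector.hrefs

sample = '<body><div><ul><li><div><a href="/photo-1">t</a></div></li></ul></div></body>'
print(extract_hrefs(sample))  # ['/photo-1']
```

In strip_and_save this would replace the yielded scrapy.Request: read fix_me.html after the prettify step and write the returned list to images.txt.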