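I created a spider that checks a particular movie-booking site to see whether booking has opened for a film. It checks every 10 seconds. The problem is that even after booking opens on the site, my code never fetches the updated page and keeps using the old scraped data.
For example:
I scraped the site at 8 AM, when booking for film "A" was not yet open. Booking for film "A" opened at 12 PM, but the spider still reports that booking has not opened. Note that I am using an infinite while loop, so the program has been running since 8 AM without ever stopping.
Code: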
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
import threading
import time
import datetime
import winsound

class NewFilmSpiderSpider(scrapy.Spider):
    name = 'new_film_spider'
    allowed_domains = ['www.spicinemas.in']
    start_urls = ['https://www.spicinemas.in/coimbatore/now-showing']

    def parse(self, response):
        t = threading.Thread(self.getDetails(response))
        t.start()

    def getDetails(self, response):
        while True:
            records = response.xpath('//section[@class="main-section"]/section[2]/section[@class="movie__listing now-showing"]/ul/li/div/dl/dt/a/text()').extract()
            if 'NGK' in str(records):
                try:
                    print("Booking Opened",datetime.datetime.now())
                    winsound.PlaySound('alert.wav', winsound.SND_FILENAME)
                except Exception:
                    print ("Error: unable to play sound")
            else:
                print("Booking Not Opened",datetime.datetime.now())
            time.sleep(10)
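If you run the code now, it reports that booking has opened. But I need the page to be scraped again on every pass through the while loop. How can I do that?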
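A side note on the `threading.Thread(self.getDetails(response))` line in the code above: written that way, `getDetails` is called immediately in the calling thread, and only its return value would be handed to `Thread`. Because `getDetails` loops forever, the `Thread` object is never even created. The following is a minimal sketch of the difference, using a hypothetical `work` function in place of `getDetails`; it illustrates the `threading` API only and does not by itself fix the stale-data problem.

```python
import threading

calls = []  # records (tag, name of the thread that ran the function)
caller_name = threading.current_thread().name


def work(tag):
    """Hypothetical stand-in for getDetails: note which thread ran it."""
    calls.append((tag, threading.current_thread().name))


# work("eager") runs right here, in the calling thread. Thread() only
# receives its return value (None), so the thread has nothing to run.
t1 = threading.Thread(work("eager"))
t1.start()
t1.join()

# Passing the function and its arguments separately defers the call
# until start(), and the call then runs in the new thread.
t2 = threading.Thread(target=work, args=("threaded",), name="worker")
t2.start()
t2.join()
```

After this runs, the first entry in `calls` carries the caller's thread name and the second carries `"worker"`.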
Update #1:
When I run it with the solution given below, I get this traceback:
  File "C:\Users\ranji\Documents\Spiders\SpiCinemasSpider\spicinemas_spider\spiders\new_film_spider.py", line 34, in <module>
    main()
  File "C:\Users\ranji\Documents\Spiders\SpiCinemasSpider\spicinemas_spider\spiders\new_film_spider.py", line 30, in main
    process.start()
  File "C:\Users\ranji\AppData\Local\Programs\Python\Python37-32\lib\site-packages\scrapy\crawler.py", line 293, in start
    reactor.run(installSignalHandlers=False) # blocking call
  File "C:\Users\ranji\AppData\Local\Programs\Python\Python37-32\lib\site-packages\twisted\internet\base.py", line 1271, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "C:\Users\ranji\AppData\Local\Programs\Python\Python37-32\lib\site-packages\twisted\internet\base.py", line 1251, in startRunning
    ReactorBase.startRunning(self)
  File "C:\Users\ranji\AppData\Local\Programs\Python\Python37-32\lib\site-packages\twisted\internet\base.py", line 754, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
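Answer 0 (score: 0)
The problem is that the thread works on the same `response` data every time and expects that data to change. It never will, because a `response` is a snapshot of a single download. Below is modified code showing how to crawl the page every 10 seconds and check the XPath value.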
# -*- coding: utf-8 -*-
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.http import Request
import time
import datetime
import winsound

class NewFilmSpiderSpider(scrapy.Spider):
    name = 'new_film_spider'
    allowed_domains = ['www.spicinemas.in']
    start_urls = ['https://www.spicinemas.in/coimbatore/now-showing']

    def parse(self, response):
        records = response.xpath('//section[@class="main-section"]/section[2]/section[@class="movie__listing now-showing"]/ul/li/div/dl/dt/a/text()').extract()
        if 'NGK' in str(records):
            try:
                print("Booking Opened",datetime.datetime.now())
                winsound.PlaySound('alert.wav', winsound.SND_FILENAME)
            except Exception:
                print ("Error: unable to play sound")
        else:
            print("Booking Not Opened",datetime.datetime.now())

def main():
    try:
        process = CrawlerProcess()
        process.crawl(NewFilmSpiderSpider)
        process.start()
        while True:
            process.crawl(NewFilmSpiderSpider)
            time.sleep(10)
    except KeyboardInterrupt:
        process.join()

if __name__ == "__main__":
    main()
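The diagnosis (a loop that keeps re-reading one downloaded response can never see a change) can be reproduced without Scrapy. The sketch below is illustrative only: `make_fake_site`, `poll_stale` and `poll_fresh` are hypothetical helpers, and the fake site is assumed to open booking on its third fetch.

```python
import itertools


def make_fake_site(opens_on_fetch=3):
    """Hypothetical stand-in for the website: the title list is empty
    until the given fetch number, then it contains 'NGK'."""
    counter = itertools.count(1)

    def fetch():
        n = next(counter)
        return ["NGK"] if n >= opens_on_fetch else []

    return fetch


def poll_stale(fetch, attempts):
    """Fetch once, then inspect the same snapshot on every iteration
    (what the loop inside getDetails does with `response`)."""
    snapshot = fetch()
    return ["NGK" in snapshot for _ in range(attempts)]


def poll_fresh(fetch, attempts):
    """Fetch again on every iteration, so changes become visible."""
    return ["NGK" in fetch() for _ in range(attempts)]


stale_result = poll_stale(make_fake_site(), 4)
fresh_result = poll_fresh(make_fake_site(), 4)
```

Here `stale_result` stays `False` for all four checks, while `fresh_result` turns `True` from the third check onward.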
References: https://doc.scrapy.org/en/latest/topics/practices.html, https://stackoverflow.com/a/43480164/1509809