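I want to build a GUI with two buttons, "Open Input File" and "Run". When the user clicks "Open Input File", they can select a file from their computer that contains a column of URLs. When they click "Run", a Scrapy-based script should start and use the URLs from the input file as start_urls (see, for example: https://doc.scrapy.org/en/latest/topics/spiders.html).
My script looks like this: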
import scrapy
import sys
from PyQt5 import QtCore, QtGui, QtWidgets
from PyQt5.QtWidgets import QApplication, QMainWindow, QFileDialog
from scrapy.crawler import CrawlerProcess

file = "Empty"

class MySpider(scrapy.Spider):
    global file
    name = "scriptTest"  # name of spider
    allowed_domains = ["web"]  # where is spider allowed to crawl
    start_urls = [file]  # where will spider crawl

    def parse(self):  # scrapes start_urls according to instructions and returns results
        ...

class MyGui(object):  # gives description of class type MyGui
    filename = 'Empty'
    file = []

    def setupUI(self):  # describes how base form of gui will look
        ...

    def buttons(self):  # creates buttons and connects functions to those buttons
        self.pushButton.setText(_translate("MainWindow", "Open Input File:"))  # creates button with text
        self.pushButton.clicked.connect(self.showDialog)  # connects button one to function showDialog
        self.pushButton_2.setText(_translate("MainWindow", "Run"))  # creates button2 with text
        self.pushButton_2.clicked.connect(self.runSpider)  # connects button two to function runSpider

    def showDialog(self):  # opens QFileDialog and sets global file to name of selected file
        ...

    def runSpider(self):  # should start crawling urls from selected file
        global file
        global filename

        def getUrls(filename):  # returns first column containing urls (given by gui user in showDialog) as array.
            ...

        file = getUrls()  # sets global variable file as returned value of getExcelData
        process = CrawlerProcess()  # creates object 'process' that is of type 'CrawlerProcess'
        process.crawl(MySpider)  # starts crawling
        process.start()  # the script will block here until the crawling is finished

app = QApplication(sys.argv)
window = QMainWindow()
ui = MyGui()  # creates object called 'ui' of type 'MyGui'
ui.setupUi(window)  # launches gui window
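As I said, after clicking pushButton I want to use the URLs from the selected file as the spider's start_urls. However, when I click "Run", the spider uses the value "Empty" as start_urls instead of the new value of the global variable file. I think I understand why: a class is a description of an object, so when the object is initialized it gets the attributes described in the class.
I tried to solve the problem like this: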
class MySpider:
    def __init__(self, arg):
        self.arg = arg
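But I have not found a solution yet.
Q: How can I pass the file selected by the user to the MySpider class?
Thanks in advance, and please correct me if I got something wrong! (Sorry if my code is messy or unclear, I am still learning a lot.)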
Answer 0 (score: 1)
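The line start_urls = [file] is evaluated once, when the class body of MySpider is executed. At that point the global variable file has not been updated yet, so the list keeps a reference to the previous value of file ("Empty"). Assigning a new value to file later in runSpider does not change what is already stored in start_urls.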
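To make that concrete, here is a minimal sketch that leaves Scrapy out entirely. The class name Demo and the example.com URL are made up for illustration; only the pattern matches the question's code.

```python
# Hypothetical stand-in for the spider; no Scrapy is needed to see the effect.
file = "Empty"


class Demo:
    # The class body runs once, right here, so the list captures "Empty".
    start_urls = [file]


# Rebinding the global only points the name at a new string;
# the list built above still holds the old one.
file = "https://example.com/page"
stale = list(Demo.start_urls)

# Assigning to the class attribute itself is what changes it.
Demo.start_urls = [file]
fresh = list(Demo.start_urls)

print(stale)
print(fresh)
```

The first print shows the value captured when the class was defined, the second shows the value after the attribute itself was reassigned.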
A quick fix (I am sure better solutions exist) is to update the start_urls class variable directly:
MySpider.start_urls = getUrls()
process.crawl(MySpider)  # starts crawling
The advantage is that you no longer need the global file variable.
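The body of getUrls is not shown in the question, so the following is only a sketch of what such a helper might look like. It assumes the input is a CSV file without a header row and that the URLs sit in the first column; the name load_urls and the temporary demo file are hypothetical. An Excel file would need a different reader.

```python
import csv
import os
import tempfile


def load_urls(path):
    """Return the non-empty values from the first column of a CSV file."""
    urls = []
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            # Skip blank lines and rows whose first cell is empty.
            if row and row[0].strip():
                urls.append(row[0].strip())
    return urls


# Small demo using a throwaway file in place of the user's selection.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline="", encoding="utf-8"
) as tmp:
    tmp.write("https://example.com/a,first\n\nhttps://example.com/b,second\n")
    demo_path = tmp.name

demo_urls = load_urls(demo_path)
os.remove(demo_path)
print(demo_urls)
```

In runSpider you would then pass the path chosen in the file dialog, for example MySpider.start_urls = load_urls(filename), before calling process.crawl(MySpider).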