I am trying to launch Scrapy from a .py file with this command:
py myproject.py -f C:\Users\admin\Downloads\test.csv
Here my file is named "myproject.py":
import spiders.ggspider as MySpiders
# Return array
dataFile = args.file
myData = CSVReader.getAnimalList(dataFile)
leSpider = MySpiders.GGCSpider()
leSpider.myList = myData
leSpider.start_requests()
Here is my spider file:
import scrapy
import urllib

class GGSpider(scrapy.Spider):
    name = "spiderman"
    domain = "https://www.google.fr/?q={}"
    myList = []

    def __init__(self):
        pass

    def start_requests(self):
        for leObject in self.myList:
            tmpURL = self.domain.format(urllib.parse.urlencode({'text' : leObject[0]}))
            yield scrapy.Request(url=self.domain+leObject[0], callback=self.parse)

    def parse(self, response):
        print('hello')
        print(response)
My problem is: I do enter start_requests (I print right before the yield and it shows in the console), but the callback never seems to fire (I never see "hello" printed).
I really do not know why (I am new to Python, so maybe I am missing something obvious).
Answer 0 (score: 1)
I guess it is because a generator does not actually run until you retrieve its values. You could try consuming the generator somehow:
import spiders.ggspider as MySpiders

# Return array
dataFile = args.file
myData = CSVReader.getAnimalList(dataFile)

leSpider = MySpiders.GGCSpider()
leSpider.myList = myData
for request in leSpider.start_requests():
    do_something(request)
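To see that laziness in isolation, here is a minimal, self-contained sketch; the function name make_requests and the yielded strings are made up for illustration, but the behavior is exactly what happens with start_requests():

```python
def make_requests():
    # This body does not run when the function is called --
    # only once the returned generator is actually iterated.
    print("inside generator")
    yield "request-1"
    yield "request-2"

gen = make_requests()    # nothing printed yet: the body has not started
requests = list(gen)     # iteration drives the body; the print fires here
print(requests)          # the two yielded values
```

Calling make_requests() alone produces a generator object and nothing else, which is why a yield-based start_requests() appears to "do nothing" until something iterates it.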
UPD: Here is a better example of running a Spider from a script:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess(settings={
    "FEEDS": {
        "items.json": {"format": "json"},
    },
})

process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished