I am trying to launch Scrapy from a .py file with this command:
py myproject.py -f C:\Users\admin\Downloads\test.csv
Here my file is named "myproject.py":
import spiders.ggspider as MySpiders
# Return array
dataFile = args.file
myData = CSVReader.getAnimalList(dataFile)
leSpider = MySpiders.GGCSpider()
leSpider.myList = myData
leSpider.start_requests()
Here is my spider file:
import scrapy
import urllib

class GGSpider(scrapy.Spider):
    name = "spiderman"
    domain = "https://www.google.fr/?q={}"
    myList = []

    def __init__(self):
        pass

    def start_requests(self):
        for leObject in self.myList:
            tmpURL = self.domain.format(urllib.parse.urlencode({'text' : leObject[0]}))
            yield scrapy.Request(url=self.domain+leObject[0], callback=self.parse)

    def parse(self, response):
        print('hello')
        print(response)
My problem is: I do enter start_requests (I print right before the yield and it shows in the console), but the callback never seems to fire (I never see "hello" printed).
I really do not know why (I am new to Python, so maybe I am missing something obvious).
Answer 0 (score: 1)
I guess it is because a generator does not actually run until you retrieve its values. You could try consuming the generator somehow:
import spiders.ggspider as MySpiders

# Return array
dataFile = args.file
myData = CSVReader.getAnimalList(dataFile)

leSpider = MySpiders.GGCSpider()
leSpider.myList = myData
for request in leSpider.start_requests():
    do_something(request)
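To see that laziness in isolation, here is a minimal, self-contained sketch; the function name make_requests and the yielded strings are made up for illustration, but the behavior is exactly what happens with start_requests():

```python
def make_requests():
    # This body does not run when the function is called --
    # only once the returned generator is actually iterated.
    print("inside generator")
    yield "request-1"
    yield "request-2"

gen = make_requests()    # nothing printed yet: the body has not started
requests = list(gen)     # iteration drives the body; the print fires here
print(requests)          # the two yielded values
```

Calling make_requests() alone produces a generator object and nothing else, which is why a yield-based start_requests() appears to "do nothing" until something iterates it.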
UPD: Here is a better example of running a Spider from a script:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess(settings={
    "FEEDS": {
        "items.json": {"format": "json"},
    },
})

process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished