I am trying to use the Scrapy package to take a list of URLs and scrape them. I searched Stack Overflow for an answer but could not find anything that solves the problem.
My script is as follows:
import scrapy
from scrapy import Request

class Try(scrapy.Spider):
    name = "Try"

    def __init__(self, *args, **kwargs):
        super(Try, self).__init__(*args, **kwargs)
        self.start_urls = kwargs.get("urls")
        print(self.start_urls)

    def start_requests(self):
        print(self.start_urls)
        for url in self.start_urls:
            yield Request(url, self.parse)

    def parse(self, response):
        d = response.xpath("//body").extract()
When I run the spider:
from scrapy.crawler import CrawlerProcess

spider = Try(urls=[r"https://www.example.com"])
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(spider)
process.start()
This is what gets printed for self.start_urls:

Why don't I get the URLs? Is there another way to solve this, or is there a mistake in my spider class?
Thanks for any help!
Answer 0 (score: 1)
I suggest passing the spider class to process.crawl and supplying the urls argument there.
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import Request

class Try(scrapy.Spider):
    name = 'Try'

    def __init__(self, *args, **kwargs):
        super(Try, self).__init__(*args, **kwargs)
        self.start_urls = kwargs.get("urls")

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, self.parse)

    def parse(self, response):
        d = response.xpath("//body").extract()

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(Try, urls=[r'https://www.example.com'])
process.start()
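As a side note, a slightly shorter variant should also work, assuming the stock scrapy.Spider.__init__ (which stores keyword arguments as instance attributes) and its default start_requests: pass start_urls directly and drop the custom __init__ and start_requests. This is a sketch, not part of the original answer:

import scrapy
from scrapy.crawler import CrawlerProcess

class Try(scrapy.Spider):
    name = 'Try'
    # No __init__ or start_requests needed here: scrapy.Spider stores keyword
    # arguments as attributes, and its default start_requests() iterates over
    # self.start_urls.

    def parse(self, response):
        d = response.xpath('//body').extract()

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(Try, start_urls=['https://www.example.com'])
process.start()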
Answer 1 (score: 0)
If I run

process.crawl(Try, urls=[r"https://www.example.com"])

then urls is passed to Try as I expect. I don't even need start_requests.
import scrapy
from scrapy.crawler import CrawlerProcess

class Try(scrapy.Spider):
    name = "Try"

    def __init__(self, *args, **kwargs):
        super(Try, self).__init__(*args, **kwargs)
        self.start_urls = kwargs.get("urls")

    def parse(self, response):
        print('>>> url:', response.url)
        d = response.xpath("//body").extract()

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(Try, urls=[r"https://www.example.com"])
process.start()
But if I use

spider = Try(urls=["https://www.example.com"])
process.crawl(spider)

then it looks as if it runs a new Try without urls, and the list is empty.
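A plausible explanation, as far as I understand Scrapy's crawler machinery: process.crawl() expects a spider class (or a Crawler), and it builds the spider instance itself from the arguments given to crawl(), so a pre-built instance and its start_urls are never used (recent Scrapy versions reject a spider instance outright). A simplified, self-contained sketch of that behaviour, not Scrapy's actual source:

# Toy illustration (not Scrapy's implementation) of why keyword arguments
# must travel through process.crawl(): the framework instantiates the spider
# itself from a class.

class ToySpider:
    def __init__(self, urls=None):
        self.start_urls = urls or []


class ToyCrawlerProcess:
    def crawl(self, spidercls, *args, **kwargs):
        # The spider used for crawling is built here, from the class and the
        # arguments given to crawl(). An instance built beforehand never
        # reaches this point, so its start_urls are simply never seen.
        spider = spidercls(*args, **kwargs)
        print(spider.start_urls)


ToyCrawlerProcess().crawl(ToySpider, urls=["https://www.example.com"])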