I have the following Python script using Scrapy:
import scrapy

class ChemSpider(scrapy.Spider):
    name = "site"

    def start_requests(self):
        urls = [
            'https://www.site.com.au'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        category_links = response.css('li').xpath('a/@href').getall()
        category_links_filtered = [x for x in category_links if 'shop-online' in x]  # remove non category links
        category_links_filtered = list(dict.fromkeys(category_links_filtered))  # remove duplicates
        for category_link in category_links_filtered:
            if "medicines" in category_link:
                next_page = response.urljoin(category_link) + '?size=10'
                self.log(next_page)
                yield scrapy.Request(next_page, callback=self.parse_subcategories)

    def parse_subcategories(self, response):
        for product in response.css('div.Product'):
            yield {
                'category_link': response.url,
                'product_name': product.css('img::attr(alt)').get(),
                'product_price': product.css('span.Price::text').get().replace('\n', '')
            }
My solution will be to run multiple instances of this script, each scraping a different subset of information from a different "category". I know you can run Scrapy from the command line and have it output to a JSON file, but I want to write the output to a file from within the function, so that each instance writes to a different file. As a beginner in Python, I'm not sure where in the script to do this. I need to get the yielded output into a file while the script is executing. How can I achieve this? Hundreds of rows will be scraped, and I'm not familiar enough with how yield works to understand how to "return" a set (or list) of data from it that could then be written to a file.
Answer 0 (score: 0)
You want to append to a file. However, since appending is a write I/O operation, you need to lock the file while it is being written so that other processes cannot write to it at the same time.
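On Linux or macOS that locking could look roughly like this (a sketch only: fcntl is POSIX-only, and the file name and the line variable are just placeholders):

import fcntl

line = '{"product_name": "example"}'  # placeholder for one scraped row

# several processes appending to the same file: take an exclusive lock
# around each write so lines from different spiders don't interleave
with open('output.jl', 'a') as f:
    fcntl.flock(f, fcntl.LOCK_EX)
    f.write(line + '\n')
    fcntl.flock(f, fcntl.LOCK_UN)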
The simplest approach is to have each instance write to a different, randomly named file in a directory, and then concatenate the files with another process.
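A minimal sketch of that idea (the directory name, helper names and the .jl extension are just examples):

import glob
import os
import uuid

def write_chunk(rows, out_dir='scrape_output'):
    # each spider instance writes its rows to its own randomly named file,
    # so no two processes ever touch the same file
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, uuid.uuid4().hex + '.jl')
    with open(path, 'w') as f:
        for row in rows:
            f.write(row + '\n')
    return path

def concatenate(out_dir='scrape_output', combined='combined.jl'):
    # run this from a separate process once all the spiders have finished
    with open(combined, 'w') as out:
        for part in sorted(glob.glob(os.path.join(out_dir, '*.jl'))):
            with open(part) as f:
                out.write(f.read())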
Answer 1 (score: 0)
First, let me suggest a couple of changes to your code. If you want to remove duplicates, you can use a set, like this:
category_links_filtered = (x for x in category_links if 'shop-online' in x) # remove non category links
category_links_filtered = set(category_links_filtered) # remove duplicates
Note that I also changed the [ to ( so that the expression produces a generator instead of a list, which saves some memory. You can read more about generators here: https://www.python-course.eu/python3_generators.php
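As a toy illustration of the difference (not part of the spider):

import sys

category_links = ['https://www.site.com.au/shop-online/medicines'] * 1000

as_list = [x for x in category_links if 'shop-online' in x]  # builds every element in memory at once
as_gen = (x for x in category_links if 'shop-online' in x)   # produces elements one at a time, on demand

print(sys.getsizeof(as_list))  # grows with the number of links
print(sys.getsizeof(as_gen))   # stays small no matter how many links there are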
OK, the solution to your problem is to use an item pipeline (https://docs.scrapy.org/en/latest/topics/item-pipeline.html), which runs some code for every item yielded by your parse_subcategories function. What you need to do is add a class to the pipelines.py file and enable that pipeline in settings.py. Like this:

In settings.py:
ITEM_PIPELINES = {
    'YOURBOTNAME.pipelines.CategoriesPipeline': 300,  # the number here is the priority of the pipeline, don't worry and just leave it
}
In pipelines.py:
import json
from urllib.parse import urlparse  # standard-library module for parsing urls (urlparse in Python 2)

class CategoriesPipeline(object):
    # This pipeline picks the output file name dynamically, from a spider
    # attribute or from the category name in the start url
    def open_spider(self, spider):
        if hasattr(spider, 'filename'):
            # the filename is an attribute set by -a filename=somefilename
            filename = spider.filename
        else:
            # you could also set the name dynamically from the start url,
            # if you set -a start_url=https://www.site.com.au/category-name
            try:
                filename = urlparse(spider.start_url).path[1:].replace(' ', '_')  # this returns 'category-name', with spaces replaced by _
            except AttributeError:
                spider.crawler.engine.close_spider(spider, reason='no start url')  # this should not happen
                return
        self.file = open(filename + '.jl', 'w')

    def close_spider(self, spider):
        if hasattr(self, 'file'):
            self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
And in spiders/YOURBOTNAME.py, make these changes:
class ChemSpider(scrapy.Spider):
    name = "site"

    def __init__(self, start_url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if start_url is None:
            raise ValueError('no start url')  # we need a start url
        self.start_url = start_url  # the pipeline reads this to build the file name
        self.start_urls = [start_url]  # see why this works on https://docs.scrapy.org/en/latest/intro/tutorial.html#a-shortcut-for-creating-requests

    def parse(self, response):
        # ...
Then start scraping with: scrapy crawl site -a start_url=https://www.site.com.au/category-name, optionally adding -a filename=somename.
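If you prefer to launch the instances from a single Python script instead of several terminal commands, a sketch using Scrapy's CrawlerProcess could look like this (the category URLs and file names are just examples, and it assumes the spider and pipeline shown above):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
# each call gets its own start_url/filename, so each instance writes to its own file
process.crawl('site', start_url='https://www.site.com.au/shop-online/medicines', filename='medicines')
process.crawl('site', start_url='https://www.site.com.au/shop-online/vitamins', filename='vitamins')
process.start()  # blocks until all the crawls have finished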