I'm trying to scrape all the information from this site ("https://www.karl.com/experience/en/?yoox_storelocator_action=true&action=yoox_storelocator_get_all_stores") but I can't write it to a file. The file isn't even created. Here's my code:
import scrapy # Scraper
import json # JSON manipulation
import jsonpickle # Object serializer

class Karl(scrapy.Spider):
    # Needed vars
    name = 'Karl' # Spider's name
    url = "https://www.karl.com/experience/en/?yoox_storelocator_action=true&action=yoox_storelocator_get_all_stores"
    start_url = [
        url,
    ]

    # Called by Scrapy itself
    def parse(self, response):
        filename = '%s.json' % self.name
        response = json.loads(response.body)
        response = jsonpickle.encode(response)
        with open(filename, 'w') as f: # Save the JSON file created
            f.write(response)
When I run scrapy crawl Karl, these are the last lines I get:
2018-07-24 16:02:25 [scrapy.core.engine] INFO: Spider opened
2018-07-24 16:02:26 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-24 16:02:26 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-07-24 16:02:26 [scrapy.core.engine] INFO: Closing spider (finished)
2018-07-24 16:02:26 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 7, 24, 14, 2, 26, 861204),
'log_count/DEBUG': 1,
'log_count/INFO': 7,
'memusage/max': 54804480,
'memusage/startup': 54804480,
'start_time': datetime.datetime(2018, 7, 24, 14, 2, 26, 550318)}
Can you help me? This is my first time using Scrapy. Thanks!
Answer 0 (score: 1)
There is a bug in your spider: start_url should be start_urls, and you also need an allowed_domains variable. There is also no need to declare url separately.
Your code should be:
class Karl(scrapy.Spider):
    name = 'Karl'
    start_urls = ["https://www.karl.com/experience/en/?yoox_storelocator_action=true&action=yoox_storelocator_get_all_stores"]
    allowed_domains = ["karl.com"]

    ## Snip ##
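One detail worth noting: allowed_domains is expected to be a list (or tuple) of domain strings, not a bare string. Scrapy iterates over the value, so a bare string would be walked character by character. A quick illustration in plain Python (no Scrapy needed):

```python
# allowed_domains must be a sequence of domain strings, not a bare string:
# Scrapy iterates over it, so a string would be treated as a sequence of
# single characters rather than one domain.
domains_wrong = "karl.com"
domains_right = ["karl.com"]

print(list(domains_wrong)[:4])  # ['k', 'a', 'r', 'l'] -- each char taken as a "domain"
print(list(domains_right))      # ['karl.com']
```

That is why the corrected spider uses ["karl.com"] rather than "karl.com".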
You can also generate a new spider with scrapy genspider, which uses the default template and might be helpful in this case.