New to Python and crawling. For some reason, the code below never executes the second function when it is called; it doesn't even output the "test" print statement.
The main parse runs fine; it's only the call to that function that fails, and I've tried invoking it many different ways, to no avail.
import scrapy
from myproject.items import MyHierarchyItem

class MySpider(scrapy.Spider):
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = ['example.com']

    def parse(self, response):
        print("Starting parse_hierarchy")
        HierarchyItem = MyHierarchyItem()
        StartLvl3URLS = []
        sitemap = response.css("div.sitemap-content > div.row")
        for lvl1 in sitemap:
            HierarchyItem["hierarchy_lvl1_name"] = lvl1.css("h2::text").extract()
            #print(lvl1.css("h2::text").extract())
            currentlvl2 = lvl1.css("li.span-6")
            for lvl2 in currentlvl2:
                HierarchyItem["hierarchy_lvl2_name"] = lvl2.css("h4::text").extract()
                currentlvl3 = lvl2.css("ul.child > li")
                #print(lvl2.css("h4::text").extract())
                for lvl3 in currentlvl3:
                    #print(lvl3.css("a::text").extract())
                    #print(lvl3.css("a::attr(href)").extract())
                    HierarchyItem["hierarchy_lvl3_name"] = lvl3.css("a::text").extract()
                    HierarchyItem["hierarchy_url"] = lvl3.css("a::attr(href)").extract()
                    StartLvl3URLS.append(HierarchyItem["hierarchy_url"])
                    yield HierarchyItem
        # follow the first collected level-3 URL into the second callback
        full_link = StartLvl3URLS[0]
        #for lvl3 in StartLvl3URLS
        yield scrapy.Request(str(full_link), self.parse_category)

    def parse_category(self, response):
        print("test")
        print(len(response.body))
        print(response.body)
Log extract:
2017-04-08 23:58:03 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.example.com/sitemap>
{'hierarchy_lvl1_name': ['cat1'],
'hierarchy_lvl2_name': ['cat2'],
'hierarchy_lvl3_name': ['cat3'],
'hierarchy_url': ['http://www.example.com/cat1/cat2/cat3']}
2017-04-08 23:58:03 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-08 23:58:03 [scrapy.extensions.feedexport] INFO: Stored csv feed (445 items) in: hierarchy.csv
2017-04-08 23:58:03 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 205,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 24223,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 4, 8, 13, 58, 3, 154254),
'httpcache/hit': 1,
'item_scraped_count': 445,
'log_count/DEBUG': 447,
'log_count/INFO': 8,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 4, 8, 13, 58, 2, 614750)}
2017-04-08 23:58:03 [scrapy.core.engine] INFO: Spider closed (finished)
Answer 0 (score: 0)
As far as I know, Scrapy does not surface output written with the print() method. You can do:

import logging
logging.info("message here")
logging.error("message here")
logging.warning("message here")
Also, disable JavaScript in your browser and open the site you are scraping; then check whether the div.sitemap-content > div.row selector actually returns any elements.
Answer 1 (score: 0)
Found the problem. Because I was using extract(), the output is a list, so I ended up with a list inside a list (with only one element); str(full_link) therefore produced the text form of a list rather than a valid URL, and the Request never called the page. Changing it to extract_first() makes it work correctly:

HierarchyItem["hierarchy_url"] = lvl3.css("a::attr(href)").extract_first()