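I'm trying to work out how to structure my code so that I can scrape multiple pages nested within multiple pages. Here is what I mean:
As mentioned, I'm struggling to understand what the code structure should look like. Part of the problem is that I don't fully understand how Python's control flow works. Would something like this be correct?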
def parse
    Get URL of all the alphabet letters
    pass on the URL to parse_A
def parse_A
    Get URL of all pages for that alphabet letter
    pass on the URL to parse_B
def parse_B
    Get URL for all breeds listed on that page of that alphabet letter
    pass on the URL to parse_C
def parse_C
    Get URL for all the pages of dogs listed of that specific breed
    pass on the URL to parse_D
def parse_D
    Get URL of specific for sale listing of that dog breed on that page
    pass on the URL to parse_E
def parse_E
    Get all of the details for that specific listing
    Callback to ??
For the final callback in parse_E, should I point the callback at parse_D or at the first parse?
Thanks!
Answer 0 (score: 2)
With Scrapy, you would follow a structure along these lines.
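import scrapy


class BreedSpider(scrapy.Spider):
    name = "breeds"  # hypothetical spider name
    start_urls = []  # fill in the page that lists the alphabet letters

    def parse(self, response):
        """Collect the URLs linked from the alphabet letters (breed_urls)."""
        breed_urls = []  # placeholder: extract these URLs from response
        for url in breed_urls:
            yield scrapy.Request(url=url, callback=self.parse_sub_urls)

    def parse_sub_urls(self, response):
        """Collect the listing URLs on a sub page (sub_urls) and follow pagination."""
        sub_urls = []  # placeholder: extract these URLs from response
        for url in sub_urls:
            yield scrapy.Request(url=url, callback=self.parse_details)

        # If there is a next page, handle it with this same callback.
        next_page = None  # placeholder: extract the next page URL from response
        if next_page:
            yield scrapy.Request(url=next_page, callback=self.parse_sub_urls)

    def parse_details(self, response):
        """Extract the final details from the listing page."""
        details = {}
        name = None  # placeholder: extract the listing name from response
        details['name'] = name
        # Parse all other details and add them to the dictionary.
        # The last callback only yields the item; it does not call back anywhere.
        yield details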