Question

I'm struggling to figure out the code structure I need in order to scrape multiple pages nested within multiple pages. Here's what I mean:

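I start from a home page that has a URL for every letter of the alphabet. Each letter is the first letter of a dog breed's name.
Each letter has multiple pages of dog breeds. I need to go into every breed page.
For each breed, there are multiple pages of dogs for sale. I need to extract data from every for-sale listing page.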

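As mentioned, I'm struggling to understand what the code structure should look like. Part of the problem is that I don't fully understand how Python code flow works. Would something like this be correct?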

def parse
       Get URL of all the alphabet letters
       pass on the URL to parse_A

def parse_A
      Get URL of all pages for that alphabet letter
      pass on the URL to parse_B

def parse_B
      Get URL for all breeds listed on that page of that alphabet letter
      pass on the URL to parse_C

def parse_C
      Get URL for all the pages of dogs listed of that specific breed
      pass on the URL to parse_D

def parse_D
      Get URL of specific for sale listing of that dog breed on that page
      pass on the URL to parse_E

def parse_E
     Get all of the details for that specific listing
     Callback to ??

For the final callback in parse_E, should I direct the callback to parse_D or to the first parse?

Thanks!

Answer 1

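With Scrapy, you can follow a structure like the one below. Each callback yields new requests for the next level down, and a callback that handles a paginated page also yields a request back to itself for the next page. The final callback does not call back to anything; it simply yields the scraped item.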

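import scrapy


class DogSpider(scrapy.Spider):
    name = "dogs"  # placeholder spider name
    # Set start_urls to the home page that lists the alphabet letters.

    def parse(self, response):
        """
        Get the URLs behind the alphabet letters (breed_urls).
        """
        # Placeholder selector: adjust it to the real page markup.
        breed_urls = response.css("a::attr(href)").getall()
        for url in breed_urls:
            yield scrapy.Request(url=response.urljoin(url), callback=self.parse_sub_urls)

    def parse_sub_urls(self, response):
        """
        Get the URLs listed on a sub-page (sub_urls), then follow pagination.
        """
        # Placeholder selector: adjust it to the real page markup.
        sub_urls = response.css("a::attr(href)").getall()
        for url in sub_urls:
            yield scrapy.Request(url=response.urljoin(url), callback=self.parse_details)

        # Placeholder selector for the "next page" link.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            # Same callback again, so every page of this level gets parsed.
            yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse_sub_urls)

    def parse_details(self, response):
        """
        Get the final details from the listing page.
        """
        details = {}
        # Placeholder selector: adjust it to the real page markup.
        details["name"] = response.css("h1::text").get()

        # Parse all other details and add them to the dictionary.

        # Yielding the item ends the chain; no further callback is needed.
        yield details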
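
Since part of the question is about how the code flow works, here is a small plain-Python sketch of the idea. It is a toy model and not Scrapy's actual engine: `FAKE_SITE`, the stand-in `Request` class, the `crawl` loop, and all the URLs are made up for illustration. It only shows that callbacks are generators, that yielded requests get queued and later handed to their callback, and that the last callback yields an item instead of another request.

```python
from collections import deque

# Hypothetical in-memory "site": url -> (links on the page, next-page url or None).
FAKE_SITE = {
    "home": (["letter/A", "letter/B"], None),
    "letter/A": (["breed/akita"], None),
    "letter/B": (["breed/beagle"], None),
    "breed/akita": (["listing/1"], "breed/akita?page=2"),
    "breed/akita?page=2": (["listing/2"], None),
    "breed/beagle": (["listing/3"], None),
}


class Request:
    """Stand-in for a request object: a URL plus the callback that handles it."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


def parse(url):
    # Home page: one request per alphabet letter.
    links, _ = FAKE_SITE[url]
    for link in links:
        yield Request(link, parse_letter)


def parse_letter(url):
    # Letter page: one request per breed, plus the next page of this letter.
    links, next_page = FAKE_SITE[url]
    for link in links:
        yield Request(link, parse_breed)
    if next_page:
        yield Request(next_page, parse_letter)


def parse_breed(url):
    # Breed page: one request per listing, plus the next page of this breed.
    links, next_page = FAKE_SITE[url]
    for link in links:
        yield Request(link, parse_listing)
    if next_page:
        yield Request(next_page, parse_breed)


def parse_listing(url):
    # Last level: yield the scraped item; there is nothing to call back to.
    yield {"listing": url}


def crawl(start_url):
    """Toy engine loop: run callbacks, queue yielded requests, collect items."""
    queue = deque([Request(start_url, parse)])
    items = []
    while queue:
        request = queue.popleft()
        for result in request.callback(request.url):
            if isinstance(result, Request):
                queue.append(result)
            else:
                items.append(result)
    return items


items = crawl("home")
print(items)
```

In this model the callbacks never call each other directly. Each one hands requests back to the loop, which is why the last callback does not need to point back at parse_D or at the first parse.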

Scraping multiple pages within multiple pages (Scrapy)
