Spider does not call itself recursively after the callback is set

Time: 2018-07-08 16:59:06

Tags: scrapy scrapy-spider

The goal of my project is to search websites for company phone numbers.

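I am trying to parse web pages, pull out phone numbers with a regular expression (I am still working on that part), and then look for links on the page. Those links are the ones I want to follow recursively: I call the same function on each of them and repeat. However, the function only runs once. See the code below: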

import scrapy  # needed for scrapy.Request below

def parse(self, response):
    # The main callback of the spider. It scrapes the URL(s) listed in the
    # spider's 'start_urls' attribute. The content of each scraped URL is
    # passed in as the 'response' object.

    print('here')
    # Note: the unescaped '.' matches any single character as a separator.
    for phone_num in response.xpath('//body').re(r'\d{3}.\d{3}.\d{4}'):
        item = PhoneNumItem()
        item['label'] = "a"
        item['phone_num'] = phone_num
        yield item

    # response.xpath is used directly instead of the deprecated
    # HtmlXPathSelector wrapper.
    for url in response.xpath('//a/@href').extract():
        # Loop through all the URLs found on the page and build an absolute
        # URL by combining the response's URL with a possibly relative URL.
        next_page = response.urljoin(url)
        print("Found URL: " + next_page)

        #yield response.follow(next_page, self.parse_page)
        # Schedule the linked page with this same method as its callback.
        yield scrapy.Request(next_page, callback=self.parse)

Please let me know what you think. To me, this code looks like it should work, but it does not.
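For reference, here is a minimal sketch of the two steps the callback relies on, phone-number extraction and link resolution, written with only the standard library so they can be checked outside Scrapy. The sample HTML, the base URL, the helper names, and the naive href pattern are all hypothetical and only for illustration; the spider itself would keep using response.xpath and response.urljoin.

```python
import re
from urllib.parse import urljoin

# Same pattern as in the question: three digits, any one character,
# three digits, any one character, four digits.
PHONE_RE = re.compile(r'\d{3}.\d{3}.\d{4}')

# Hypothetical, deliberately naive href matcher used only for this sketch.
HREF_RE = re.compile(r'href="([^"]+)"')


def extract_phone_numbers(text):
    """Return every substring of text that matches the phone pattern."""
    return PHONE_RE.findall(text)


def extract_links(base_url, html):
    """Return absolute URLs for every href found in html."""
    return [urljoin(base_url, href) for href in HREF_RE.findall(html)]


# Hypothetical sample page and URL (assumptions, not from the question).
SAMPLE_HTML = (
    '<body>Call 555-123-4567 or 555.987.6543 '
    '<a href="/contact">Contact</a> '
    '<a href="about.html">About</a></body>'
)
BASE_URL = "http://example.com/company/index.html"

phones = extract_phone_numbers(SAMPLE_HTML)
links = extract_links(BASE_URL, SAMPLE_HTML)
print(phones)
print(links)
```

One thing this makes easy to see: because the '.' in the pattern is unescaped, it also matches a digit, so a run of twelve digits such as 123456789012 is reported as a phone number. If only specific separators are wanted, a character class such as [-. ] in place of each '.' would be one option.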

0 answers:

No answers yet