Scrapy: requesting a different HTML page inside parse()

Date: 2016-06-22 23:36:58

Tags: python scrapy scrapy-spider

Is it possible to request new HTML while inside parse?

My code currently reads HTML links from a CSV file and appends them all to the start_urls list.

What I want to happen is this: when it takes a link from start_urls, it parses it and loops through all of its pages until a condition inside the loop is met, then it breaks out of the whole loop and continues with the next item in the start_urls list.

import csv

with open(r'.\scrappy_demo.csv', 'rb') as csvfile:
    # open the CSV file of links here
    linkreader = csv.reader(csvfile, dialect=csv.excel)
    for row in linkreader:
        start_url.append(str(row)[2:-2] + "/search?page=1")
        i += 1

import scrapy
from scrapy import Request
from scrapy.http import HtmlResponse

class demo(scrapy.Spider):
    ...
    def parse1(self, response):
        return response

    def parse(self, response):
        i = 0
        j = 0
        ENDLOOP = False
        ...
        while next_page <> current_page and not ENDLOOP:
            entry_list = response.css('.entry__row-inner-wrap').extract()

            while i < len(entry_list) and not ENDLOOP:
                [Doing some css/xpath filtering here]
                if [Some condition here]:
                    [Doing some file write here]
                    ENDLOOP = True
                i += 1

            j += 1
            nextPage = url_redir[:-1] + str(j + 1)
            body = Request(nextPage, callback=self.parse1)
            response2 = HtmlResponse(nextPage, body=body)

In the last two lines I am trying to request the new HTML, with the page number incremented by 1. But when the code runs, it never returns the requested HTML. What am I missing here?

Note: I tried checking the values of body and response2 with prints, but it looks like body.body is empty and the callback never executes.

Note 2: this is my first time using Scrapy.

Note 3: I know the code breaks on two-digit page numbers, but never mind that for now.
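
(Background on those two lines: constructing a scrapy Request does not download anything by itself, which is why body.body comes back empty. The fetch only happens after the request is returned or yielded from a callback; the engine then downloads the page and calls the request's callback with a real HtmlResponse. A minimal sketch of that hand-off, where the URL and the page arithmetic are placeholders, not the original code:

import scrapy
from scrapy import Request

class DemoSketch(scrapy.Spider):
    name = "demo_sketch"
    start_urls = ["http://example.com/search?page=1"]  # placeholder URL

    def parse(self, response):
        # keeps the question's single-digit page assumption
        next_page = response.url[:-1] + str(int(response.url[-1]) + 1)
        # nothing is downloaded on this line; a Request is only a
        # description of a fetch
        req = Request(next_page, callback=self.parse1)
        # yielding hands it to the Scrapy engine, which fetches the
        # page and then calls self.parse1 with the resulting response
        yield req

    def parse1(self, response):
        self.log(response.url)
)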

2 Answers:

Answer 0 (score: 0):

I think you can do something like this. Note that the while statements are replaced and the code looks cleaner. You might want a recursive algorithm to solve this, but I am not sure exactly what you are trying to do. The code below is untested.

    ...
    for url in start_urls:
        entry_list = response.css('.entry__row-inner-wrap').extract()

        for innerurl in entry_list:
            [Doing some css/xpath filtering here]
            if [Some condition here]:
                [Doing some file write here]
                ENDLOOP = True

        body = Request(nextPage, callback=self.parse1)
        response2 = HtmlResponse(url, body=body)
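
(Sketching the recursive idea mentioned above: instead of building an HtmlResponse by hand, each pass yields a Request for the next page back to the same callback, and simply returns once the condition is met. some_condition, write_entry and compute_next_page below are hypothetical placeholders, untested:

def parse(self, response):
    for entry in response.css('.entry__row-inner-wrap').extract():
        # [css/xpath filtering on entry]
        if some_condition(entry):    # hypothetical placeholder
            write_entry(entry)       # hypothetical placeholder
            return                   # done: move on to the next start_url
    # condition not met on this page: "recurse" by scheduling the
    # next page with the same callback
    yield Request(compute_next_page(response.url), callback=self.parse)
)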

Answer 1 (score: 0):

Fixed it after 4 hours.

I learned how to use Scrapy's self.make_requests_from_url together with Python's yield.

Fixed code:

with open(r'C:\Users\MDuh\Desktop\scrape\scrappy_demo.csv', 'rb') as csvfile:
    linkreader = csv.reader(csvfile, dialect=csv.excel)
    for row in linkreader:
        start_url.append(str(row)[2:-2] + "/search?page=1")
        i += 1

def parse(self, response):
    i = 0
    ENDLOOP = False
    ...
    [Pagination CSS/XPath filtering here]
    next_page = int(pagination_response[pagination_parse:pagination_response[pagination_parse:].find('"') + pagination_parse])
    current_page = int(str(response.url)[-((len(str(response.url))) - (str(response.url).find('?page=')) - 6):])

    giveaway_list = response.css('.giveaway__row-inner-wrap').extract()
    while i < len(giveaway_list) and not ENDLOOP:
        if [CONDITION HERE]:
            [FILE WRITE HERE]
            ENDLOOP = True
            raise StopIteration
        i += 1

    if next_page <> current_page and not ENDLOOP:
        yield self.make_requests_from_url(url_redir[:-(len(url_redir) - url_redir.find("page=") - 5)] + str(next_page))

    base_url[:-(len(base_url) - base_url.find("/search?page="))]
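
(For what it's worth, make_requests_from_url is a thin wrapper that builds a Request with dont_filter=True and the spider's parse method as the default callback; the helper was deprecated in later Scrapy releases. So the pagination step above can equivalently be written with Request directly:

# assumes "from scrapy import Request" at the top of the spider module
if next_page != current_page and not ENDLOOP:
    next_url = url_redir[:-(len(url_redir) - url_redir.find("page=") - 5)] + str(next_page)
    # equivalent to yield self.make_requests_from_url(next_url);
    # with no callback given, Scrapy falls back to self.parse
    yield Request(next_url, dont_filter=True)
)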