Is it possible to request new HTML while parsing inside a spider?
My code currently reads HTML links from a CSV file and puts all of them in the start_urls list.
What I want to happen is: when the spider takes a link from start_urls, it parses it and loops over all of its pages until a condition inside the loop is met, then breaks out of the whole loop and continues with the next link in the start_urls list.
with open('.\scrappy_demo.csv', 'rb') as csvfile:
    # Open CSV file here
    linkreader = csv.reader(csvfile, dialect=csv.excel)
    for row in linkreader:
        start_url.append(str(row)[2:-2] + "/search?page=1")
        i += 1
class demo(scrapy.Spider):
    ...
    def parse1(self, response):
        return response

    def parse(self, response):
        i = 0;
        j = 0;
        ENDLOOP = False
        ...
        while(next_page <> current_page and not ENDLOOP):
            entry_list = response.css('.entry__row-inner-wrap').extract()
            while (i < len(entry_list) and not ENDLOOP):
                [Doing some css, xpath filtering here]
                if([Some Condition here]):
                    [Doing some file write here]
                    ENDLOOP = True
                i += 1
            j += 1
            nextPage = url_redir[:-1]+str(j+1)
            body = Request(nextPage, callback=self.parse1)
            response2 = HtmlResponse(nextPage, body=body)
In the last two lines I am trying to request new HTML, with the page number incremented by 1. But when the code runs, it never returns the requested HTML. What am I missing here?
Note: I tried printing the values of body and response2, but it looks like body.body is empty and the callback is never executed.
Note 2: this is my first time using Scrapy.
Note 3: I know the code fails on two-digit page numbers, but never mind that for now.

Answer 0 (score: 0)
I think you can do something like this. Note that the while statements are replaced with for loops, which makes the code read better. You may want to use a recursive approach instead, but I'm not sure exactly what you are trying to do. The code below is untested.
...
for url in start_urls:
    entry_list = response.css('.entry__row-inner-wrap').extract()
    for innerurl in entry_list:
        [Doing some css, xpath filtering here]
        if([Some Condition here]):
            [Doing some file write here]
            ENDLOOP = True
    body = Request(nextPage, callback=self.parse1)
    response2 = HtmlResponse(url, body=body)
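A note on why neither version actually fetches anything: in Scrapy, constructing a Request object does not download a page by itself, and wrapping it in an HtmlResponse does not either. The Request has to be returned or yielded from a spider callback so the engine can schedule it and call the callback with the downloaded response. Below is a minimal, untested sketch of that pattern for this pagination problem; the '.entry__title' selector, the "target" condition, the results file name, and the page-number arithmetic are placeholders, not anything from the question:

import scrapy

class DemoSpider(scrapy.Spider):
    name = "demo"
    # start_urls would be filled from the CSV exactly as in the question
    start_urls = ["https://example.com/search?page=1"]  # placeholder

    def parse(self, response):
        for entry in response.css('.entry__row-inner-wrap'):
            # Stand-in for "[Some Condition here]"
            if entry.css('.entry__title::text').extract_first() == "target":
                # Stand-in for "[Doing some file write here]"
                with open("results.txt", "a") as out:
                    out.write(response.url + "\n")
                return  # condition met: stop paginating this start_url

        # No match on this page: yield a new Request so Scrapy downloads
        # the next page and calls parse() again with that response.
        current = int(response.url.split("page=")[-1])
        next_page = response.url.split("page=")[0] + "page=" + str(current + 1)
        yield scrapy.Request(next_page, callback=self.parse)

Because each start_url starts its own chain of follow-up requests, returning early only stops pagination for that one link; the other start_urls keep being processed, which is the behaviour the question asks for.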
Answer 1 (score: 0)
Fixed it after 4 hours. I learned how to use Scrapy's self.make_requests_from_url together with Python's yield statement.
Fixed code:
with open('C:\Users\MDuh\Desktop\scrape\scrappy_demo.csv', 'rb') as csvfile:
    linkreader = csv.reader(csvfile, dialect=csv.excel)
    for row in linkreader:
        start_url.append(str(row)[2:-2]+"/search?page=1")
        i += 1

def parse(self, response):
    i = 0;
    ENDLOOP = False
    ...
    [Pagination CSS/XPATH filtering here]
    next_page = int(pagination_response[pagination_parse:pagination_response[pagination_parse:].find('"')+pagination_parse])
    current_page = int(str(response.url)[-((len(str(response.url)))-(str(response.url).find('?page='))-6):])
    giveaway_list = response.css('.giveaway__row-inner-wrap').extract()
    while (i < len(giveaway_list) and not ENDLOOP):
        if([CONDITION HERE]):
            [FILE WRITE HERE]
            ENDLOOP = True
            raise StopIteration
        i += 1
    if(next_page <> current_page and not ENDLOOP):
        yield self.make_requests_from_url(url_redir[:-(len(url_redir)-url_redir.find("page=")-5)]+str(next_page))

base_url[:-(len(base_url)-base_url.find("/search?page="))]
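For readers on newer Scrapy versions: self.make_requests_from_url(url) is essentially a shorthand for Request(url, dont_filter=True) dispatched to the spider's default parse callback, and it was later deprecated in favour of overriding start_requests(). The CSV loading at the top could also live there instead of module-level string slicing. A rough, untested sketch under those assumptions (the file path and one-URL-per-row layout are guesses, not the asker's actual data):

import csv
import scrapy

class DemoSpider(scrapy.Spider):
    name = "demo"

    def start_requests(self):
        # Assumed layout: one URL per row, in the first column.
        with open("scrappy_demo.csv") as csvfile:
            for row in csv.reader(csvfile, dialect=csv.excel):
                if row:
                    yield scrapy.Request(row[0] + "/search?page=1", callback=self.parse)

    def parse(self, response):
        # Pagination / condition handling as in the fixed code above;
        # the follow-up page would be requested with
        #     yield scrapy.Request(next_page_url, dont_filter=True)
        # which is what make_requests_from_url amounted to.
        pass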