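I followed the article "SCRAPING WEBSITES BASED ON VIEWSTATES WITH SCRAPY" to scrape an almost identical website. It works well, but my website has a lot of items and therefore a lot of pagination. I can move to another page, but only if that page is visible in the pager of the page I am currently on. The pager shows at most 10 pages, which means the ViewState from page 1 is only valid for the first 10 pages. When I request a later page (for example page 14), no data comes back because the request is still using the ViewState from page 1.
Here is the code. It first fetches page 1, then uses that response to go to the last page and determine the number of pages. Then I loop over every page. Inside the loop, the response being passed is the one from the last page, and that response is only valid for the last 10 pages, which are the ones visible in the last page's pager.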
def parse(self, response):
    # Fetch the first page from the site
    formdata = update_formdata(FORM_DATA, response)
    formdata["ctl00$Body$ButtonSubmit"] = "Submit"
    # Pass the formdata to mimic what a user does in the browser
    yield scrapy.FormRequest(
        response.url, formdata=formdata, callback=self.parse_first_page
    )

def parse_first_page(self, response):
    # Get the actual data of the first page
    yield from get_data(response)
    # Check if there is a last page
    last_page = response.css(
        "tr.pager table td:last-child a::text"
    ).get() or response.css(
        "tr.pager table td:last-child a font::text"
    ).get()
    if last_page is not None:
        last_page = last_page.strip().lower()
        if last_page == "last page":
            # Load the data for the last page
            formdata = update_formdata(FORM_DATA, response)
            formdata["__EVENTARGUMENT"] = "Page$Last"
            formdata["__EVENTTARGET"] = "ctl00$Body$GridView1"
            if formdata.get("ctl00$Body$ButtonSubmit", None) is not None:
                del formdata["ctl00$Body$ButtonSubmit"]
            yield scrapy.FormRequest(
                response.url,
                formdata=formdata,
                callback=self.parse_last_page,
            )
        elif last_page.isdigit():
            last_page_num = int(last_page)
            yield from self.parse_other_pages(
                last_page_num, response
            )
    else:
        self.logger.error("No last Page")

def parse_last_page(self, response):
    # Get the actual data of the last page
    yield from get_data(response)
    # Get the last page number
    last_page_num = response.css(
        "tr.pager table td:last-child span::text"
    ).get()
    if last_page_num is not None:
        counter = int(last_page_num) - 1
        yield from self.parse_other_pages(counter, response)

def parse_other_pages(self, page_num, response):  # last page response
    # Loop through all the remaining pages, from page_num down to 2.
    # This uses the last page's response; it needs to change to the
    # current page's response??
    while page_num >= 2:
        formdata = update_formdata(FORM_DATA, response)
        formdata["__EVENTTARGET"] = "ctl00$Body$GridView1"
        if formdata.get("ctl00$Body$ButtonSubmit", None) is not None:
            del formdata["ctl00$Body$ButtonSubmit"]
        formdata.update(__EVENTARGUMENT="Page$" + str(page_num))
        page_num_cpy = page_num
        page_num -= 1
        yield scrapy.FormRequest(
            response.url,
            method="POST",
            formdata=formdata,
            callback=self.parse_results,
            dont_filter=True,
            headers=HEADERS,
            priority=1,
            meta={"page_num": page_num_cpy},
        )

def parse_results(self, response):  # current page response
    # Get the actual data for all the other pages
    yield from get_data(response)
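Edit: The question "how to scrape a page request using Viewstate parameter?" explains what I am already doing. My problem is not how to get the ViewState from a response and pass it to the next request; I can already do that. My problem is that I need to update the response inside the loop so that each request carries the previous page's ViewState. Right now the loop only ever uses the last page's response, whose ViewState stops being valid once I am more than 10 pages away.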
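To make the problem concrete, here is a toy model in plain Python (deliberately not Scrapy). Everything in it is an assumption made for illustration: the `FakeGridSite` class, the 10-page window, the idea that a page also links to the first page of the next block, and the simplification that a ViewState just records which page issued it. It walks forward from page 1 rather than backward from the last page, but the constraint is the same. It compares reusing one response for every request against building each request from the response of the previous page.

```python
WINDOW = 10  # assumption: the pager renders links for 10 pages at a time


class FakeGridSite:
    """Hypothetical stand-in for the real site: a postback is accepted only
    if the requested page was linked from the page that issued the ViewState."""

    def __init__(self, total_pages):
        self.total_pages = total_pages

    def linked_pages(self, page):
        # Pages in the same block of WINDOW pages, plus the first page of the
        # next block when there is one (only forward links are modelled).
        start = (page - 1) // WINDOW * WINDOW + 1
        end = min(start + WINDOW, self.total_pages)
        return set(range(start, end + 1))

    def post(self, viewstate, target_page):
        # In this model the ViewState is just the number of the issuing page.
        if target_page not in self.linked_pages(viewstate):
            raise ValueError(
                f"page {target_page} is not reachable from page {viewstate}"
            )
        return {"page": target_page, "viewstate": target_page}


def crawl_from_one_response(site, response, total_pages):
    """Every request reuses the same response, as in my current loop."""
    fetched = []
    for page in range(2, total_pages + 1):
        try:
            fetched.append(site.post(response["viewstate"], page)["page"])
        except ValueError:
            pass  # rejected by the toy site; this page is never fetched
    return fetched


def crawl_chained(site, response, total_pages):
    """Each request is built from the response of the page before it."""
    fetched = []
    while response["page"] < total_pages:
        response = site.post(response["viewstate"], response["page"] + 1)
        fetched.append(response["page"])
    return fetched


site = FakeGridSite(total_pages=25)
first = {"page": 1, "viewstate": 1}
print(crawl_from_one_response(site, first, 25))  # stops at the pager window
print(crawl_chained(site, first, 25))            # reaches every page
```

In Scrapy terms, the chained version would presumably mean that `parse_results` itself yields the single `FormRequest` for the next page, built from the response it has just received and carrying the page counter in `meta`, instead of `parse_other_pages` yielding all requests from one response in a `while` loop. The cost is that the pages are fetched one after another rather than concurrently.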