Scraping ASP.NET pages that use VIEWSTATE with Scrapy

Time: 2019-11-17 13:56:29

Tags: python asp.net scrapy viewstate

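I followed the article SCRAPING WEBSITES BASED ON VIEWSTATES WITH SCRAPY to scrape an almost identical website. It works well, but my site has many items and therefore many pages. I can move to another page, but only if that page is listed in the pager of the page I am currently on. The pager shows at most 10 pages at a time, which means the ViewState of page 1 is only valid for the pages listed on page 1. When I request a later page (for example page 14) while still sending the ViewState of page 1, I get no data.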
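To make the failure concrete, here is a minimal sketch of the pager arithmetic. It assumes the pager lists pages in fixed blocks of 10 (1-10, 11-20, and so on); the real site may group them differently, and `pager_block` and `is_listed` are hypothetical helper names, not part of Scrapy or ASP.NET.

```python
def pager_block(current_page: int, total_pages: int, block_size: int = 10) -> range:
    """Page numbers listed in the pager while viewing `current_page`.

    Assumption: pages are grouped in fixed blocks of `block_size`.
    """
    start = ((current_page - 1) // block_size) * block_size + 1
    end = min(start + block_size - 1, total_pages)
    return range(start, end + 1)


def is_listed(from_page: int, target_page: int, total_pages: int,
              block_size: int = 10) -> bool:
    """True if `target_page` has a link in the pager rendered for `from_page`."""
    return target_page in pager_block(from_page, total_pages, block_size)


# Page 14 is not listed on page 1, so a postback for it that carries
# page 1's ViewState is the kind of request that comes back without data.
print(is_listed(1, 14, total_pages=25))   # False
print(is_listed(11, 14, total_pages=25))  # True
```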

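Here is the code. It first loads page 1, then uses it to go to the last page to determine the number of pages. Then I loop over every page. Inside the loop, the response being passed is the one from the last page, which is only valid for the last 10 pages listed in the last page's pager.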

def parse(self, response):
    # Fetch the first page from the site
    formdata = update_formdata(FORM_DATA, response)
    formdata["ctl00$Body$ButtonSubmit"] = "Submit"

    # Pass the formdata to mimic what a user does in the browser
    yield scrapy.FormRequest(
        response.url, formdata=formdata, callback=self.parse_first_page
    )

def parse_first_page(self, response):
    # get first page actual data
    yield from get_data(response)

    # Check if there is a last page
    last_page = response.css(
        "tr.pager table td:last-child a::text"
    ).get() or response.css(
        "tr.pager table td:last-child a font::text"
    ).get()

    if last_page is not None:
        last_page = last_page.strip().lower()

        if last_page == "last page":
            # Load the data for the last page
            formdata = update_formdata(FORM_DATA, response)
            formdata["__EVENTARGUMENT"] = "Page$Last"
            formdata["__EVENTTARGET"] = "ctl00$Body$GridView1"
            if formdata.get("ctl00$Body$ButtonSubmit", None) is not None:
                del formdata["ctl00$Body$ButtonSubmit"]
            yield scrapy.FormRequest(
                response.url,
                formdata=formdata,
                callback=self.parse_last_page,
            )

        elif last_page.isdigit():
            last_page_num = int(last_page)
            yield from self.parse_other_pages(
                last_page_num, response
            )
        else:
            self.logger.error("No last Page")

def parse_last_page(self, response):
    # Get last page actual data
    yield from get_data(response)

    # Get the last page number
    last_page_num = response.css(
        "tr.pager table td:last-child span::text"
    ).get()
    if last_page_num is not None:
        counter = int(last_page_num) - 1
        yield from self.parse_other_pages(counter, response)

def parse_other_pages(self, page_num, response): # last page response
    # get the number of pages and loop through all the pages

    while page_num >= 2: # uses last page response needs to change to current page response??
        formdata = update_formdata(FORM_DATA, response)
        formdata["__EVENTTARGET"] = "ctl00$Body$GridView1"
        if formdata.get("ctl00$Body$ButtonSubmit", None) is not None:
            del formdata["ctl00$Body$ButtonSubmit"]

        formdata.update(__EVENTARGUMENT="Page$" + str(page_num))
        page_num_cpy = page_num
        page_num -= 1
        yield scrapy.FormRequest(
            response.url,
            method="POST",
            formdata=formdata,
            callback=self.parse_results,
            dont_filter=True,
            headers=HEADERS,
            priority=1,
            meta={"page_num": page_num_cpy},
        )

def parse_results(self, response): #current page response
    # Get the actual data for all the other pages      
    yield from get_data(response)

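Edit: The question how to scrape a page request using Viewstate parameter? explains what I am already doing. My problem is not how to get the ViewState from a response and pass it to the next request; I can already do that. My problem is that I need to update the response inside the loop so that each request sends the ViewState of the previous page. Right now every request sends the last page's ViewState, which stops being valid once I am more than 10 pages away.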
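One way to get "each request carries the previous page's ViewState" is to chain the requests instead of issuing them all from one loop. The sketch below shows only that control flow, without Scrapy. `fetch` is a hypothetical stand-in for "POST the form and return the hidden fields of the response", `walk_pages` and `page_formdata` are illustrative names, and it assumes every page's pager exposes a link to the page that follows it.

```python
from typing import Callable, Dict, Iterator, Tuple

GRID_TARGET = "ctl00$Body$GridView1"       # taken from the question's code
SUBMIT_BUTTON = "ctl00$Body$ButtonSubmit"  # taken from the question's code

Fields = Dict[str, str]


def page_formdata(hidden_fields: Fields, page_num: int) -> Fields:
    """Build the postback form for `page_num` from the current page's hidden fields."""
    formdata = dict(hidden_fields)    # copy so the caller's dict is not mutated
    formdata.pop(SUBMIT_BUTTON, None)  # the submit button is only sent on the first POST
    formdata["__EVENTTARGET"] = GRID_TARGET
    formdata["__EVENTARGUMENT"] = f"Page${page_num}"
    return formdata


def walk_pages(first_fields: Fields, last_page: int,
               fetch: Callable[[Fields], Fields]) -> Iterator[Tuple[int, Fields]]:
    """Visit pages 2..last_page in order, one request at a time.

    Each request is built from the hidden fields of the page fetched just
    before it, so the ViewState never lags behind the requested page.
    """
    fields = first_fields
    for page_num in range(2, last_page + 1):
        fields = fetch(page_formdata(fields, page_num))  # hypothetical HTTP round trip
        yield page_num, fields
```

In the spider from the question, the equivalent change would presumably be to have `parse_results` yield the `scrapy.FormRequest` for the next page itself, built from its own `response` and with the page number passed along in `meta`, rather than yielding every request from the `while` loop in `parse_other_pages`. The trade-off is that the pages are then fetched strictly one after another, since each request depends on the previous response.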

The website I am trying to scrape is https://www.mevzuat.gov.tr/Kanunlar.aspx

0 answers:

No answers