How do I scrape a page that uses ASPX and ViewState?

Time: 2018-07-18 21:09:46

Tags: python asp.net python-3.x web-scraping phantomjs

I am trying to scrape people's addresses from this site: https://aca.accela.com/ALAMEDA/Cap/CapHome.aspx?module=Building&TabName=Building

Here is what part of the HTML looks like:

                    July 10, 2018    SPV18-0037

                    <input type="hidden" id="RecordId" value="18SPV-00000-00039">
                </div>
            </td><td class="ACA_AlignLeftOrRightTop" style="width:130px;">
                <div class="ACA_CapListStyle">
                    <span id="ctl00_PlaceHolderMain_dgvPermitList_gdvPermitList_ctl02_lblType">Solar Photovoltaic System Residential</span>
                </div>
            </td><td class="ACA_AlignLeftOrRightTop" style="width:130px;">
                <div class="ACA_CapListStyle">

                    <span id="ctl00_PlaceHolderMain_dgvPermitList_gdvPermitList_ctl02_lblDescription">INSTALL A 5.12 KW SOLAR SYSTEM, ROOFTOP, FLUSH MOUNT, 16 PANELS (BLDG)</span>
                </div>
            </td><td class="ACA_AlignLeftOrRightTop" style="width:130px;">
                <div class="ACA_CapListStyle">
                    <span id="ctl00_PlaceHolderMain_dgvPermitList_gdvPermitList_ctl02_lblAddress">1623 CLINTON AVE, ALAMEDA CA 94501</span>
                </div>
            </td><td class="ACA_AlignLeftOrRightTop" style="width:100px;">
                <div style="white-space: nowrap;" class="ACA_CapListStyle">
                    <div id="ctl00_PlaceHolderMain_dgvPermitList_gdvPermitList_ctl02_panelStatus">

                        <span id="ctl00_PlaceHolderMain_dgvPermitList_gdvPermitList_ctl02_lblStatus">Plan Review</span>

    </div>
                    <div id="ctl00_PlaceHolderMain_dgvPermitList_gdvPermitList_ctl02_panelbtnRenewalDetail">



    </div>
                </div>
            </td><td class="ACA_AlignLeftOrRightTop" style="width:100px;">
                <div style="white-space: nowrap;" class="ACA_CapListStyle">
                    <div id="ctl00_PlaceHolderMain_dgvPermitList_gdvPermitList_ctl02_Panel2">




    </div>
                    <div id="ctl00_PlaceHolderMain_dgvPermitList_gdvPermitList_ctl02_Panel3">



    </div>
                    <div id="ctl00_PlaceHolderMain_dgvPermitList_gdvPermitList_ctl02_Panel4">



    </div>
                    <div id="ctl00_PlaceHolderMain_dgvPermitList_gdvPermitList_ctl02_Panel5">



    </div>


                </div>
            </td><td class="ACA_AlignLeftOrRightTop" style="width:110px;">
                <div class="ACA_CapListStyle">
                    <span id="ctl00_PlaceHolderMain_dgvPermitList_gdvPermitList_ctl02_lblShortNote">INSTALL A 5.12 KW SOLAR SYSTEM, ROOFTOP, FLUSH MOU</span>
                </div>
            </td><td class="ACA_Hide">
                <div class="ACA_CapListStyle">
                    <span id="ctl00_PlaceHolderMain_dgvPermitList_gdvPermitList_ctl02_lblPermitAddress">1623 CLINTON AVE, ALAMEDA CA 94501</span>
                </div>
            </td>
</tr>

As you can see in the last span tag, the address is '1623 CLINTON AVE, ALAMEDA CA 94501'. There are many span tags like this one, and the results are paginated at the end.
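To make the target concrete, here is a minimal sketch of pulling the address text out of markup shaped like the snippet above. It assumes the address spans can be recognised by an id ending in `_lblPermitAddress`, as in the sample row. The names `SpanTextCollector` and `extract_addresses` are made up for illustration. It uses only the standard library; with BeautifulSoup the equivalent selection would be `soup.find_all('span', id=re.compile('_lblPermitAddress$'))`.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text of every <span> whose id ends with a given suffix."""

    def __init__(self, id_suffix):
        super().__init__()
        self.id_suffix = id_suffix
        self.values = []
        self._capturing = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        span_id = dict(attrs).get("id") or ""
        if tag == "span" and span_id.endswith(self.id_suffix):
            # Start buffering text until the matching </span>.
            self._capturing = True
            self._buffer = []

    def handle_data(self, data):
        if self._capturing:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._capturing:
            self.values.append("".join(self._buffer).strip())
            self._capturing = False


def extract_addresses(html_text, id_suffix="_lblPermitAddress"):
    """Return the text of all spans whose id ends with id_suffix."""
    parser = SpanTextCollector(id_suffix)
    parser.feed(html_text)
    parser.close()
    return parser.values


# Sample row copied from the HTML in the question.
sample = (
    '<td class="ACA_Hide"><div class="ACA_CapListStyle">'
    '<span id="ctl00_PlaceHolderMain_dgvPermitList_gdvPermitList_ctl02_lblPermitAddress">'
    '1623 CLINTON AVE, ALAMEDA CA 94501</span></div></td>'
)
print(extract_addresses(sample))
```

This only helps once the response actually contains the result rows. If the grid is filled in by a postback, the HTML returned by a plain GET will not include these spans at all.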

I am trying to get the text of the span inside each td column. Here is my Python code:

import bs4
from bs4 import BeautifulSoup
try:
    from urllib.request import urlopen
except ImportError:
    from urllib2 import urlopen

import os

from lxml import html

import requests
import csv
import sys



for page_no in range(2, 50):
    curr_page = str(page_no)
    if page_no < 9 :
        data = {
            'ctl00$ScriptManager1': 'ctl00$PlaceHolderMain$dgvPermitList$updatePanel|ctl00$PlaceHolderMain$dgvPermitList$gdvPermitList$ctl13$ctl0'+curr_page,
            'ctl00$PlaceHolderMain$generalSearchForm$ddlGSPermitType': 'Building/Solar Photovoltaic System/Residential/NA',
            'ctl00$PlaceHolderMain$generalSearchForm$txtGSStartDate': '01/01/2008',
            'ctl00$PlaceHolderMain$generalSearchForm$txtGSEndDate': '07/17/2018'
            }
    else :
        data = {
            'ctl00$ScriptManager1': 'ctl00$PlaceHolderMain$dgvPermitList$updatePanel|ctl00$PlaceHolderMain$dgvPermitList$gdvPermitList$ctl13$ctl1'+curr_page,
            'ctl00$PlaceHolderMain$generalSearchForm$ddlGSPermitType': 'Building/Solar Photovoltaic System/Residential/NA',
            'ctl00$PlaceHolderMain$generalSearchForm$txtGSStartDate': '01/01/2008',
            'ctl00$PlaceHolderMain$generalSearchForm$txtGSEndDate': '07/17/2018'

        }

    page = requests.get('https://aca.accela.com/ALAMEDA/Cap/CapHome.aspx?module=Building&TabName=Building', data = data)

    soup = BeautifulSoup(page.text, 'html.parser')


    print('Page ', page_no)
    #print(soup)

    address = soup.find_all('tbody')
    print(address)

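When I try to print this tag, or any other tag, I only get an empty list back. After a lot of research, the problem seems to be that the page is built with ASPX and a new ViewState is generated on every request. Can anyone guide me on how to include it in my code?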
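For reference, ASP.NET WebForms pages normally expect a form POST that echoes back the hidden fields from the previously served page (such as `__VIEWSTATE` and, where enabled, `__EVENTVALIDATION`), along with `__EVENTTARGET` naming the control that triggered the postback. The code above sends `requests.get(..., data=...)`, which is a GET and carries none of those fields. Below is a rough, untested sketch of the usual pattern. The helper names `hidden_fields`, `build_postback` and `fetch_postback` are hypothetical. Whether this particular site accepts a plain full postback, rather than the ScriptManager partial postback its pager uses, is an assumption.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "input" and (attr.get("type") or "").lower() == "hidden":
            name = attr.get("name")
            if name:  # inputs without a name are never submitted
                self.fields[name] = attr.get("value") or ""


def hidden_fields(html_text):
    """Return all named hidden inputs (e.g. __VIEWSTATE) as a dict."""
    parser = HiddenInputCollector()
    parser.feed(html_text)
    parser.close()
    return parser.fields


def build_postback(html_text, event_target, form_values=None):
    """Build POST data: hidden fields from the page plus the event target."""
    payload = hidden_fields(html_text)
    payload["__EVENTTARGET"] = event_target
    payload["__EVENTARGUMENT"] = ""
    payload.update(form_values or {})
    return payload


def fetch_postback(session, url, event_target, form_values=None):
    """GET the page to obtain fresh state, then POST it back.

    `session` is expected to behave like requests.Session(), so cookies
    set by the first request are sent with the second one.
    """
    first = session.get(url)
    payload = build_postback(first.text, event_target, form_values)
    return session.post(url, data=payload)
```

With `requests` installed, `session` would be a `requests.Session()`, `url` the CapHome.aspx address, `form_values` the search fields from the dictionary in the question, and `event_target` a pager control name such as the `...$gdvPermitList$ctl13$ctl02` value built there. To go from page to page, take the hidden fields from each response and feed them into the next POST, because the ViewState changes with every page served.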

0 answers:

No answers