Scraping data from an .ASPX website URL with Python

Time: 2020-07-10 02:17:10

Tags: python asp.net web-scraping beautifulsoup html-post

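I have a static .aspx URL that I am trying to scrape. All of my attempts return the raw HTML of the regular website rather than the data I am querying for.

My understanding is that the headers I am using (which I found in another post) are correct and generalizable: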

import urllib.parse  # needed for urllib.parse.urlencode further down
import urllib.request
from bs4 import BeautifulSoup

headers = {
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko)  Chrome/24.0.1312.57 Safari/537.17',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
}

class MyOpener(urllib.request.FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'

myopener = MyOpener()
url = 'https://www.mytaxcollector.com/trSearch.aspx'
# first HTTP request without form data
f = myopener.open(url)
soup_dummy = BeautifulSoup(f,"html5lib")
# parse and retrieve two vital form values
viewstate = soup_dummy.select("#__VIEWSTATE")[0]['value']
viewstategen = soup_dummy.select("#__VIEWSTATEGENERATOR")[0]['value']

Attempting to submit the form data has no effect:

formData = (
    ('__VIEWSTATE', viewstate),
    ('__VIEWSTATEGENERATOR', viewstategen),
    ('ctl00_contentHolder_trSearchCharactersAPN', '631091430000'),
    ('__EVENTTARGET', 'ct100$MainContent$calculate')
)

encodedFields =  urllib.parse.urlencode(formData)
# second HTTP request with form data
f = myopener.open(url, encodedFields)


soup = BeautifulSoup(f,"html5lib")
trans_emissions = soup.find("span", id="ctl00_MainContent_transEmissions")
print(trans_emissions.text)

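This returns raw HTML that is almost identical to the "soup_dummy" variable. What I want to see instead is the data for the submitted field ("ctl00_contentHolder_trSearchCharactersAPN", "631091430000"), which is the "Parcel Number" box.

I would greatly appreciate any help. If nothing else, a link to a good post on HTML requests (one that not only explains them but actually walks through scraping an .aspx page) would be great.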

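1 answer:

Answer 0: (score: 1)

To get results for a parcel number, your parameters have to differ from the ones you tried. In addition, you have to send the POST request to this URL: https://www.mytaxcollector.com/trSearchProcess.aspx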
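One way to find the URL and parameter names a form actually submits is to read the form's action attribute and the name attributes of its inputs. The following is only a minimal sketch using the standard library: the FormInspector class and the SAMPLE_HTML markup are hypothetical stand-ins, not the real page source, which you would fetch and inspect yourself.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormInspector(HTMLParser):
    """Collect the action URL and the named inputs of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action")
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Browsers submit the "name" attribute, not the "id".
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


# Hypothetical markup standing in for the real search page.
SAMPLE_HTML = """
<form method="post" action="trSearchProcess.aspx">
  <input type="hidden" name="hidRedirect" value="">
  <input type="text" name="txtParcelNumber" id="ctl00_contentHolder_txtParcelNumber">
  <input type="submit" name="ctl00$contentHolder$cmdSearch" value="Search">
</form>
"""

inspector = FormInspector()
inspector.feed(SAMPLE_HTML)

# Resolve the (possibly relative) action against the page that hosts the form.
post_url = urljoin("https://www.mytaxcollector.com/trSearch.aspx", inspector.action)
print(post_url)
print(inspector.fields)
```

The keys collected in inspector.fields are the names to use in the POST payload. In ASP.NET pages an element's id (written with underscores) often differs from its name (written with dollar signs), so posting the id may not match any field the server expects.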

Working code:

from urllib.request import Request, urlopen
from urllib.parse import urlencode
from bs4 import BeautifulSoup

url = 'https://www.mytaxcollector.com/trSearchProcess.aspx'

payload = {
    'hidRedirect': '',
    'hidGotoEstimate': '',
    'txtStreetNumber': '',
    'txtStreetName': '',
    'cboStreetTag': '(Any Street Tag)',
    'cboCommunity': '(Any City)',
    'txtParcelNumber': '0108301010000',  #your search term
    'txtPropertyID': '',
    'ctl00$contentHolder$cmdSearch': 'Search'
}

data = urlencode(payload)
data = data.encode('ascii')
req = Request(url,data)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36')
res = urlopen(req)
soup = BeautifulSoup(res.read(),'html.parser')
# Print the cell texts of each row in the results table
for row in soup.select("table.propInfoTable tr"):
    cells = [cell.get_text(strip=True) for cell in row.select("td")]
    print(cells)
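To check what is about to be sent without contacting the server, the request can be built and inspected offline. This is a small sketch, assuming a trimmed-down payload (only two of the fields above) purely for illustration; it shows that passing data to Request turns it into a POST and how the field names are encoded.

```python
from urllib.parse import parse_qs, urlencode
from urllib.request import Request

url = 'https://www.mytaxcollector.com/trSearchProcess.aspx'

# Trimmed-down payload for illustration; the real request uses all fields.
payload = {
    'txtParcelNumber': '0108301010000',
    'ctl00$contentHolder$cmdSearch': 'Search',
}

# urlencode returns str; urllib requires the request body as bytes.
body = urlencode(payload).encode('ascii')

get_req = Request(url)              # no body: sent as GET
post_req = Request(url, data=body)  # body present: sent as POST

print(get_req.get_method())
print(post_req.get_method())
print(body.decode('ascii'))

# Decoding the body recovers the original field names and values.
decoded = parse_qs(body.decode('ascii'))
print(decoded)
```

Nothing is transmitted until urlopen is called on the request, so this is a cheap way to compare your body against the form data shown in the browser's network tab.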