From an .aspx page

Time: 2017-08-16 20:02:12

Tags: python asp.net web-scraping

from bs4 import BeautifulSoup
from pprint import pprint
import requests

url = 'http://estadistico.ut.com.sv/OperacionDiaria.aspx'

s = requests.Session()

pagereq = s.get(url)
soup = BeautifulSoup(pagereq.content, 'lxml')

viewstategenerator = soup.find("input", attrs = {'id': '__VIEWSTATEGENERATOR'})['value']
viewstate = soup.find("input", attrs = {'id': '__VIEWSTATE'})['value']
eventvalidation = soup.find("input", attrs = {'id': '__EVENTVALIDATION'})['value']

eventtarget = 'ASPxDashboardViewer1'
DXCss = '1_33,1_4,1_9,1_5,15_2,15_4'
DXScript = '1_232,1_134,1_225,1_169,1_187,15_1,1_183,1_182,1_140,1_147,1_148,1_142,1_141,1_143,1_144,1_145,1_146,15_0,15_6,15_7'
eventargument = {"Task":"Export","ExportInfo":{"Mode":"SingleItem","GroupName":"pivotDashboardItem1","FileName":"Generación+por+tipo+de+tecnología+(MWh)","ClientState":{"clientSize":{"width":509,"height":385},"titleHeight":48,"itemsState":[{"name":"pivotDashboardItem1","headerHeight":34,"position":{"left":11,"top":146},"width":227,"height":108,"virtualSize":'null',"scroll":{"horizontal":'true',"vertical":'true'}}]},"Format":"Excel","DocumentOptions":{"paperKind":"Letter","pageLayout":"Portrait","scaleMode":"AutoFitWithinOnePage","scaleFactor":1,"autoFitPageCount":1,"showTitle":'true',"title":"Operación+Diaria","imageFormatOptions":{"format":"Png","resolution":96},"excelFormatOptions":{"format":"Csv","csvValueSeparator":","},"commonOptions":{"filterStatePresentation":"None","includeCaption":'true',"caption":"Generación+por+tipo+de+tecnología+(MWh)"},"pivotOptions":{"printHeadersOnEveryPage":'true'},"gridOptions":{"fitToPageWidth":'true',"printHeadersOnEveryPage":'true'},"chartOptions":{"automaticPageLayout":'true',"sizeMode":"Zoom"},"pieOptions":{"autoArrangeContent":'true'},"gaugeOptions":{"autoArrangeContent":'true'},"cardOptions":{"autoArrangeContent":'true'},"mapOptions":{"automaticPageLayout":'true',"sizeMode":"Zoom"},"rangeFilterOptions":{"automaticPageLayout":'true',"sizeMode":"Stretch"},"imageOptions":{},"fileName":"Generación+por+tipo+de+tecnología+(MWh)"},"ItemType":"PIVOT"},"Context":"BwAHAAIkY2NkNWRiYzItYzIwNS00MDIyLTkzZjUtYWQ0NzVhYTM5Y2E3Ag9PcGVyYWNpb25EaWFyaWECAAIAAAAAAMByQA==","RequestMarker":1,"ClientState":{}}

postdata = {'__EVENTTARGET': eventtarget,
            '__EVENTARGUMENT': eventargument,
            '__VIEWSTATE': viewstate,
            '__VIEWSTATEGENERATOR': viewstategenerator,
            '__EVENTVALIDATION': eventvalidation,
            'DXScript': DXScript,
            'DXCss': DXCss
           }

datareq = s.post(url, data = postdata)

print(datareq.text)

I'm trying to scrape data from this .aspx web page. The page loads its data dynamically via JavaScript, so scraping it directly with requests / BeautifulSoup doesn't work.

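Looking at the network traffic, I can see that when you click an element's export (Exportar a) button, choose the export type (Excel, CSV) and confirm, a POST request is made to the page. It returns a base64-encoded string of the data I need. As far as I can tell, there is no way to make a GET request for the file directly, because it is only generated on request.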
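Once the POST succeeds, the base64 string still has to be turned back into rows. A minimal sketch of that step, assuming the payload is plain base64 of UTF-8 CSV text; the helper name `decode_csv_payload` and the sample data are made up for illustration, and the real response may wrap the string in extra markup that would need stripping first:

```python
import base64
import csv
import io


def decode_csv_payload(payload_b64, encoding="utf-8"):
    """Decode a base64 string and parse the result as CSV rows."""
    raw = base64.b64decode(payload_b64)
    text = raw.decode(encoding)
    return list(csv.reader(io.StringIO(text)))


# Made-up sample standing in for the server's base64 response.
sample = base64.b64encode(
    "Tecnología,MWh\nHidro,120.5\n".encode("utf-8")
).decode("ascii")

rows = decode_csv_payload(sample)
print(rows)
```

If the server uses a different text encoding, pass it via the `encoding` argument.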

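What I want to do is replicate the POST request that triggers the CSV response. So first I scrape __VIEWSTATE, __VIEWSTATEGENERATOR and __EVENTVALIDATION. __EVENTTARGET, DXCss and DXScript appear to be fixed. __EVENTARGUMENT is copied directly from the POST request.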
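The hidden-field scraping step can be sketched without hard-coding each field name. This version uses the standard library's `html.parser` rather than BeautifulSoup, purely so the example has no third-party dependencies; the `HiddenFieldParser` class and the sample HTML (including its field values) are illustrative assumptions, not taken from the real page:

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from every <input type="hidden"> tag."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


# Made-up HTML standing in for the form on the real page.
sample_html = """
<form>
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="CA0B0334" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="q" value="ignored" />
</form>
"""

parser = HiddenFieldParser()
parser.feed(sample_html)
hidden_fields = parser.fields
print(hidden_fields)
```

Collecting every hidden input this way means any extra state field the page adds is carried into the POST automatically, which may or may not matter for this particular page.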

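My code returns a server application error. I think the problem is either a) a wrong __EVENTARGUMENT (maybe it is partly dynamic rather than fixed?), b) my not really understanding how .aspx pages work, or c) that what I'm trying to do isn't possible with these tools.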
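On option a), one thing that may be worth checking is how __EVENTARGUMENT is serialized. In the code above it is a Python dict with 'null' and 'true' written as strings, whereas the browser presumably sends a single JSON string. The sketch below uses a shortened, hypothetical payload and the standard library's `urlencode` to show the difference; the assumption is that requests form-encodes a dict value in a similar way (iterating it, so only the keys are sent). This is not confirmed to be the cause of the server error.

```python
import json
from urllib.parse import urlencode

# Shortened, hypothetical stand-in for the real __EVENTARGUMENT payload.
# None and True are used so that json.dumps emits JSON null and true.
event_argument = {
    "Task": "Export",
    "ExportInfo": {"virtualSize": None, "showTitle": True},
    "RequestMarker": 1,
}

# Form-encoding the dict directly iterates it, so only its keys are sent.
naive = urlencode({"__EVENTARGUMENT": event_argument}, doseq=True)

# Serializing to a compact JSON string first keeps the whole structure.
encoded = json.dumps(event_argument, separators=(",", ":"))
serialized = urlencode({"__EVENTARGUMENT": encoded})

print(naive)
print(encoded)
```

With the JSON string in hand, `'__EVENTARGUMENT': encoded` would replace the raw dict in the form data.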

I did look at using selenium to trigger the data export, but I couldn't see a way to capture the server's response.

1 answer:

Answer 0: (score: 0)

I was able to get help from someone who knows more about aspx pages than I do.

Link to the GitHub gist that provides the solution:

https://gist.github.com/jarek/d73c672d8dd4ddb48d80bffc4d8038ba