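Web scraping with Power Query

Time: 2017-04-30 07:38:48

Tags: excel powerquery

I am trying to use Power Query in Excel 2016 to get data from a website, but I can't get it to work. The server returns an error page. When I pass the same cookie to the same ASP page from Chrome or Postman (a Chrome app), I do get the page I want.

Code: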

    let
    Source = Web.Page(Web.Contents("http://portal.icuregswe.org/utdata/_render.aspx", [Headers=[Cookie="__utmt=1; ASP.NET_SessionId=wr4drsm5nqctyk55qcecgiap; __utma=223509914.878319927.1493184252.1493492055.1493534562.4; __utmb=223509914.3.10.1493534562; __utmc=223509914; __utmz=223509914.1493534562.4.4.utmcsr=icuregswe.org|utmccn=(referral)|utmcmd=referral|utmcct=/sv/Utdata/Utdataportal-Ny/; __utma=187689776.292092926.1493485249.1493492045.1493534550.3; __utmb=187689776.3.10.1493534550; __utmc=187689776; __utmz=187689776.1493485249.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); ASP.NET_SessionId=wr4drsm5nqctyk55qcecgiap; __utma=223509914.878319927.1493184252.1493492055.1493534562.4; __utmb=223509914.3.10.1493534562; __utmc=223509914; __utmz=223509914.1493534562.4.4.utmcsr=icuregswe.org|utmccn=(referral)|utmcmd=referral|utmcct=/sv/Utdata/Utdataportal-Ny/"]])),
    Data0 = Source{0}[Data]
in
    Data0

DOM structure:

[Image: DOM structure]

Error message:

System.NullReferenceException: Object reference not set to an instance of an object. 
at _render.Page_Load(Object sender, EventArgs e) 
at System.Web.Util.CalliHelper.EventArgFunctionCaller(IntPtr fp, Object o, Object t, EventArgs e) 
at System.Web.Util.CalliEventHandlerDelegateProxy.Callback(Object sender, EventArgs e) 
at System.Web.UI.Control.OnLoad(EventArgs e) 
at System.Web.UI.Control.LoadRecursive() 
at System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint) 

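I suspect the server is missing some input it needs to generate the page. (Chrome's developer tools show several calls to the server, and I'm not sure how that part works.)

The main page is here: http://portal.icuregswe.org/utdata/

Reports are reached through the menu, for example: Rapporter -> Produktion -> Vårdtid -> Vårddygnsumma

Any ideas?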

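1 answer:

Answer 0 (score: 0)

Edit: I previously thought I had figured it out, but once I tried to fetch a different report from the same site I realized it wasn't working properly. I came up with this solution, which combines a VB script and Power Query:

To use it, first get a cookie value containing the session ID from the website: choose "Urval", select a time period, and select a report. The site returns a cookie with the session ID. Copy that value into cell B4 (the cell must be named cookievalue). Once that is set up, click the update button, which runs the VB script below. The script calls the website to set the report type for the current session ID and then refreshes the Power Query that fetches the CSV from the site.

The worksheet is named parameters.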

[Image: Excel screenshot]

The Power Query that fetches the CSV from the website:

    let
        cookiestr = Excel.CurrentWorkbook(){[Name="cookievalue"]}[Content]{0}[Column1],
        Source = Excel.Workbook(Web.Contents("http://portal.icuregswe.org/utdata/ExcelExport.aspx", [Headers=[Cookie=cookiestr]]), null, true),
        #"SIR-rapport_Sheet" = Source{[Item="SIR-rapport",Kind="Sheet"]}[Data]
    in
        #"SIR-rapport_Sheet"
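
A cookie string copied straight from the browser usually also carries analytics cookies (the `__utm*` values in the question) and can contain duplicate entries. The following is only a minimal sketch, assuming the server needs nothing but the `ASP.NET_SessionId` pair, which the answer above does not confirm. The `rawCookie` literal is a stand-in; in the workbook it would come from the `cookievalue` named cell instead.

```
let
    // Stand-in for the value pasted into the cell named cookievalue
    rawCookie = "__utmc=223509914; ASP.NET_SessionId=wr4drsm5nqctyk55qcecgiap; __utmc=187689776; ASP.NET_SessionId=wr4drsm5nqctyk55qcecgiap",
    // Split the header into individual name=value pairs and trim whitespace
    pairs = List.Transform(Text.Split(rawCookie, ";"), Text.Trim),
    // Keep only the session cookie (assumption: the other cookies are not needed)
    sessionPairs = List.Select(pairs, each Text.StartsWith(_, "ASP.NET_SessionId=")),
    // Drop repeated entries and rebuild a Cookie header value
    cookiestr = Text.Combine(List.Distinct(sessionPairs), "; ")
in
    cookiestr
```

If the report still fails with the trimmed value, the assumption is wrong for this site and the full cookie string should be passed unchanged.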

The VB script that calls the website, sets the report type, and refreshes the Power Query:

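    ' Requires a reference to "Microsoft WinHTTP Services, version 5.1"
    ' (VBA editor: Tools > References).
    Sub Button1_Click()

        ' Read the base URL, the report parameter and the session cookie
        ' from the "parameters" worksheet.
        Dim URL As String
        URL = Sheets("parameters").Range("B2")
        Dim param As String
        param = Sheets("parameters").Range("B3")
        Dim cookie As String
        cookie = Sheets("parameters").Range("B4")

        ' Call the site so that it sets the report type for this session.
        Dim w As New WinHttp.WinHttpRequest
        w.Open "POST", URL & param, False
        w.setRequestHeader "Cookie", cookie
        ' The report is selected through the URL parameters, so no body is sent.
        w.send

        ' Macro to update Power Query script(s)
        Dim lTest As Long, cn As WorkbookConnection
        On Error Resume Next
        For Each cn In ThisWorkbook.Connections
            ' Power Query connections use the Microsoft.Mashup OLE DB provider.
            lTest = InStr(1, cn.OLEDBConnection.Connection, "Provider=Microsoft.Mashup.OleDb.1", vbTextCompare)
            If Err.Number <> 0 Then
                ' Not an OLE DB connection: clear the error and stop scanning.
                Err.Clear
                Exit For
            End If
            If lTest > 0 Then cn.Refresh
        Next cn
    End Sub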
