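Screen scraping an ASP.NET page isn't working

Time: 2011-08-04 19:12:25

Tags: c# asp.net httpwebrequest screen-scraping

I'm trying to retrieve the calendar events from the page on the following site: http://www.wphospital.org/News-Events/Calendar-of-Events.aspx

Note that this site has a link named "Month". I need to be able to POST data to request the calendar events for a specific month, and I can't get this to work. Here is the code: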

private static void GetData(ref string buf)
{
    try
    {
        //First, request the search form to get the viewstate value 
        HttpWebRequest webRequest = default(HttpWebRequest);
        webRequest = (HttpWebRequest)System.Net.WebRequest.Create("http://www.wphospital.org/News-Events/Calendar-of-Events.aspx");
        StreamReader responseReader = new StreamReader(webRequest.GetResponse().GetResponseStream());
        string responseData = responseReader.ReadToEnd();
        responseReader.Close();

        //Extract the viewstate value and build out POST data 
        string viewState = ExtractViewState(responseData);
        string eventValidation = ExtractEventValidation(responseData);
        string postData = null;

        postData = String.Format("ctl00$manScript={0}&__EVENTTARGET=&__EVENTARGUMENT&__LASTFOCUS=&__VIEWSTATE={1}&lng={2}&__EVENTVALIDATION={3}&ctl00$searchbox1$txtWord={4}&textfield2={5}&ctl00$plcMain$lstbxCategory={6}&ctl00$plcMain$lstbxSubCategory={7}", "ctl00$plcMain$updMonthNav|ctl00$plcMain$btnNextMonth", viewState, "en-US", eventValidation, "Search", "your search here", 0, 0);

        var encoding = new ASCIIEncoding();
        byte[] data = encoding.GetBytes(postData);

        //Now post to the search form 
        webRequest = (HttpWebRequest)System.Net.WebRequest.Create("http://www.wphospital.org/News-Events/Calendar-of-Events.aspx");
        webRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)";
        webRequest.Method = "POST";
        webRequest.ContentType = "application/x-www-form-urlencoded";
        webRequest.ContentLength = data.Length;

        var newStream = webRequest.GetRequestStream();
        newStream.Write(data, 0, data.Length);
        newStream.Close();

        responseReader = new StreamReader(webRequest.GetResponse().GetResponseStream());

        //And read the response 
        responseData = responseReader.ReadToEnd();
        responseReader.Close();
        buf = responseData;
    }
    catch (WebException ex)
    {
        if (ex.Status == WebExceptionStatus.ProtocolError)
        {
            Console.Write("The server returned protocol error ");
            // Get HttpWebResponse so that you can check the HTTP status code.
            HttpWebResponse httpResponse = (HttpWebResponse)ex.Response;
            int sc = (int)httpResponse.StatusCode;
            string strsc = httpResponse.StatusCode.ToString();
        }
    }
}

private static string ExtractViewState(string s)
{
    string viewStateNameDelimiter = "__VIEWSTATE";
    string valueDelimiter = "value=\"";

    int viewStateNamePosition = s.IndexOf(viewStateNameDelimiter);
    int viewStateValuePosition = s.IndexOf(valueDelimiter, viewStateNamePosition);

    int viewStateStartPosition = viewStateValuePosition + valueDelimiter.Length;
    int viewStateEndPosition = s.IndexOf("\"", viewStateStartPosition);

    return HttpUtility.UrlEncodeUnicode(s.Substring(viewStateStartPosition, viewStateEndPosition - viewStateStartPosition));
}

Can anyone point me in the right direction?

3 Answers:

Answer 0: (score: 1)

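This may or may not solve your problem, because when you say it doesn't work I don't know exactly what the problem is. But as "Al W" pointed out, the response from an async postback does not look like a straight HTML stream. So if your problem is in parsing the response afterwards, this may help.

I recently had occasion to figure this out because I needed to rewrite that output. I was working on a C# jQuery port and found that I was breaking WebForms pages when I tried to re-render the output stream during an async postback. I went through the client script that parses the response and worked out the format of the response.

Each updated panel returns a block of data in the format:

"Length|Type|ID|Content"

There may be many of these strung together. Type is "updatePanel" for UpdatePanels. ID is the UniqueID of the control, and Content is the actual HTML data. Length is equal to the number of bytes in Content, and you need to use it to parse each block, because the delimiter may appear inside Content itself. So if you decide to rewrite this data before it is sent back to the ASP.NET page (as I did), you need to update Length to reflect the final length of the content.

The code I used to parse and rewrite it is in Server/CsQueryHttpContext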
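To illustrate why the length prefix matters, here is a minimal sketch of a parser for the block format described above. It is not the CsQuery code: `DeltaBlock` and `DeltaParser` are hypothetical names. It assumes the response has already been decoded into a string, that Length can be applied as a character count (the same as a byte count for ASCII content), and that each block may be followed by a trailing `|`.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical container for one parsed block.
public sealed class DeltaBlock
{
    public string Type { get; set; }
    public string Id { get; set; }
    public string Content { get; set; }
}

public static class DeltaParser
{
    // Parses blocks of the form "Length|Type|ID|Content" strung together.
    // Content is sliced by Length, never by searching for the delimiter,
    // because '|' may legitimately appear inside Content.
    public static List<DeltaBlock> Parse(string response)
    {
        var blocks = new List<DeltaBlock>();
        int pos = 0;
        while (pos < response.Length)
        {
            int lenEnd = response.IndexOf('|', pos);
            if (lenEnd < 0)
                throw new FormatException("Missing length delimiter at offset " + pos);

            int length;
            if (!int.TryParse(response.Substring(pos, lenEnd - pos),
                              NumberStyles.None, CultureInfo.InvariantCulture, out length))
                throw new FormatException("Invalid length at offset " + pos);

            int typeEnd = response.IndexOf('|', lenEnd + 1);
            int idEnd = typeEnd < 0 ? -1 : response.IndexOf('|', typeEnd + 1);
            if (idEnd < 0)
                throw new FormatException("Incomplete block header at offset " + pos);

            int contentStart = idEnd + 1;
            if (contentStart + length > response.Length)
                throw new FormatException("Length exceeds remaining data at offset " + pos);

            blocks.Add(new DeltaBlock
            {
                Type = response.Substring(lenEnd + 1, typeEnd - lenEnd - 1),
                Id = response.Substring(typeEnd + 1, idEnd - typeEnd - 1),
                Content = response.Substring(contentStart, length)
            });

            pos = contentStart + length;
            // Assumption: a single '|' may terminate each block; skip it if present.
            if (pos < response.Length && response[pos] == '|')
                pos++;
        }
        return blocks;
    }
}
```

Calling `DeltaParser.Parse(responseData)` on the body of an async postback would then give one entry per block, and the calendar HTML would be the Content of the entry whose Type is "updatePanel".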

Answer 1: (score: 0)

For a POST you want the body to be UTF-8 encoded, so just change that one line:

        //var encoding = new ASCIIEncoding();
        //byte[] data = encoding.GetBytes(postData);
        //do this instead.....
        byte[] data = Encoding.UTF8.GetBytes(postData);

看看这是否有助于你
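Going one step further, the body can be assembled from name/value pairs so that every name and value is URL-encoded before the UTF-8 conversion. This is only a sketch under assumptions: `FormBody` is a hypothetical helper, and `WebUtility.UrlEncode` requires .NET 4.5 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;

public static class FormBody
{
    // Joins the fields as name=value pairs separated by '&',
    // URL-encoding each name and value individually.
    public static string BuildString(IEnumerable<KeyValuePair<string, string>> fields)
    {
        var sb = new StringBuilder();
        foreach (var field in fields)
        {
            if (sb.Length > 0)
                sb.Append('&');
            sb.Append(WebUtility.UrlEncode(field.Key));
            sb.Append('=');
            // A null value is sent as an empty value ("name=").
            sb.Append(WebUtility.UrlEncode(field.Value ?? string.Empty));
        }
        return sb.ToString();
    }

    // Returns the UTF-8 bytes to write to the request stream.
    public static byte[] BuildBytes(IEnumerable<KeyValuePair<string, string>> fields)
    {
        return Encoding.UTF8.GetBytes(BuildString(fields));
    }
}
```

Building the string this way always emits the `=` after each name, including for empty fields such as `__EVENTARGUMENT`. Pass the raw hidden-field values to it: the question's `ExtractViewState` already URL-encodes its return value, so running that result through this helper would encode it twice.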

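Answer 2: (score: 0)

Below is the network trace I get in Chrome when I click the Month button. Note __EVENTTARGET:ctl00$plcMain$monthBtn. ASP.NET has a JavaScript framework that, when the link is clicked, calls a javascript:postback() method, which sets the event target. This is basically how ASP.NET WebForms knows which event to fire on postback. One tricky thing here is that the page uses update panels, so you may not get a true HTML response. If you can produce a request like this one, you should get a successful response. Hope this helps.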

Request URL:http://www.wphospital.org/News-Events/Calendar-of-Events.aspx
Request Method:POST
Status Code:200 OK
Request Headers
Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en-US,en;q=0.8
Cache-Control:no-cache
Content-Length:9718
Content-Type:application/x-www-form-urlencoded
Cookie:CMSPreferredCulture=en-US; ASP.NET_SessionId=h2nval45vq0q5yb0cp233huc; __utma=101137351.234148951.1312486481.1312486481.1312486481.1; __utmb=101137351.1.10.1312486481; __utmc=101137351; __utmz=101137351.1312486481.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __unam=ef169fe-131964a5f2a-24ec879b-1
Host:www.wphospital.org
Origin:http://www.wphospital.org
Proxy-Connection:keep-alive
Referer:http://www.wphospital.org/News-Events/Calendar-of-Events.aspx
User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML, like Gecko) Chrome/12.0.742.124 Safari/534.30
X-MicrosoftAjax:Delta=true
Form Data
ctl00$manScript:ctl00$plcMain$updTab|ctl00$plcMain$monthBtn
__EVENTTARGET:ctl00$plcMain$monthBtn
__EVENTARGUMENT:
__LASTFOCUS:
__VIEWSTATE:<removed for brevity>
lng:en-US
__EVENTVALIDATION:/wEWEgLbj/nSDgKt983zDgKWlOLbAQKr3LqFAwKL3uqpBwK9kfRnArDHltMCAuTk0eAHAsfniK0DAteIosMPAsiIosMPAsmIosMPAsuIosMPAoD0ookDApCbiOcPAo biOcPAombiOcPAoubiOcPyfqRx8FdqYzlnnkXcJEJZzzopJY=
ctl00$searchbox1$txtWord:Search
textfield2:Enter your search here
ctl00$plcMain$lstbxCategory:0
ctl00$plcMain$lstbxSubCategory:0
ctl00$plcMain$hdnEventCount:2
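
A request resembling this trace could be set up roughly as follows. This is a sketch rather than tested scraping code: `MonthPostback` is a hypothetical helper, the field names and values are copied from the trace above and may differ on the live page, and reusing one `CookieContainer` for the initial GET and the POST is an assumption based on the session cookie visible in the trace.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

public static class MonthPostback
{
    // Form fields mirroring the trace. viewState and eventValidation are the
    // raw (not yet URL-encoded) hidden-field values from the initial GET.
    public static List<KeyValuePair<string, string>> BuildFields(string viewState, string eventValidation)
    {
        const string target = "ctl00$plcMain$monthBtn";
        return new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("ctl00$manScript", "ctl00$plcMain$updTab|" + target),
            new KeyValuePair<string, string>("__EVENTTARGET", target),
            new KeyValuePair<string, string>("__EVENTARGUMENT", ""),
            new KeyValuePair<string, string>("__LASTFOCUS", ""),
            new KeyValuePair<string, string>("__VIEWSTATE", viewState),
            new KeyValuePair<string, string>("lng", "en-US"),
            new KeyValuePair<string, string>("__EVENTVALIDATION", eventValidation),
            new KeyValuePair<string, string>("ctl00$searchbox1$txtWord", "Search"),
            new KeyValuePair<string, string>("textfield2", "Enter your search here"),
            new KeyValuePair<string, string>("ctl00$plcMain$lstbxCategory", "0"),
            new KeyValuePair<string, string>("ctl00$plcMain$lstbxSubCategory", "0"),
            new KeyValuePair<string, string>("ctl00$plcMain$hdnEventCount", "2")
        };
    }

    // Creates the POST request with the headers seen in the trace.
    // Nothing is sent until the caller writes the body and calls GetResponse().
    public static HttpWebRequest CreateRequest(string url, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.Referer = url;
        request.CookieContainer = cookies; // same container as the initial GET
        request.Headers["X-MicrosoftAjax"] = "Delta=true"; // marks an async postback
        return request;
    }
}
```

The caller still has to URL-encode the fields into a body, set `ContentLength`, and write the bytes to `GetRequestStream()`. With the `X-MicrosoftAjax` header set, the response would be expected in the delimited update panel format rather than as a full HTML page.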