C# screen scrape problem

Date: 2014-10-31 16:53:06

Tags: c# webclient screen-scraping

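I'm trying to write a screen scraper for a certification verification website (here). After posting to the form, I get an "An error has occurred" message. I've written a few of these before and can usually track down the problem. This one is harder. Any ideas?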

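Things I've tried:

  • Adding/removing HttpUtility.UrlEncode on the name/value pairs. It doesn't seem to matter.
  • Using HttpWebRequest. I've used it successfully in the past, but it does basically the same thing.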
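One thing worth checking on the first bullet: WebClient.UploadValues URL-encodes the names and values itself, so wrapping them in HttpUtility.UrlEncode first encodes them twice. The sketch below only simulates that effect by calling HttpUtility.UrlEncode twice; the class name, method name and the sample field name are made up for illustration. Whether this is what triggers the error page on this particular site is an assumption, not something confirmed here.

```csharp
using System;
using System.Web;

static class EncodingDemo
{
    // Simulates a value that is encoded once by the caller and then
    // encoded again by the upload call (hypothetical helper).
    public static string EncodeTwice(string raw)
    {
        string once = HttpUtility.UrlEncode(raw);
        return HttpUtility.UrlEncode(once);
    }

    // What the server sees after it decodes the request body once.
    public static string AsSeenByServer(string raw)
    {
        return HttpUtility.UrlDecode(EncodeTwice(raw));
    }
}
```

For a field name containing "$", such as the control names in the form, AsSeenByServer returns the name with the "$" still percent-encoded, so the server would not recognize it as the original field. Passing the raw strings to UploadValues avoids that.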

Note that I tried to simplify the code by inlining the URL strings. I hope I haven't introduced any syntax errors.

I'm using a custom version of WebClient. The class is at the bottom of the code.

var client = new CookieAwareWebClient();

client.Headers.Add("user-agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.101 Safari/537.36");

// Load the initial page to get the cookie and viewstate
byte[] response = client.DownloadData("http://www.asha.org/eweb/ashadynamicpage.aspx?site=ashacms&webcode=ccchome");
string postResult = System.Text.Encoding.UTF8.GetString(response);

// Find the __VIEWSTATE hidden input and take the contents of its value="..." attribute
int start = postResult.IndexOf("__VIEWSTATE");
start = postResult.IndexOf("value=\"", start) + 7;
String viewstate = postResult.Substring(start, postResult.IndexOf("\"", start) - start).Trim();

// post the viewstate and license to the form
client.Headers.Add("Content-Type", "application/x-www-form-urlencoded");

response = client.UploadValues("http://www.asha.org/eweb/ashadynamicpage.aspx?site=ashacms&webcode=ccchome", "POST", new NameValueCollection()
   {
      { "__APPLICATIONPATH", HttpUtility.UrlEncode("/eWeb") },
      { "__EVENTTARGET", "" },
      { "__EVENTARGUMENT", "" },
      { "__LASTFOCUS", "" },
      { "__VIEWSTATE", HttpUtility.UrlEncode(viewstate) },
      { "__VIEWSTATEGENERATOR", HttpUtility.UrlEncode("28D90D6C") },
      { HttpUtility.UrlEncode("ab88a3d7_f50d_49d8_9557_30ffe94f63e8$txtAccountNumber"), HttpUtility.UrlEncode("12156995") },
      { HttpUtility.UrlEncode("ab88a3d7_f50d_49d8_9557_30ffe94f63e8$btnSearchByAccountNumber"), HttpUtility.UrlEncode("Submit") },
      { HttpUtility.UrlEncode("ab88a3d7_f50d_49d8_9557_30ffe94f63e8$txtLastName"), "" },
      { HttpUtility.UrlEncode("ab88a3d7_f50d_49d8_9557_30ffe94f63e8$txtFirstName"), "" },
      { HttpUtility.UrlEncode("ab88a3d7_f50d_49d8_9557_30ffe94f63e8$ddlCountry"), HttpUtility.UrlEncode("c7382b6c-7ada-4276-bc5c-e67f488981aa") },
      { HttpUtility.UrlEncode("ab88a3d7_f50d_49d8_9557_30ffe94f63e8$ddlStates"), HttpUtility.UrlEncode("AL") },
      { HttpUtility.UrlEncode("ab88a3d7_f50d_49d8_9557_30ffe94f63e8$hiddenStateName"), "" },
      { HttpUtility.UrlEncode("ab88a3d7_f50d_49d8_9557_30ffe94f63e8$hiddenStateCode"), "" }
  });

// Note: postResult will contain the error page
postResult = System.Text.Encoding.UTF8.GetString(response);
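The code above hard-codes __VIEWSTATEGENERATOR and does not send an __EVENTVALIDATION field. If the page emits those hidden inputs, reading them from the initial response the same way as __VIEWSTATE may be worth a try. Below is a minimal sketch of such a helper; the class and method names are hypothetical, and it assumes the value attribute follows the name attribute inside the same input tag, which is how ASP.NET normally renders hidden fields.

```csharp
using System;
using System.Net;

static class HiddenFields
{
    // Returns the value of the first input whose name attribute matches,
    // or null when the field (or its value attribute) is not found.
    public static string Extract(string html, string fieldName)
    {
        int nameAt = html.IndexOf("name=\"" + fieldName + "\"", StringComparison.Ordinal);
        if (nameAt < 0) return null;

        // Limit the search to the rest of the same tag.
        int tagEnd = html.IndexOf('>', nameAt);
        if (tagEnd < 0) tagEnd = html.Length;

        const string marker = "value=\"";
        int valueAt = html.IndexOf(marker, nameAt, tagEnd - nameAt, StringComparison.Ordinal);
        if (valueAt < 0) return null;

        int start = valueAt + marker.Length;
        int end = html.IndexOf('"', start);
        if (end < 0) return null;

        // Attribute values are HTML-encoded in the page source.
        return WebUtility.HtmlDecode(html.Substring(start, end - start));
    }
}
```

Usage would look like HiddenFields.Extract(postResult, "__VIEWSTATE"), with the same call for "__VIEWSTATEGENERATOR" and "__EVENTVALIDATION"; a null result means the page did not contain that field, so it can be left out of the post.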

Here is the custom web client. It handles cookies. I think I got the code from here.

class CookieAwareWebClient : WebClient
{
    public CookieAwareWebClient()
    {
        CookieContainer = new CookieContainer();
    }
    public CookieContainer CookieContainer { get; private set; }

    protected override WebRequest GetWebRequest(Uri address)
    {
        var request = (HttpWebRequest)base.GetWebRequest(address);
        request.CookieContainer = CookieContainer;
        return request;
    }
}
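Since the class exposes its CookieContainer, one way to rule out a cookie problem is to look at what the container holds for the site after the initial GET and before the POST. The helper below is a small debugging sketch; the class and method names are made up for illustration, and it only reports what the container would send to a given address.

```csharp
using System;
using System.Linq;
using System.Net;

static class CookieDebug
{
    // Lists the name=value pairs the container would send to the given address.
    public static string Describe(CookieContainer jar, Uri address)
    {
        var cookies = jar.GetCookies(address).Cast<Cookie>();
        return string.Join("; ", cookies.Select(c => c.Name + "=" + c.Value));
    }
}
```

Calling CookieDebug.Describe(client.CookieContainer, new Uri("http://www.asha.org/eweb/ashadynamicpage.aspx")) after DownloadData should show a session cookie if the site issued one. An empty string would suggest the POST is going out without the session the viewstate was generated for.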

0 answers:

No answers