网站解析 - webbrowser或httpwebresponse

时间:2013-10-23 12:12:04

标签: c# html-parsing web-scraping webbrowser-control httpwebresponse

当我尝试从我的银行网站解析一些数据时遇到了一些困难。基本上,我想自动导出我的交易历史记录,但网上银行没有任何自动功能。 我目前正在尝试如何模拟填写表单和点击以进入下载页面并获取可用于解析的CSV文件。

我尝试了不同的方法但没有成功,请指引我走正确的道路。

 public static void getNABLogin()
    {
        try
        {
            Console.WriteLine("ENTER to begin");
            //Console.ReadLine();
            System.Net.HttpWebRequest wr = (System.Net.HttpWebRequest)System.Net.WebRequest.Create("https://ib.nab.com.au/nabib/index.jsp");
            wr.Timeout = 1000;
            wr.Method = "GET";
            wr.UserAgent = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36";
            wr.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
            wr.Headers.Add("Accept-Language", "en-GB,en-US;q=0.8,en;q=0.6");
            wr.Headers.Add("Accept-Encoding", "gzip,deflate,sdch");
            //wr.Connection = "Keep-Alive";
            wr.Host = "ib.nab.com.au";
            wr.KeepAlive = true;

            wr.CookieContainer = new CookieContainer();

            //////////This part will get me to the correct login page at least////////////////////
            // System.IO.Stream objStreamReceive ;
            // System.Text.Encoding objEncoding;
            // System.IO.StreamReader objStreamRead;
            // WebResponse objResponse;
            //string strOutput = string.Empty;

            //objResponse = wr.GetResponse();
            //objStreamReceive = objResponse.GetResponseStream();
            //objEncoding = System.Text.Encoding.GetEncoding("utf-8");
            //objStreamRead = new StreamReader(objStreamReceive, objEncoding); // Set function return value
            //strOutput = objStreamRead.ReadToEnd();
            ///////////////////////////////
            System.Net.HttpWebResponse wresp = (System.Net.HttpWebResponse)wr.GetResponse();

            System.Windows.Forms.WebBrowser wb = new System.Windows.Forms.WebBrowser();

            wb.DocumentStream = wresp.GetResponseStream();
            wb.ScriptErrorsSuppressed = true;

           wb.DocumentCompleted += (sndr, e) =>
            {
                /////////////After dumping the document text into a text file, I get a different page/////////////////
                //////////////I get the normal website instead of login page////////////////////////
               System.IO.StreamWriter file = new System.IO.StreamWriter("C:\\temp\\test.txt");
               Console.WriteLine(wb.DocumentText);
               file.WriteLine(wb.DocumentText);
               System.Windows.Forms.HtmlDocument d = wb.Document;

               System.Windows.Forms.HtmlElementCollection ctrlCol = d.GetElementsByTagName("script");
               foreach (System.Windows.Forms.HtmlElement tag in ctrlCol)
               {
                   tag.SetAttribute("src", string.Format("https://ib.nab.com.au{0}", tag.GetAttribute("src")));
               }


               ctrlCol = d.GetElementsByTagName("input");
               foreach (System.Windows.Forms.HtmlElement tag in ctrlCol)
               {
                   if (tag.GetAttribute("name") == "userid")
                   {
                       tag.SetAttribute("value", "123456");
                   }
                   else if (tag.GetAttribute("name") == "password")
                   {
                       tag.SetAttribute("value", "nabPassword");
                   }
                   file.WriteLine(tag.GetAttribute("name"));
               }

               file.Close();
               // object y = wb.Document.InvokeScript("validateLogin");
            };

           while (wb.ReadyState != System.Windows.Forms.WebBrowserReadyState.Complete)
           {
               System.Windows.Forms.Application.DoEvents();
           }
        }
        catch(Exception e)
        {
            System.IO.StreamWriter file = new System.IO.StreamWriter("C:\\temp\\error.txt");
            file.WriteLine(e.Message);
            Console.WriteLine(string.Format("error: {0}", e.Message));
            Console.ReadLine();
        }

我从一个线程调用了这个方法(因为你可能知道webbrowser需要是STA线程才能工作)。 正如代码中所解释的,我使用httpwebresponse方法正确获取了登录页面。但是当我尝试使用文档流加载到webbrowser时,我进入了一个不同的网站。

接下来的问题是,在我到达登录页面后,我该怎么做呢?如何模拟点击和填写数据(我的理论目前是尝试使用httpwebrequest发布一些数据)。

请对此有所了解。任何评论或信息非常感谢。 非常感谢你提前。

1 个答案:

答案 0 :(得分:0)

您可以使用类似浏览器的selenium并转到您要去的地方并使用HtmlAgilityPack解析页面。两者都有c#支持。非常简单的控制台应用程序可以休息

<强>硒

http://www.seleniumhq.org/docs/02_selenium_ide.jsp#chapter02-reference

<强> HtmlAgilityPack https://htmlagilitypack.codeplex.com/wikipage?title=Examples

您可以使用selenium和c#

填写表格和帖子
//Navigate to the site
 driver.Navigate().GoToUrl("http://www.google.com.au");
 // Find the text input element by its name
 IWebElement query = driver.FindElement(By.Name("q"));
 // Enter something to search for
 query.SendKeys("Selenium");
 // Now submit the form
 query.Submit();
 // Google's search is rendered dynamically with JavaScript.
 // Wait for the page to load, timeout after 5 seconds
 WebDriverWait wait = new WebDriverWait(driver, TimeSpan.FromSeconds(5));
 wait.Until((d) => { return d.Title.StartsWith("selenium"); });

您可以使用HtmlAgility

解析数据(此示例表)
var cols = doc.DocumentNode.SelectNodes("//table[@id='table2']//tr//td");
for (int ii = 0; ii < cols.Count; ii=ii+2)
{
    string name = cols[ii].InnerText.Trim();
    int age = int.Parse(cols[ii+1].InnerText.Split(' ')[1]);
}