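How to scrape content with the HTML Agility Pack

Time: 2013-12-18 06:36:01

Tags: c# html

I am completely new to the HTML Agility Pack. How can I use it in C# to get this content (the proxies)?

My code: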

    string url = "http://www.proxybase.de/";
    HtmlWeb web = new HtmlWeb();
    HtmlAgilityPack.HtmlDocument doc = web.Load(url);
    var nodes = doc.DocumentNode.SelectNodes("//table[@border='0' and @cellspacing='0' and @cellpadding='0']");

    if (nodes != null)
    {
        foreach (HtmlNode item in nodes)
        {
            if (item != null)
            {
                string s = item.InnerText;
                listView1.Items.Add(s);
            }
        }
    }
    else
    {
        MessageBox.Show("Nothing found");
    }

The HTML looks like this:

<table border="0" cellpadding="0" cellspacing="0">
 <tbody>
   <tr>...</tr> //Ignore first one
   <tr>
     <td>...</td>
     <td style="padding-left:5px;border-left;1px solid #999;"> 123.45.678.90:80  </td>
     <td style="padding-left:5px;border-left;1px solid #999;">...</td>
   </tr>
 </tbody>
</table>

Update

How can I use SelectSingleNode to select table data by index?
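
One possible way to pick a cell by index is to put the positions into the XPath expression itself. The sketch below is only an illustration under stated assumptions: it parses an inline sample string modelled on the table above instead of loading the live page, and the class name `ProxyCellDemo`, the method `SecondCellOfRow`, and the header-row contents are hypothetical. XPath positions are 1-based, so `tr[2]/td[2]` means the second cell of the second row.

```csharp
using System;
using HtmlAgilityPack;

public static class ProxyCellDemo
{
    // Inline sample modelled on the table in the question (header row contents are made up).
    public const string SampleHtml = @"
<table border=""0"" cellpadding=""0"" cellspacing=""0"">
 <tbody>
   <tr><td>No</td><td>Proxy</td><td>Type</td></tr>
   <tr><td>1</td><td> 123.45.678.90:80  </td><td>...</td></tr>
 </tbody>
</table>";

    // Returns the trimmed text of the second cell in the given row (1-based),
    // or null when no such cell exists.
    public static string SecondCellOfRow(string html, int rowIndex)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // tr[n] selects the n-th <tr> under its parent; td[2] selects the second <td> in that row.
        HtmlNode cell = doc.DocumentNode.SelectSingleNode(
            "//table[@border='0']//tr[" + rowIndex + "]/td[2]");

        return cell == null ? null : cell.InnerText.Trim();
    }

    public static void Main()
    {
        // Row 1 is the header, so row 2 is the first data row.
        Console.WriteLine(SecondCellOfRow(SampleHtml, 2));
    }
}
```

SelectSingleNode returns null when nothing matches, so the result should be checked before reading InnerText.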

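2 answers:

Answer 0: (score: 1)

I assume you want to store the information from the site (the IP addresses and so on) in a file or a database.

If that is the case, you are almost there. This should solve it: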

    string url = "http://www.proxybase.de/";
    HtmlWeb web = new HtmlWeb();
    HtmlAgilityPack.HtmlDocument doc = web.Load(url);
    // SelectNodes returns null when nothing matches, so check before iterating.
    var cells = doc.DocumentNode.SelectNodes("//td[@style='padding-left:5px;border-left;1px solid #999;']");
    if (cells != null)
    {
        foreach (HtmlNode node in cells)
        {
            string s = node.InnerText.Trim();
            // Now the IP address is stored in s.
            // You can either put it in a file/database or a webpage :)
        }
    }
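
Matching on the exact style string is brittle, because any change in the attribute text breaks the query. A possible alternative, sketched here under assumptions, is to select by position: skip the first row and take the second cell of every remaining row. The sample markup, the second address, and the names `ProxyListDemo` and `ExtractProxies` are hypothetical and only stand in for the real page.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ProxyListDemo
{
    // Inline sample modelled on the question's table; the second address is made up.
    public const string SampleHtml = @"
<table border=""0"" cellpadding=""0"" cellspacing=""0"">
 <tbody>
   <tr><td>No</td><td>Proxy</td><td>Type</td></tr>
   <tr><td>1</td><td> 123.45.678.90:80  </td><td>...</td></tr>
   <tr><td>2</td><td> 98.76.54.32:8080 </td><td>...</td></tr>
 </tbody>
</table>";

    // Collects the text of the second cell of every row except the first.
    public static List<string> ExtractProxies(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();
        HtmlNodeCollection cells = doc.DocumentNode.SelectNodes(
            "//table[@border='0' and @cellspacing='0' and @cellpadding='0']" +
            "//tr[position()>1]/td[2]");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (cells == null) return result;

        foreach (HtmlNode cell in cells)
        {
            // Decode entities such as &nbsp; and strip surrounding whitespace.
            string text = HtmlEntity.DeEntitize(cell.InnerText).Trim();
            if (text.Length > 0) result.Add(text);
        }
        return result;
    }

    public static void Main()
    {
        foreach (string proxy in ExtractProxies(SampleHtml))
            Console.WriteLine(proxy);
    }
}
```

The returned list can then be written to a file, a database, or a ListView, whichever fits the application.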

Answer 1: (score: 0)

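    // Requires: using System.Net; using HtmlAgilityPack;
    HtmlWeb hw = new HtmlWeb();
    hw.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)";
    // Route the request through a proxy; p is the object that defines ProxyOnPreRequest.
    hw.PreRequest = new HtmlAgilityPack.HtmlWeb.PreRequestHandler(p.ProxyOnPreRequest);
    // openUrl holds the address of the page to fetch.
    HtmlAgilityPack.HtmlDocument doc = hw.Load(openUrl);

    // Called by HtmlWeb before each request is sent.
    public bool ProxyOnPreRequest(HttpWebRequest request)
    {
        WebProxy myProxy = new WebProxy("203.189.134.17:80");
        request.Proxy = myProxy;
        return true; // ok, go on
    }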