C#: how to use a regular expression to scrape a website

Asked: 2018-01-27 16:00:39

Tags: c# regex

When I click my button1, it runs this:

 MatchCollection matchCollection = new Regex(@"(?<=/&gt;)\d+").Matches(new StreamReader(((HttpWebResponse)((HttpWebRequest)WebRequest.Create("http://www.proxyserverlist24.top/feeds/posts/default")).GetResponse()).GetResponseStream()).ReadToEnd());

Basically, it goes to http://www.proxyserverlist24.top/feeds/posts/default and tries to extract the numbers between /&gt; and &lt;br. The feed content looks like this:

/&gt; 103.12.161.1:65103&lt;br /&gt;103.16.61.134:8080&lt;br /&gt;103.21.77.106:8080&lt;br

How can I grab these numbers?

1 Answer:

Answer 0 (score: 1)

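There is no need for a regex. Your link returns XML, so you can use an XML parser to read the feed and an HTML parser (HtmlAgilityPack) to parse the text of each "content" tag. The final code would be: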

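// Needs System.Linq, System.Net, System.Net.Http, System.Xml.Linq and the HtmlAgilityPack package.
// Because of the await, this must run inside an async method (e.g. an async click handler).
IPAddress tempip;
int port;
List<IPEndPoint> proxies = null;

using (var client = new HttpClient())
{
    var doc = new HtmlAgilityPack.HtmlDocument();
    XNamespace ns = "http://www.w3.org/2005/Atom";
    var xml = await client.GetStringAsync("http://www.proxyserverlist24.top/feeds/posts/default");
    var xDoc = XDocument.Parse(xml);
    proxies = xDoc.Descendants(ns + "entry")
        // Each Atom entry carries its HTML body as the text of the "content" element.
        .Select(x => (string)x.Element(ns + "content"))
        .Where(x => !string.IsNullOrEmpty(x))
        .SelectMany(x =>
        {
            doc.LoadHtml(x);
            // SelectNodes returns null (not an empty collection) when nothing matches.
            var spans = doc.DocumentNode.SelectNodes("//span[not(span)]");
            if (spans == null) return Enumerable.Empty<IPEndPoint>();
            return spans
                        .SelectMany(n => n.Descendants())
                        // Keep only text of the form "ip:port" that parses cleanly.
                        .Select(n => n.InnerText.Split(":".ToCharArray(), StringSplitOptions.RemoveEmptyEntries))
                        .Where(n => n.Length == 2)
                        .Where(n => IPAddress.TryParse(n[0], out tempip))
                        .Where(n => int.TryParse(n[1], out port))
                        .Select(n => new IPEndPoint(IPAddress.Parse(n[0]), int.Parse(n[1])));
        })
        .ToList();
}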

In fact, a shorter regex solution is also possible, but using a regex to parse XML or HTML is not a good idea, as mentioned in the comments.
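
For completeness, here is a rough sketch of what such a regex version might look like. It is not code from the question or the comments: the class name `ProxyRegexSketch` and the method `ExtractEndpoints` are made up for illustration, and it assumes the downloaded feed text contains entries shaped like `ip:port`. The pattern in the question, `(?<=/&gt;)\d+`, stops at the first dot, so it can only ever capture the leading digits of an address, and it misses any entry where whitespace follows `/&gt;`. Matching the whole `ip:port` pair avoids both problems.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class ProxyRegexSketch
{
    // Loose pattern: four dot-separated groups of 1-3 digits, a colon, then 1-5 digits.
    private static readonly Regex EndpointPattern =
        new Regex(@"\b(\d{1,3}(?:\.\d{1,3}){3}):(\d{1,5})\b");

    // Hypothetical helper: returns every ip:port pair in the text that also parses cleanly.
    public static List<IPEndPoint> ExtractEndpoints(string text)
    {
        var result = new List<IPEndPoint>();
        foreach (Match m in EndpointPattern.Matches(text))
        {
            // The regex alone would accept 999.1.1.1:99999, so validate both parts.
            if (IPAddress.TryParse(m.Groups[1].Value, out IPAddress ip)
                && int.TryParse(m.Groups[2].Value, out int port)
                && port <= IPEndPoint.MaxPort)
            {
                result.Add(new IPEndPoint(ip, port));
            }
        }
        return result;
    }
}
```

Calling `ProxyRegexSketch.ExtractEndpoints(feedText)` on the string returned by `ReadToEnd()` or `GetStringAsync` would then give a list of `IPEndPoint` values. The caveat above still applies: this only works as long as the feed keeps its current shape, whereas the parser-based code depends less on the exact markup.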