C# regex problem: extracting URLs

Time: 2011-07-08 00:52:15

Tags: c# .net regex

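In short, I'm trying to search Google for a keyword, then grab the URLs of the first 10 results and save them.

This is a stripped-down command-line version of the code. It should return at least 1 result. If it works here, I can apply it to my full code and get all the results.

Basically, with the code as it stands, the match fails when I run it against Google's entire page source. If I instead feed it arbitrary portions of Google's HTML source, it works fine. To me that means my regex has an error somewhere.

If there is a better way to do this than regex, please let me know. The URL lies between <h3 class="r"><a href=" and " class=l onmousedown="return clk(this.href

I got this regex code from a generator, but I have a hard time understanding regex, because nothing I have read explains it clearly. If someone can spot the error and explain why, I would be very grateful.

Thanks, Kevin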

using System;
using System.Text.RegularExpressions;
using System.Net;

namespace ConsoleApplication1
{
    class Program
    {
    static void Main(string[] args)
    {
        WebClient wc = new WebClient();
        string keyword = "seo nj";

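        // Escape the keyword so the space in "seo nj" is safe inside the query string.
        string html = wc.DownloadString(String.Format("http://www.google.com/search?q={0}", Uri.EscapeDataString(keyword)));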

        string re1 = "(<)"; // Any Single Character 1
        string re2 = "(h3)";    // Alphanum 1
        string re3 = "(\\s+)";  // White Space 1
        string re4 = "(class)"; // Variable Name 1
        string re5 = "(=)"; // Any Single Character 2
        string re6 = "(\"r\")"; // Double Quote String 1
        string re7 = "(>)"; // Any Single Character 3
        string re8 = "(<)"; // Any Single Character 4
        string re9 = "([a-z])"; // Any Single Word Character (Not Whitespace) 1
        string re10 = "(\\s+)"; // White Space 2
        string re11 = "((?:[a-z][a-z]+))";  // Word 1
        string re12 = "(=)";    // Any Single Character 5
        string re13 = ".*?";    // Non-greedy match on filler
        string re14 = "((?:http|https)(?::\\/{2}[\\w]+)(?:[\\/|\\.]?)(?:[^\\s\"]*))";   // HTTP URL 1
        string re15 = "(\")";   // Any Single Character 6
        string re16 = "(\\s+)"; // White Space 3
        string re17 = "(class)";    // Word 2
        string re18 = "(=)";    // Any Single Character 7
        string re19 = "(l)";    // Any Single Character 8
        string re20 = "(\\s+)"; // White Space 4
        string re21 = "(onmousedown)";  // Word 3
        string re22 = "(=)";    // Any Single Character 9
        string re23 = "(\")";   // Any Single Character 10
        string re24 = "(return)";   // Word 4
        string re25 = "(\\s+)"; // White Space 5
        string re26 = "(clk)";  // Word 5

        Regex r = new Regex(re1 + re2 + re3 + re4 + re5 + re6 + re7 + re8 + re9 + re10 + re11 + re12 + re13 + re14 + re15 + re16 + re17 + re18 + re19 + re20 + re21 + re22 + re23 + re24 + re25 + re26, RegexOptions.IgnoreCase | RegexOptions.Singleline);
        Match m = r.Match(html); // match against the downloaded page source
        if (m.Success)
        {
            Console.WriteLine("Good");
            String c1 = m.Groups[1].ToString();
            String alphanum1 = m.Groups[2].ToString();
            String ws1 = m.Groups[3].ToString();
            String var1 = m.Groups[4].ToString();
            String c2 = m.Groups[5].ToString();
            String string1 = m.Groups[6].ToString();
            String c3 = m.Groups[7].ToString();
            String c4 = m.Groups[8].ToString();
            String w1 = m.Groups[9].ToString();
            String ws2 = m.Groups[10].ToString();
            String word1 = m.Groups[11].ToString();
            String c5 = m.Groups[12].ToString();
            String httpurl1 = m.Groups[13].ToString();
            String c6 = m.Groups[14].ToString();
            String ws3 = m.Groups[15].ToString();
            String word2 = m.Groups[16].ToString();
            String c7 = m.Groups[17].ToString();
            String c8 = m.Groups[18].ToString();
            String ws4 = m.Groups[19].ToString();
            String word3 = m.Groups[20].ToString();
            String c9 = m.Groups[21].ToString();
            String c10 = m.Groups[22].ToString();
            String word4 = m.Groups[23].ToString();
            String ws5 = m.Groups[24].ToString();
            String word5 = m.Groups[25].ToString();
            //Console.Write("(" + c1.ToString() + ")" + "(" + alphanum1.ToString() + ")" + "(" + ws1.ToString() + ")" + "(" + var1.ToString() + ")" + "(" + c2.ToString() + ")" + "(" + string1.ToString() + ")" + "(" + c3.ToString() + ")" + "(" + c4.ToString() + ")" + "(" + w1.ToString() + ")" + "(" + ws2.ToString() + ")" + "(" + word1.ToString() + ")" + "(" + c5.ToString() + ")" + "(" + httpurl1.ToString() + ")" + "(" + c6.ToString() + ")" + "(" + ws3.ToString() + ")" + "(" + word2.ToString() + ")" + "(" + c7.ToString() + ")" + "(" + c8.ToString() + ")" + "(" + ws4.ToString() + ")" + "(" + word3.ToString() + ")" + "(" + c9.ToString() + ")" + "(" + c10.ToString() + ")" + "(" + word4.ToString() + ")" + "(" + ws5.ToString() + ")" + "(" + word5.ToString() + ")" + "\n");
            Console.WriteLine(httpurl1);
        }
        else
        {
            Console.WriteLine("Bad");
        }
        Console.ReadLine();
    }
}
}
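For comparison, the 26 generated fragments can be collapsed into one pattern with a single named group. This is only a sketch: it assumes the markup looks exactly like the snippet quoted in the question (an h3 with class="r" followed directly by the link), and the names ResultLinkExtractor, ExtractUrls, and the example.com sample are made up for illustration. It runs against a hard-coded string rather than a live Google page, so it says nothing about what Google actually returns to WebClient.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ResultLinkExtractor
{
    // Assumed markup: <h3 class="r"><a href="URL" class=l onmousedown="return clk(...
    // The named group "url" captures everything between href=" and the closing quote.
    private static readonly Regex LinkPattern = new Regex(
        "<h3\\s+class=\"r\"\\s*>\\s*<a\\s+href=\"(?<url>https?://[^\"]+)\"",
        RegexOptions.IgnoreCase);

    // Returns every captured URL in document order (Matches, not Match,
    // so all results are collected instead of only the first).
    public static List<string> ExtractUrls(string html)
    {
        var urls = new List<string>();
        foreach (Match m in LinkPattern.Matches(html))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }
}

class RegexDemo
{
    static void Main()
    {
        // Hypothetical sample shaped like the markup described in the question.
        string sample =
            "<h3 class=\"r\"><a href=\"http://example.com/a\" class=l onmousedown=\"return clk(this.href)\">A</a></h3>\n" +
            "<h3 class=\"r\"><a href=\"https://example.org/b\" class=l onmousedown=\"return clk(this.href)\">B</a></h3>";

        foreach (string url in ResultLinkExtractor.ExtractUrls(sample))
        {
            Console.WriteLine(url);
        }
    }
}
```

Using Regex.Matches instead of Regex.Match is what turns "at least 1 result" into "all results on the page"; the pattern itself stays as brittle as any regex over HTML, which is the point the answers make.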

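2 Answers:

Answer 0: (score: 1)

You're doing it wrong.

Google has an API for performing searches programmatically. Don't put yourself through the pain of trying to parse HTML with regex when there is already a published, supported way to do what you want.

In addition, what you are trying to do (submitting automated searches through the Google website and scraping the results) violates section 5.3 of the Terms of Service:

  

You specifically agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers)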

Answer 1: (score: 0)

Parsing HTML with RegEx is masochism.

Try the HTML Agility Pack instead. It will let you parse the HTML properly. For an example of how to use it, see this question
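As a rough idea of what that could look like, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced and that result links still sit inside h3 elements with class="r", as described in the question. The class name ResultLinkParser, the method ExtractUrls, and the sample markup are invented for illustration; the input is a hard-coded string, not a live page.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package, not part of the .NET base library

static class ResultLinkParser
{
    // Collects the href of every <a> that is a direct child of <h3 class="r">.
    public static List<string> ExtractUrls(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var urls = new List<string>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection anchors =
            doc.DocumentNode.SelectNodes("//h3[@class='r']/a[@href]");
        if (anchors == null)
        {
            return urls;
        }

        foreach (HtmlNode anchor in anchors)
        {
            urls.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return urls;
    }
}

class AgilityPackDemo
{
    static void Main()
    {
        // Hypothetical sample shaped like the markup described in the question.
        string sample =
            "<html><body>" +
            "<h3 class=\"r\"><a href=\"http://example.com/a\" class=l>A</a></h3>" +
            "<h3 class=\"r\"><a href=\"https://example.org/b\" class=l>B</a></h3>" +
            "</body></html>";

        foreach (string url in ResultLinkParser.ExtractUrls(sample))
        {
            Console.WriteLine(url);
        }
    }
}
```

The XPath query only cares about the element structure, so extra attributes, attribute order, or whitespace changes inside the tags do not break it the way they break a character-level pattern.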