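In short, I'm trying to search Google for a keyword, then grab the URLs of the first 10 results and save them.
This is a stripped-down command-line version of the code. It should return at least one result. If it works here, I can apply it to my full code and get all of the results.
Basically, the code I have now fails when I run it against the entire Google page source. If I feed it arbitrary sections copied from Google's HTML source, it works fine. To me, that means my regex has an error somewhere.
If there is a better way to do this than a regex, please let me know. The URLs sit between <h3 class="r"><a href="
and " class=l onmousedown="return clk(this.href
I got this regex code from a generator, but I have a hard time understanding regular expressions, since nothing I've read explains them clearly. If someone could pinpoint the error and explain why, I'd really appreciate it.
Thanks, Kevin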
using System;
using System.Text.RegularExpressions;
using System.Net;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            WebClient wc = new WebClient();
            string keyword = "seo nj";
            // Escape the keyword so the space (and any other reserved character) is valid in the query string.
            string html = wc.DownloadString(String.Format("http://www.google.com/search?q={0}", Uri.EscapeDataString(keyword)));

            string re1 = "(<)"; // Literal '<' that opens the h3 tag
            string re2 = "(h3)"; // Tag name "h3"
            string re3 = "(\\s+)"; // White Space 1
            string re4 = "(class)"; // Attribute name "class"
            string re5 = "(=)"; // Literal '='
            string re6 = "(\"r\")"; // Double-quoted value "r"
            string re7 = "(>)"; // Literal '>' that closes the h3 tag
            string re8 = "(<)"; // Literal '<' that opens the anchor tag
            string re9 = "([a-z])"; // Single letter, here the "a" of <a
            string re10 = "(\\s+)"; // White Space 2
            string re11 = "((?:[a-z][a-z]+))"; // Word 1, here "href"
            string re12 = "(=)"; // Literal '='
            string re13 = ".*?"; // Non-greedy filler (no capture group), here the opening quote
            string re14 = "((?:http|https)(?::\\/{2}[\\w]+)(?:[\\/|\\.]?)(?:[^\\s\"]*))"; // HTTP URL 1 (capture group 13)
            string re15 = "(\")"; // Closing double quote of href
            string re16 = "(\\s+)"; // White Space 3
            string re17 = "(class)"; // Word 2
            string re18 = "(=)"; // Literal '='
            string re19 = "(l)"; // Unquoted class value "l"
            string re20 = "(\\s+)"; // White Space 4
            string re21 = "(onmousedown)"; // Word 3
            string re22 = "(=)"; // Literal '='
            string re23 = "(\")"; // Opening double quote of onmousedown
            string re24 = "(return)"; // Word 4
            string re25 = "(\\s+)"; // White Space 5
            string re26 = "(clk)"; // Word 5

            Regex r = new Regex(re1 + re2 + re3 + re4 + re5 + re6 + re7 + re8 + re9 + re10 + re11 + re12 + re13 + re14 + re15 + re16 + re17 + re18 + re19 + re20 + re21 + re22 + re23 + re24 + re25 + re26, RegexOptions.IgnoreCase | RegexOptions.Singleline);
            // Match against the downloaded page; the variable holding it is "html".
            Match m = r.Match(html);
            if (m.Success)
            {
                Console.WriteLine("Good");
                String c1 = m.Groups[1].ToString();
                String alphanum1 = m.Groups[2].ToString();
                String ws1 = m.Groups[3].ToString();
                String var1 = m.Groups[4].ToString();
                String c2 = m.Groups[5].ToString();
                String string1 = m.Groups[6].ToString();
                String c3 = m.Groups[7].ToString();
                String c4 = m.Groups[8].ToString();
                String w1 = m.Groups[9].ToString();
                String ws2 = m.Groups[10].ToString();
                String word1 = m.Groups[11].ToString();
                String c5 = m.Groups[12].ToString();
                String httpurl1 = m.Groups[13].ToString();
                String c6 = m.Groups[14].ToString();
                String ws3 = m.Groups[15].ToString();
                String word2 = m.Groups[16].ToString();
                String c7 = m.Groups[17].ToString();
                String c8 = m.Groups[18].ToString();
                String ws4 = m.Groups[19].ToString();
                String word3 = m.Groups[20].ToString();
                String c9 = m.Groups[21].ToString();
                String c10 = m.Groups[22].ToString();
                String word4 = m.Groups[23].ToString();
                String ws5 = m.Groups[24].ToString();
                String word5 = m.Groups[25].ToString();
                //Console.Write("(" + c1.ToString() + ")" + "(" + alphanum1.ToString() + ")" + "(" + ws1.ToString() + ")" + "(" + var1.ToString() + ")" + "(" + c2.ToString() + ")" + "(" + string1.ToString() + ")" + "(" + c3.ToString() + ")" + "(" + c4.ToString() + ")" + "(" + w1.ToString() + ")" + "(" + ws2.ToString() + ")" + "(" + word1.ToString() + ")" + "(" + c5.ToString() + ")" + "(" + httpurl1.ToString() + ")" + "(" + c6.ToString() + ")" + "(" + ws3.ToString() + ")" + "(" + word2.ToString() + ")" + "(" + c7.ToString() + ")" + "(" + c8.ToString() + ")" + "(" + ws4.ToString() + ")" + "(" + word3.ToString() + ")" + "(" + c9.ToString() + ")" + "(" + c10.ToString() + ")" + "(" + word4.ToString() + ")" + "(" + ws5.ToString() + ")" + "(" + word5.ToString() + ")" + "\n");
                Console.WriteLine(httpurl1);
            }
            else
            {
                Console.WriteLine("Bad");
            }
            Console.ReadLine();
        }
    }
}
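For comparison, here is a minimal sketch of how the 26-piece generated pattern could be condensed into one short regex with a single named capture group, using Regex.Matches to collect up to 10 URLs instead of only the first. It assumes the markup is exactly the fragment quoted in the question; the HTML that Google actually returns to a script may differ from what a browser shows, which could be one reason a pattern works on pasted fragments but not on the downloaded page. The class name ResultLinkExtractor and the sample string are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ResultLinkExtractor
{
    // Matches: <h3 class="r"><a href="URL"
    // and captures URL in the named group "url".
    // Assumption: the page uses the markup quoted in the question.
    private static readonly Regex LinkPattern = new Regex(
        "<h3\\s+class=\"r\">\\s*<a\\s+href=\"(?<url>https?://[^\"]+)\"",
        RegexOptions.IgnoreCase);

    // Returns at most `max` result URLs found in `html`, in document order.
    public static List<string> ExtractUrls(string html, int max)
    {
        var urls = new List<string>();
        foreach (Match m in LinkPattern.Matches(html))
        {
            if (urls.Count >= max) break;
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }
}

static class Demo
{
    static void Main()
    {
        // Hypothetical fragment modelled on the markup quoted in the question.
        string sample = "<h3 class=\"r\"><a href=\"http://example.com/page\" "
                      + "class=l onmousedown=\"return clk(this.href)\">Example</a></h3>";

        foreach (string url in ResultLinkExtractor.ExtractUrls(sample, 10))
        {
            Console.WriteLine(url); // prints http://example.com/page
        }
    }
}
```

The named group makes the intent readable: everything outside (?<url>...) only anchors the match, and [^"]+ stops the capture at the closing quote of the href attribute.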
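Answer 0 (score: 1)
You're doing it wrong.
Google has an API for running searches programmatically. When there is already a published, supported way to do what you want, don't put yourself through the pain of trying to parse HTML with regular expressions.
Also, what you are trying to do (submitting automated searches through the Google website and scraping the results) violates section 5.3 of the Terms of Service:
You specifically agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers)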
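As a rough illustration of the API route, the sketch below assumes Google's Custom Search JSON API, which is one search API Google offers and not necessarily the one this answer had in mind. The API key and search engine ID are placeholders you would have to supply, and the class and method names are invented for the example. It builds the request URL and reads the link field of each entry in the items array of the JSON response.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

static class GoogleSearchApiSketch
{
    // Builds a Custom Search JSON API request URL.
    // apiKey and engineId are credentials the caller must provide.
    public static string BuildRequestUrl(string apiKey, string engineId, string keyword)
    {
        return "https://www.googleapis.com/customsearch/v1"
            + "?key=" + Uri.EscapeDataString(apiKey)
            + "&cx=" + Uri.EscapeDataString(engineId)
            + "&q=" + Uri.EscapeDataString(keyword);
    }

    // Pulls the "link" value out of every entry in the "items" array.
    // Returns an empty list when the response has no "items" property.
    public static List<string> ExtractLinks(string json)
    {
        var links = new List<string>();
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            if (doc.RootElement.TryGetProperty("items", out JsonElement items))
            {
                foreach (JsonElement item in items.EnumerateArray())
                {
                    if (item.TryGetProperty("link", out JsonElement link))
                    {
                        links.Add(link.GetString());
                    }
                }
            }
        }
        return links;
    }

    static async Task Main()
    {
        // Placeholder credentials: the request fails until real values are used.
        string url = BuildRequestUrl("YOUR_API_KEY", "YOUR_ENGINE_ID", "seo nj");

        using (var client = new HttpClient())
        {
            string json = await client.GetStringAsync(url);
            foreach (string link in ExtractLinks(json))
            {
                Console.WriteLine(link);
            }
        }
    }
}
```

Because the response is structured JSON, no HTML parsing or regex is involved, and the code does not depend on the page's markup staying the same.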
Answer 1 (score: 0)
Parsing HTML with RegEx is masochism.
Try the HTML Agility Pack instead. It lets you parse HTML properly. For an example of how to use it, see this question.
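To give a feel for what that looks like, here is a minimal sketch using the HTML Agility Pack (the third-party HtmlAgilityPack NuGet package, which has to be installed separately). It assumes the result links still sit inside <h3 class="r"> elements as quoted in the question; the XPath expression, the class name AgilityPackSketch, and the sample fragment are assumptions for illustration only.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party package, not part of the .NET base library

static class AgilityPackSketch
{
    // Returns at most `max` href values of anchors directly inside <h3 class="r">.
    public static List<string> ExtractUrls(string html, int max)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var urls = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors =
            doc.DocumentNode.SelectNodes("//h3[@class='r']/a[@href]");
        if (anchors == null) return urls;

        foreach (HtmlNode anchor in anchors)
        {
            if (urls.Count >= max) break;
            urls.Add(anchor.GetAttributeValue("href", ""));
        }
        return urls;
    }

    static void Main()
    {
        // Hypothetical fragment modelled on the markup quoted in the question.
        string sample = "<h3 class=\"r\"><a href=\"http://example.com/page\" "
                      + "class=l onmousedown=\"return clk(this.href)\">Example</a></h3>";

        foreach (string url in ExtractUrls(sample, 10))
        {
            Console.WriteLine(url);
        }
    }
}
```

The parser copes with details that trip up a regex, such as the unquoted class=l attribute or attributes appearing in a different order, so only the XPath needs updating if the markup changes.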