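Using Regex to find links in Google's page source

Time: 2012-12-11 16:47:49

Tags: c# html regex

I am trying to use Regex to grab the links to the 10 websites that Google returns on the first page when you search for something. I am new to Regex and am having a lot of trouble with it: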

MatchCollection links = Regex.Matches(indexPage, @"<h3 class=""r""><a href=""\s*(.+?)\s*"" class=l", RegexOptions.Multiline);

Once I have the links in the collection, I add them to a list:

foreach (Match link in links) {
    string result = link.Groups[1].Value;
    results.Add(result);
}

No links are found. Any help would be much appreciated.

1 Answer:

Answer 0: (score: 1)

To find all URLs:

    "#^((?#
    the scheme:
    )(?:https?://)(?#
    second level domains and beyond:
    )(?:[\S]+\.)+((?#
    top level domains:
    )MUSEUM|TRAVEL|AERO|ARPA|ASIA|EDU|GOV|MIL|MOBI|(?#
    )COOP|INFO|NAME|BIZ|CAT|COM|INT|JOBS|NET|ORG|PRO|TEL|(?#
    )A[CDEFGILMNOQRSTUWXZ]|B[ABDEFGHIJLMNORSTVWYZ]|(?#
    )C[ACDFGHIKLMNORUVXYZ]|D[EJKMOZ]|(?#
    )E[CEGHRSTU]|F[IJKMOR]|G[ABDEFGHILMNPQRSTUWY]|(?#
    )H[KMNRTU]|I[DELMNOQRST]|J[EMOP]|(?#
    )K[EGHIMNPRWYZ]|L[ABCIKRSTUVY]|M[ACDEFGHKLMNOPQRSTUVWXYZ]|(?#
    )N[ACEFGILOPRUZ]|OM|P[AEFGHKLMNRSTWY]|QA|R[EOSUW]|(?#
    )S[ABCDEGHIJKLMNORTUVYZ]|T[CDFGHJKLMNOPRTVWZ]|(?#
    )U[AGKMSYZ]|V[ACEGINU]|W[FS]|Y[ETU]|Z[AMW])(?#
    the path, can be there or not:
    )(/[a-z0-9\._/~%\-\+&\#\?!=\(\)@]*)?)$#i"
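
The pattern above appears to be written with PCRE-style `#...#i` delimiters, and its `^` and `$` anchors make it match only when the whole input is a single URL. A minimal C# sketch of the same idea, adapted for searching inside a page, might look like the following. It rests on several assumptions that are not in the original answer: the class name `UrlFinder` and the method `FindUrls` are hypothetical, the anchors and delimiters are dropped, case-insensitivity is passed as `RegexOptions.IgnoreCase`, and the long top-level-domain list is replaced by `[a-z]{2,}` for brevity.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class UrlFinder
{
    // Simplified, unanchored variant of the pattern above (an assumption):
    // scheme, one or more dot-separated host labels, a generic TLD, optional path.
    // .NET accepts (?#...) inline comments, so they are kept for readability.
    private static readonly Regex UrlPattern = new Regex(
        @"https?://(?#host labels:)(?:[a-z0-9\-]+\.)+[a-z]{2,}(?#optional path:)(?:/[a-z0-9._/~%\-+&#?!=()@]*)?",
        RegexOptions.IgnoreCase);

    // Returns every substring of text that matches the URL pattern, in order.
    public static List<string> FindUrls(string text)
    {
        var results = new List<string>();
        foreach (Match m in UrlPattern.Matches(text))
        {
            results.Add(m.Value);
        }
        return results;
    }
}
```

Calling `UrlFinder.FindUrls(indexPage)` would return every URL-like substring in the downloaded page, not only the ten result links, so the output would probably still need filtering. Because the path character class excludes the double quote, a URL inside an `href="..."` attribute stops at the closing quote.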