Regular expression to extract the favicon URL from a web page

Asked: 2011-07-02 09:23:39

Tags: c# html regex favicon

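Please help me find the favicon URL in the sample HTML below using a regular expression. It should also check for the ".ico" file extension. I am developing a personal bookmarking site and want to save the favicons of the links I bookmark. I have already written C# code that converts the icon to GIF and saves it, but my knowledge of regular expressions is very limited, so I cannot select this tag: the way the tag is closed differs from site to site. Examples of the tag ending: "/>" and "</link>".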

My programming language is C#.

<meta name="description" content="Create 360 degree rotation product presentation online with 3Dbin. 360 product pics, object rotationg presentation can be created for your website at 3DBin.com web service." />
<meta name="robots" content="index, follow" />
<meta name="verify-v1" content="x42ckCSDiernwyVbSdBDlxN0x9AgHmZz312zpWWtMf4=" />
<link rel="shortcut icon" href="http://3dbin.com/favicon.ico" type="image/x-icon" />
<link rel="stylesheet" type="text/css" href="http://3dbin.com/css/1261391049/style.min.css" />
<!--[if lt IE 8]>
    <script src="http://3dbin.com/js/1261039165/IE8.js" type="text/javascript"></script>
<![endif]-->

Solution: there is another way. Download Html Agility Pack and add a reference to its DLL. Thanks for your help. I really like this site :)

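    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(readcontent);

    // SelectNodes returns null (not an empty collection) when nothing matches.
    HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//link[@href]");
    if (links != null)
    {
        foreach (HtmlNode link in links)
        {
            HtmlAttribute att = link.Attributes["href"];
            // Compare case-insensitively so "FAVICON.ICO" is also accepted.
            if (att.Value.EndsWith(".ico", StringComparison.OrdinalIgnoreCase))
            {
                faviconurl = att.Value;
            }
        }
    }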

4 Answers:

Answer 0: (score: 1)

<link\s+[^>]*(?:href\s*=\s*"([^"]+)"\s+)?rel\s*=\s*"shortcut icon"(?:\s+href\s*=\s*"([^"]+)")?
Maybe... It is not robust, but it may work. (I used Perl regex syntax.)
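
A pattern like this depends on the order of rel and href inside the tag. As a minimal C# sketch of an order-independent variant, one could first isolate each <link> tag and then read rel and href from it separately. The names FaviconRegexSketch, FindFaviconUrls and GetAttribute are made up for illustration, and the sketch assumes attribute values are quoted. It is still a regex heuristic, not a replacement for a real HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FaviconRegexSketch
{
    // Matches one <link ...> start tag, whether it ends with ">" or "/>".
    private static readonly Regex LinkTag =
        new Regex(@"<link\b[^>]*>", RegexOptions.IgnoreCase);

    // Returns the value of a single- or double-quoted attribute, or null if absent.
    private static string GetAttribute(string tag, string name)
    {
        Match m = Regex.Match(
            tag,
            @"\s" + Regex.Escape(name) + @"\s*=\s*(?:""([^""]*)""|'([^']*)')",
            RegexOptions.IgnoreCase);
        if (!m.Success) return null;
        return m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value;
    }

    // Collects href values of link tags whose rel mentions "icon"
    // and whose href ends in ".ico", regardless of attribute order.
    public static List<string> FindFaviconUrls(string html)
    {
        var urls = new List<string>();
        foreach (Match tag in LinkTag.Matches(html))
        {
            string rel = GetAttribute(tag.Value, "rel");
            string href = GetAttribute(tag.Value, "href");
            if (rel == null || href == null) continue;

            bool isIcon = rel.IndexOf("icon", StringComparison.OrdinalIgnoreCase) >= 0;
            bool isIco = href.EndsWith(".ico", StringComparison.OrdinalIgnoreCase);
            if (isIcon && isIco) urls.Add(href);
        }
        return urls;
    }
}
```

Calling FaviconRegexSketch.FindFaviconUrls(html) on the sample HTML from the question should return only the href of the "shortcut icon" link, since the stylesheet link fails both checks. Unquoted attribute values and tags hidden inside comments or scripts are not handled.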

Answer 1: (score: 1)

This should match the entire link tag that contains href=http://3dbin.com/favicon.ico:

 <link .*? href="http://3dbin\.com/favicon\.ico" [^>]* />

Correction based on your comment:

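I see you have a C# solution, great! But if you still want to know whether it can be done with a regular expression, the expression below will do what you want. Group 1 of the match will contain only the URL.

 <link .*? href="(.*?\.ico)"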

A simple C# snippet that uses it:

// This is the snippet from your example with an extra link item in the form <link ... href="...ico" > ... </link>,
// just to make sure it would pick it up properly.
String htmlText = "<meta name=\"description\" content=\"Create 360 degree rotation product presentation online with 3Dbin. 360 product pics, object rotationg presentation can be created for your website at 3DBin.com web service.\" /><meta name=\"robots\" content=\"index, follow\" /><meta name=\"verify-v1\" content=\"x42ckCSDiernwyVbSdBDlxN0x9AgHmZz312zpWWtMf4=\" /><link rel=\"shortcut icon\" href=\"http://3dbin.com/favicon.ico\" type=\"image/x-icon\" /><link rel=\"shortcut icon\" href=\"http://anotherURL/someicofile.ico\" type=\"image/x-icon\">just to make sure it works with different link ending</link><link rel=\"stylesheet\" type=\"text/css\" href=\"http://3dbin.com/css/1261391049/style.min.css\" /><!--[if lt IE 8]>    <script src=\"http://3dbin.com/js/1261039165/IE8.js\" type=\"text/javascript\"></script><![endif]-->";

// The dot before "ico" is escaped so it matches a literal "." only.
foreach (Match match in Regex.Matches(htmlText, "<link .*? href=\"(.*?\\.ico)\""))
{
    String url = match.Groups[1].Value;

    Console.WriteLine(url);
}

This prints the following to the console:

http://3dbin.com/favicon.ico
http://anotherURL/someicofile.ico

Answer 2: (score: 1)

This is not a job for regular expressions, as you will see if you spend 2 minutes on StackOverflow searching for how to parse HTML.

Use an HTML parser instead!

Here is a simple example in Python (I'm sure this is just as doable in C#):

% python
Python 2.7.1 (r271:86832, May 16 2011, 19:49:41) 
[GCC 4.2.1 (Apple Inc. build 5646) (dot 1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from BeautifulSoup import BeautifulSoup
>>> import urllib2
>>> page = urllib2.urlopen('https://stackoverflow.com/')
>>> soup = BeautifulSoup(page)
>>> link = soup.html.head.find(lambda x: x.name == 'link' and x['rel'] == 'shortcut icon')
>>> link['href']
u'http://cdn.sstatic.net/stackoverflow/img/favicon.ico'
>>> link['href'].endswith('.ico')
True

Answer 3: (score: 1)

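I had a go at this a while back, so here is something quite simple. First, it tries to find a /favicon.ico file. If that fails, I load the page with Html Agility Pack and then use XPath to find any <link> tags. I loop over the link tags to see whether they have a rel='icon' attribute. If they do, I grab the href attribute and, if it exists, expand it into an absolute URL for that site.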

Please feel free to play around with this and suggest any improvements.

private static Uri GetFaviconUrl(string siteUrl)
{
    // try looking for a /favicon.ico first
    var url = new Uri(siteUrl);
    var faviconUrl = new Uri(string.Format("{0}://{1}/favicon.ico", url.Scheme, url.Host));
    try
    {
        using (var httpWebResponse = WebRequest.Create(faviconUrl).GetResponse() as HttpWebResponse)
        {
            if (httpWebResponse != null && httpWebResponse.StatusCode == HttpStatusCode.OK)
            {
                // Log("Found a /favicon.ico file for {0}", url);
                return faviconUrl;
            }
        }
    }
    catch (WebException)
    {
        // The request failed or returned an error status; fall through to parsing the HTML.
    }

    // otherwise parse the html and look for <link rel='icon' href='' /> using html agility pack
    var htmlDocument = new HtmlWeb().Load(url.ToString());
    var links = htmlDocument.DocumentNode.SelectNodes("//link");
    if (links != null)
    {
        foreach (var linkTag in links)
        {
            var rel = GetAttr(linkTag, "rel");
            if (rel == null)
                continue;

            // IndexOf returns 0 when rel is exactly "icon", so the test must be >= 0.
            if (rel.Value.IndexOf("icon", StringComparison.InvariantCultureIgnoreCase) >= 0)
            {
                var href = GetAttr(linkTag, "href");
                if (href == null)
                    continue;

                Uri absoluteUrl;
                if (Uri.TryCreate(href.Value, UriKind.Absolute, out absoluteUrl))
                {
                    // Log("Found an absolute favicon url {0}", absoluteUrl);
                    return absoluteUrl;
                }

                // Resolve against the page URL so hrefs without a leading slash also work.
                var expandedUrl = new Uri(url, href.Value);
                //Log("Found a relative favicon url for {0} and expanded it to {1}", url, expandedUrl);
                return expandedUrl;
            }
        }
    }

    // Log("Could not find a favicon for {0}", url);
    return null;
}

public static HtmlAttribute GetAttr(HtmlNode linkTag, string attr)
{
    return linkTag.Attributes.FirstOrDefault(x => x.Name.Equals(attr, StringComparison.InvariantCultureIgnoreCase));
}
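
The step of expanding an href into an absolute URL can also be left to System.Uri. The following is a minimal sketch of that idea, not the method used in this answer: FaviconUrlSketch and ResolveHref are hypothetical names, and the page URL is assumed to be a valid absolute URL.

```csharp
using System;

public static class FaviconUrlSketch
{
    // Resolves an href taken from a <link> tag against the URL of the page
    // it was found on. Returns null when the href cannot be combined with it.
    public static Uri ResolveHref(string pageUrl, string href)
    {
        var baseUri = new Uri(pageUrl);
        Uri resolved;
        return Uri.TryCreate(baseUri, href, out resolved) ? resolved : null;
    }
}
```

For example, ResolveHref("http://example.com/blog/post.html", "img/favicon.ico") resolves relative to the page's directory, while an href that is already absolute comes back unchanged. The example.com addresses are placeholders.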