How do I parse an HTML document with a regular expression to find the og:image tag?

Asked: 2013-07-29 13:59:06

Tags: c# regex html-parsing

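I have programmatically downloaded the contents of a web page and stored them in a string variable. What is the best way to find the content URL of the "og:image" meta tag?

For example, suppose a fragment of the page's source looks like this: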

<meta property="og:site_name" content="The Christian Science Monitor"  />
<meta property="og:type" content="article"  />
<meta property="og:url" content="http://www.csmonitor.com/Business/2013/0729/Cannes-jewel-heist-53-million-in-diamonds-jewels-stolen-from-hotel"  />
<meta property="og:description" content="Cannes jewel heist saw $53 million in diamonds and other precious gems stolen from a hotel on the French Riviera. The Cannes jewel heist is the latest in a series of several brazen jewelry thefts in Europe in recent years."  />
<meta property="og:image" content="http://www.csmonitor.com/var/ezflow_site/storage/images/media/content/2013/0729-jewels/16474969-1-eng-US/0729-jewels.jpg"  />
<meta property="og:title" content="Cannes jewel heist: $53 million in diamonds, jewels stolen from hotel"  />
<meta name="sailthru.author" content="Thomas Adamson"  />

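I want to extract the string "http://www.csmonitor.com/var/ezflow_site/storage/images/media/content/2013/0729-jewels/16474969-1-eng-US/0729-jewels.jpg", which is the target of the "og:image" tag.

I could write some logic that finds the substring and reads the value from there, but I would like to do this with a regular expression, along these lines: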

List<Uri> links = new List<Uri>();
string regexImgSrc = @"<img[^>]*?src\s*=\s*[""']?([^'"" >]+?)[ '""][^>]*?>";

MatchCollection matchesImgSrc = Regex.Matches(htmlSource, regexImgSrc, RegexOptions.IgnoreCase | RegexOptions.Singleline);

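The snippet above scans the page source and extracts all img tags. I want to do the same thing for the og:image tag, but I am not very proficient with regular expressions.

1 Answer:

Answer 0 (score: 0)

I don't think you should use a regular expression here. It can get fiddly, depending on how the tags are written in the HTML. For example, content= might come before property=. I used plain string-handling code instead, because I did not want to pull in an HTML or XML parser library. Here is what I ended up with.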

// strIn holds the downloaded HTML source.
Dictionary<string, string> metatags = new Dictionary<string, string>();
int TagStart,TagEnd;
string element;
int AttrStart, AttrEnd;
string PropVal,ContentVal;
TagStart = strIn.IndexOf("<meta", StringComparison.OrdinalIgnoreCase);
while(TagStart != -1) {
    TagEnd = strIn.IndexOf(">", TagStart + 1, StringComparison.OrdinalIgnoreCase);
    if (TagEnd != -1) {
        element = strIn.Substring(TagStart, TagEnd - TagStart + 1);
        //Console.WriteLine("\nPROCESSING META TAG: {0}",element);
        PropVal = null;
        ContentVal = null;

        // Get "property" attribute
        AttrStart = element.IndexOf("property=\"", StringComparison.OrdinalIgnoreCase);
        if (AttrStart != -1) {
            AttrStart = AttrStart + 10;
            AttrEnd = element.IndexOf("\"", AttrStart, StringComparison.OrdinalIgnoreCase);
            if(AttrEnd != -1) {
                PropVal = element.Substring(AttrStart, AttrEnd - AttrStart);
            }
        }
        // Get "content" attribute
        AttrStart = element.IndexOf("content=\"", StringComparison.OrdinalIgnoreCase);
        if(AttrStart != -1) {
            AttrStart = AttrStart + 9;
            AttrEnd = element.IndexOf("\"", AttrStart, StringComparison.OrdinalIgnoreCase);
            if(AttrEnd != -1) {
                ContentVal = element.Substring(AttrStart, AttrEnd - AttrStart);
            }
        }
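        // Use the indexer rather than Add(): a page may repeat a property
        // (e.g. several og:image tags), and Add() throws on a duplicate key.
        if (PropVal != null && ContentVal != null)
            metatags[PropVal] = ContentVal;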

    }
    // go to next meta tag
    TagStart = strIn.IndexOf("<meta", TagStart + 1, StringComparison.OrdinalIgnoreCase);
}
Console.WriteLine("\nOG meta tags");
foreach(var item in metatags) {
    Console.WriteLine("KEY={0} VALUE={1}",item.Key,item.Value);
}
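
If a regular expression is still preferred, the attribute-order problem mentioned above can be sidestepped by matching each meta tag first and then reading its attributes one at a time. The following is only a minimal sketch under a few assumptions: the class name OgImageFinder and the method FindOgImage are hypothetical, attribute values are assumed to be quoted, and a ">" inside an attribute value would still cut the tag short. It is not a substitute for a real HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative, not from the original post.
static class OgImageFinder
{
    // Matches one whole <meta ...> tag (assumes no '>' inside attribute values).
    static readonly Regex MetaTag =
        new Regex(@"<meta\b[^>]*>", RegexOptions.IgnoreCase);

    // Matches name="value" or name='value' inside a tag.
    // .NET allows the same group name ("value") in both alternatives.
    static readonly Regex Attr = new Regex(
        @"(?<name>[\w:-]+)\s*=\s*(?:""(?<value>[^""]*)""|'(?<value>[^']*)')");

    // Returns the content of the first og:image meta tag, or null if none is found.
    public static string FindOgImage(string html)
    {
        foreach (Match tag in MetaTag.Matches(html))
        {
            string property = null, content = null;

            // Collect the attributes regardless of the order they appear in.
            foreach (Match attr in Attr.Matches(tag.Value))
            {
                string name = attr.Groups["name"].Value;
                if (name.Equals("property", StringComparison.OrdinalIgnoreCase))
                    property = attr.Groups["value"].Value;
                else if (name.Equals("content", StringComparison.OrdinalIgnoreCase))
                    content = attr.Groups["value"].Value;
            }

            if (content != null &&
                string.Equals(property, "og:image", StringComparison.OrdinalIgnoreCase))
                return content;
        }
        return null;
    }
}
```

Calling OgImageFinder.FindOgImage(htmlSource) on the page source from the question should return the URL in the og:image tag. Splitting the work into two small patterns keeps each one readable and makes the result independent of whether content= or property= comes first.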