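Regex to extract all images and HTML pages

Time: 2012-04-16 13:58:44

Tags: c# wpf regex

I am trying to work out the regular expressions below and can't seem to get them right. Can anyone point me in the right direction?

In short, I have an htmlString: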

        htmlString = "<HTML><HEAD></HEAD><BODY>Here are some images.</br>1) <IMG style='MARGIN-BOTTOM: 20px; MARGIN-LEFT: 20px' align=right src='images/sample001.jpg'>2) <IMG style='MARGIN-BOTTOM: 25px; MARGIN-LEFT: 25px' align=right src='images/sample002.png'></br> And some docs as well.</br>1) href='javascript:parent.POPUP({url:'testDoc001.htm',type:'shared',width:600,height:645})'></br>2) href='javascript:parent.POPUP({url:'testDoc002.html',type:'shared',width:700,height:712})'></br></BODY></HTML>";

I run the following routine in C# (WPF):


    private static List<string> ExtractData(string htmlString)
    {
        List<string> data = new List<string>();

        //***  Get The Images ***
        string pattern = @"<img .* src='(.+\.(jpg|bmp|png))'";

        Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
        MatchCollection matches = rgx.Matches(htmlString);

        for (int i = 0, l = matches.Count; i < l; i++)
        {
            data.Add(matches[i].Value);
        }

        //***  Get Html Pages ***
        pattern = @"url:'([^']*)'";

        rgx = new Regex(pattern, RegexOptions.IgnoreCase);
        matches = rgx.Matches(htmlString);

        for (int i = 0, l = matches.Count; i < l; i++)
        {
            data.Add(matches[i].Value);
        }

        return data;
    }

The results I get are:

[0] = "<IMG style='MARGIN-BOTTOM: 20px; MARGIN-LEFT: 20px' align=right src='images/sample001.jpg'>2) <IMG style='MARGIN-BOTTOM: 25px; MARGIN-LEFT: 25px' align=right src='images/sample002.png'"

[1] = "url:'testDoc001.htm'"

[2] = "url:'testDoc002.html'"

What I actually want is:

[0] = "images/sample001.jpg"

[1] = "images/sample002.png"

[2] = "testDoc001.htm"

[3] = "testDoc002.html"

Can anyone tell me what I am doing wrong in the regular expressions?

Thanks

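1 answer:

Answer 0: (score: 1)

You are better off using the HTML Agility Pack for this kind of work. As others have mentioned, parsing HTML with regular expressions is a bad idea except in very specific cases. That said, your regex has a few problems. The first pattern should look something like this: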

<img.+?src\s*=\s*\'(.*?\.(jpg|bmp|png))'
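Two things seem to explain the output in the question: the greedy `.*` runs on to the last `src` in the string, so both `<IMG>` tags end up in a single match, and `Match.Value` returns the whole match rather than the captured group. A minimal sketch of one way to get the desired list follows. It uses the lazy pattern above, keeps the question's `url:'...'` pattern unchanged, and reads `Groups[1]` instead of `Value`. The class name `LinkExtractor` is made up for illustration, and the sketch assumes single-quoted attribute values as in the sample string.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class; the name is not from the original post.
public static class LinkExtractor
{
    // Lazy quantifiers (.+? and .*?) stop at the first src attribute
    // after each <img, so every tag produces its own match.
    private static readonly Regex ImagePattern = new Regex(
        @"<img.+?src\s*=\s*'(.*?\.(jpg|bmp|png))'",
        RegexOptions.IgnoreCase);

    // Same pattern as in the question; only the way the result is read changes.
    private static readonly Regex UrlPattern = new Regex(
        @"url:'([^']*)'",
        RegexOptions.IgnoreCase);

    public static List<string> ExtractData(string htmlString)
    {
        var data = new List<string>();

        // Groups[0] (same as Value) is the whole match;
        // Groups[1] is the first parenthesised capture, i.e. the path.
        foreach (Match m in ImagePattern.Matches(htmlString))
        {
            data.Add(m.Groups[1].Value);
        }

        foreach (Match m in UrlPattern.Matches(htmlString))
        {
            data.Add(m.Groups[1].Value);
        }

        return data;
    }
}
```

Called as `LinkExtractor.ExtractData(htmlString)` on the sample string from the question, this should return the four paths listed under "What I actually want". It remains fragile against markup it was not written for (double-quoted attributes, line breaks inside a tag, an `<img>` whose `src` has a different extension), which is the usual argument for an HTML parser.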