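Can someone define a regular expression to match the following HTML code?

Time: 2010-10-29 02:39:05

Tags: c# javascript .net asp.net regex

I'm doing some web scraping, and I'm looking for div elements that have a specific class name and markup.

This is my target: I need to extract everything inside the div whose class is s_specs_box s_box_4.

Could someone provide a regular expression in .NET terms (that is, one that can be passed directly to the Regex constructor) that matches a div like the one shown below?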

<div class=\"s_specs_box s_box_4\"><h3>Display</h3><ul><li><strong><span class='s_tooltip_anchor'>Display:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Display</b> - Phone's main display</p></span></strong><ul>\n<li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Type:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Type</b> - Refers to the type of the display. There are four major display types: Greyscale, Black&White, LCD:STN-color and LCD:TFT-color</p></span></strong><ul><li>Color</li></ul>\n</li><li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Technology:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Technology</b> - Refers to the type of the color displays. There are five major types: LCD, TFT, TFD, STN and OLED</p></span></strong><ul><li>Super AMOLED</li></ul>\n</li><li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Size:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Size</b> - Refers to the width and the height of the display</p></span></strong><ul><li><span title='Big display' class=\"s_display_rating s_size_1 s_mr_5\"><span></span></span>480 x 800 pixels</li></ul>\n</li><li class='clear clearfix'><strong>Physical Size:</strong><ul><li>4.00 inches</li></ul>\n</li><li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Colors:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Colors</b> - Shows the number of colors that the display supports</p></span></strong><ul><li>16 777 216</li></ul>\n</li><li class='clear clearfix'><strong>Touch Screen:</strong><ul>\n<li class='clear clearfix'><strong>Type:</strong><ul><li>Capacitive</li></ul>\n</li>\n</ul></li><li class='clear clearfix'><strong>Multi-touch:</strong><ul><li>Yes</li></ul>\n</li><li class='clear clearfix'><strong>Proximity Sensor:</strong><ul><li>Yes</li></ul>\n</li><li class='clear clearfix'><strong>Light sensor:</strong><ul><li>Yes</li></ul>\n</li>\n</ul></li></ul>\n</div>

Thanks in advance,

Vijay

3 answers:

Answer 0 (score: 4)

You cannot parse HTML with regular expressions.

Instead, you should use the HTML Agility Pack in C#, or jQuery in JavaScript.

For example:

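// Assumes `document` is an HtmlAgilityPack.HtmlDocument that has already been
// loaded (for example with document.LoadHtml(...)), and that System.Linq is imported.
var html = document.DocumentNode.Descendants("div")
    .First(div => div.GetAttributeValue("class", null) == "s_specs_box s_box_4")
    .InnerHtml;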

Answer 1 (score: 1)

OK, if nobody else is going to link this directly for a better explanation, I will... (although @SLaks really has helped you more than this will)

http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

Answer 2 (score: 0)

This works for the sample data you provided:

// Requires: using System; using System.Text.RegularExpressions;
string subject = "<div class=\"s_specs_box s_box_4\"><h3>Display</h3><ul><li><strong><span class='s_tooltip_anchor'>Display:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Display</b> - Phone's main display</p></span></strong><ul>\n<li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Type:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Type</b> - Refers to the type of the display. There are four major display types: Greyscale, Black&White, LCD:STN-color and LCD:TFT-color</p></span></strong><ul><li>Color</li></ul>\n</li><li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Technology:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Technology</b> - Refers to the type of the color displays. There are five major types: LCD, TFT, TFD, STN and OLED</p></span></strong><ul><li>Super AMOLED</li></ul>\n</li><li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Size:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Size</b> - Refers to the width and the height of the display</p></span></strong><ul><li><span title='Big display' class=\"s_display_rating s_size_1 s_mr_5\"><span></span></span>480 x 800 pixels</li></ul>\n</li><li class='clear clearfix'><strong>Physical Size:</strong><ul><li>4.00 inches</li></ul>\n</li><li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Colors:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Colors</b> - Shows the number of colors that the display supports</p></span></strong><ul><li>16 777 216</li></ul>\n</li><li class='clear clearfix'><strong>Touch Screen:</strong><ul>\n<li class='clear clearfix'><strong>Type:</strong><ul><li>Capacitive</li></ul>\n</li>\n</ul></li><li class='clear clearfix'><strong>Multi-touch:</strong><ul><li>Yes</li></ul>\n</li><li class='clear clearfix'><strong>Proximity Sensor:</strong><ul><li>Yes</li></ul>\n</li><li class='clear clearfix'><strong>Light sensor:</strong><ul><li>Yes</li></ul>\n</li>\n</ul></li></ul>\n</div>";
Match match = Regex.Match(subject,
    @"<div[^>]+class\s*=\s*""s_specs_box s_box_4""[^>]*>(.*?)<\s*/\s*div\s*>",
    RegexOptions.Singleline);
Console.WriteLine(match.Success);
string result = match.Groups[1].Value;
Console.WriteLine(result);

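Disclaimer 1: Don't parse HTML with regular expressions. They are especially bad at matching nested tags of the same type. For example, if your main <div> has a child <div>, my code will almost certainly not produce the result you want, because the lazy (.*?) stops at the first closing </div> it finds. That is not the only problem with parsing HTML using regex, just the first one.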
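To make the nested-div problem concrete, here is a minimal sketch that runs the same pattern against a small made-up input. The class name NestedDivDemo, the helper Capture, and the sample HTML string are hypothetical and only there for illustration; they are not part of the original data.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedDivDemo
{
    // Same pattern as in the answer above.
    private const string Pattern =
        @"<div[^>]+class\s*=\s*""s_specs_box s_box_4""[^>]*>(.*?)<\s*/\s*div\s*>";

    // Returns the text captured by group 1, or null when nothing matches.
    public static string Capture(string html)
    {
        Match match = Regex.Match(html, Pattern, RegexOptions.Singleline);
        return match.Success ? match.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Hypothetical input: the target div contains a child div.
        string nested =
            "<div class=\"s_specs_box s_box_4\"><div>inner</div><p>after</p></div>";

        // The lazy (.*?) ends at the first </div>, which closes the child,
        // so this prints "<div>inner" and the <p> element is lost.
        Console.WriteLine(Capture(nested));
    }
}
```

The match still reports success, which is what makes this failure easy to miss: the result is silently truncated rather than rejected.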

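Disclaimer 2: Don't use regular expressions to parse HTML in production code or on unknown future input. If you are only using this to batch-convert a few dozen HTML files on your hard drive, and you will verify the results by hand, that's fine. Trusting it with new, unknown input is not.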