Question

What is the correct regex construct (.NET flavor) for extracting property/value pairs from an HTML style string while ignoring HTML entities?

margin-top:0pt;margin:0;color:#000000;margin-left:0;font-size:26pt;margin-bottom:3pt;line-height:1.15;page-break-after:avoid;font-family:&quot;Arial&quot;;orphans:2;widows:2;text-align:left;margin-right:0

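Splitting on ; and then on : would be the simplest approach, but HTML entities contain semicolons, so this breaks some strings. For example, an entity can appear in the font-family style property.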

font-family:&quot;Arial&quot;;
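To make the failure concrete, here is a minimal sketch of the naive split-on-; approach. It is written in Python purely for illustration (the question targets .NET), and the names STYLE and naive_parse are hypothetical; the sample is a shortened version of the style string above.

```python
# Shortened sample of the style string from the question (assumption: abbreviated for brevity).
STYLE = "margin-top:0pt;color:#000000;font-family:&quot;Arial&quot;;orphans:2"


def naive_parse(style):
    """Split on ';' and then on the first ':' (the approach that breaks on entities)."""
    pairs = {}
    for chunk in style.split(";"):
        if not chunk.strip():
            continue  # skip empty pieces produced by ';;' or a trailing ';'
        name, _, value = chunk.partition(":")
        pairs[name.strip()] = value.strip()
    return pairs


result = naive_parse(STYLE)
print(result)
# The entity's own ';' is treated as a separator, so font-family is cut short
# and a bogus 'Arial&quot' key with an empty value appears:
# {'margin-top': '0pt', 'color': '#000000', 'font-family': '&quot',
#  'Arial&quot': '', 'orphans': '2'}
```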

The style string is isolated (no style=") and is on a single line.

Ultimately, I want the regex to group them in this arrangement:

match:( 
    group:( style-attribute-name ) 
    group:( style-attribute-value ) 
    )

I then iterate over the groups to build a dictionary (duplicate keys are replaced).

My current regex looks like this:

\s*(?<attr>[^:\s]*)\s*:\s*(?<val>[^;]*)[;]\s*

It produces incorrect matches when it hits an HTML entity.
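The mismatch can be reproduced with a short sketch. This uses Python's re module rather than .NET, so the named groups are written as (?P<name>...) instead of (?<name>...); the rest of the pattern is unchanged and uses only constructs that both engines share. The sample string is an abbreviated version of the one above, with a trailing ; added because the pattern requires one after every value.

```python
import re

# The regex from the question, with .NET named groups rewritten in Python syntax.
CURRENT = re.compile(r"\s*(?P<attr>[^:\s]*)\s*:\s*(?P<val>[^;]*)[;]\s*")

# Abbreviated sample (assumption: trailing ';' added, since the pattern requires it).
STYLE = "margin-top:0pt;color:#000000;font-family:&quot;Arial&quot;;orphans:2;"

pairs = [(m.group("attr"), m.group("val")) for m in CURRENT.finditer(STYLE)]
for attr, val in pairs:
    print(repr(attr), "->", repr(val))
# 'margin-top' -> '0pt'
# 'color' -> '#000000'
# 'font-family' -> '&quot'
# 'Arial&quot;;orphans' -> '2'
```

The value stops at the semicolon inside the first &quot;, and the leftover text is then absorbed into the next property name, because the attr class excludes only : and whitespace.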

Answer 1

I updated your regex to use balancing groups so that it skips a ; that is preceded by an &.

Here is the regex:
(?<attr>[^:\s]*)\s*:\s*(?<val>(?:[^;&]*(?<html>&)?[^;&]*(?(html);(?<-html>)))+)(?:;|$)


Note: for the most part, I just replaced [^;]* in the val group of your regex with (?:[^;&]*(?<html>&)?[^;&]*(?(html);(?<-html>)))+.
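For readers who want to experiment outside .NET, here is a rough Python sketch of the same idea. Python's re module has no balancing groups, so this is not the pattern above: it swaps in a simpler alternation that consumes a whole entity (&name; or &#123;) as one unit before falling back to any non-; character. The names STYLE_PAIR and parse_style are hypothetical, and the entity pattern is an assumption about what counts as an entity. Like any regex approach, it will misread a literal & followed later by ; within the same value.

```python
import re

# Assumption: an entity looks like &name; or &#digits; or &#xhex;.
STYLE_PAIR = re.compile(
    r"\s*(?P<attr>[^:;\s]+)\s*:\s*"      # property name, up to the colon
    r"(?P<val>(?:&#?\w+;|[^;])*)"        # value: whole entities, or any char except ';'
    r"(?:;|$)"                           # pair ends at ';' or at end of string
)


def parse_style(style):
    """Build a dict of property -> raw value; later duplicates replace earlier ones."""
    result = {}
    for m in STYLE_PAIR.finditer(style):
        result[m.group("attr")] = m.group("val").strip()
    return result


styles = parse_style("margin:0;font-family:&quot;Arial&quot;;margin:1pt;orphans:2")
print(styles)
# {'margin': '1pt', 'font-family': '&quot;Arial&quot;', 'orphans': '2'}
```

The values are left undecoded here; if decoded text is wanted, html.unescape from the standard library can be applied to each value afterwards.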

Answer 2

http://www.regextester.com https://www.mikesdotnetting.com/article/46/c-regular-expressions-cheat-sheet

These helped me when I was messing around with regex at school. I'm not near my computer right now, so I can't easily write it for you :/

Hope it helps!

How to parse an HTML STYLE attribute with Regex?
