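How can I allow only a specific set of HTML tags and a specific set of attributes using a generic regular expression?
Allowed HTML tags:
p|body|b|u|em|strong|ul|ol|li|h1|h2|h3|h4|h5|h6|hr|a|br|img|tr|td|table|tbody|label|div|sup|sub|caption
Allowed HTML attributes:
alt|href|tcmuri|title|height|width|align|valign|rowspan|colspan|src|summary|class|id|name|title|target|nowrap|scope|axis|cellpadding|cellspacing|dir|lang|rel
To test the regular expressions I am using the RegExr website. These are the two patterns I have so far, the first for attributes and the second for tags: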
((alt|href|tcmuri|title|height|width|align|valign|rowspan|colspan|src|summary|class|id|name|title|target|nowrap|scope|axis|cellpadding|cellspacing|dir|lang|rel)\s*=\s*["|']?[/.?=&#\w\s:;-]+["|']?)
<(?>/?)(?:[^p|body|b|u|em|strong|ul|ol|li|h1|h2|h3|h4|h5|h6|hr|a|br|img|tr|td|table|tbody|label|div|sup|sub|caption|P]|[p|cufontext|cufoncanvas|P][^\s>/])[^>]*>
I tried merging them into something like this, but it does not filter correctly:
<(?>/?)(?:[^p|body|b|u|em|strong|ul|ol|li|h1|h2|h3|h4|h5|h6|hr|a|br|img|tr|td|table|tbody|label|div|sup|sub|caption|P]|[p|cufontext|cufoncanvas|P]|((alt|href|tcmuri|title|height|width|align|valign|rowspan|colspan|src|summary|class|id|name|title|target|nowrap|scope|axis|cellpadding|cellspacing|dir|lang|rel)\s*=\s*["|']?[/.?=&#\w\s:;-]+["|']?)[^\s>/])[^>]*>
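My goal is to allow only this set of attributes and HTML tags.
All other tags and attributes should be removed, leaving their content in place.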
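One likely reason the patterns above misfire: inside square brackets `|` is not alternation, so `[^p|body|...]` is a character class that excludes single characters rather than whole tag names. The Python sketch below illustrates the difference. It is not part of the original post; the two-name tag list and the helper name `rejected` are made up for the demonstration.

```python
import re

# Inside [...] every character is a literal set member, so "[^p|body]" means
# "one character that is not p, |, b, o, d or y", not "a name other than p or body".
char_class = re.compile(r"<[^p|body]")

# A negative lookahead with real alternation tests the whole tag name instead.
lookahead = re.compile(r"<(?!/?(?:p|body)[\s>/])")


def rejected(pattern, tag):
    """Return True when the pattern flags the tag as not allowed."""
    return pattern.match(tag) is not None


# Print how each pattern classifies a few sample tags.
for tag in ("<p>", "<pre>", "<div>", "<span>", "</body>"):
    print(tag, rejected(char_class, tag), rejected(lookahead, tag))
```

With the character class, `<pre>` and `<div>` slip through because their first letter happens to be in the set; the lookahead version compares complete names.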
Input HTML:
<h2 class="callout" cufid="2"><cufon style="width: 88px; height: 18px" class="cufon cufon-vml" alt="Lorem "><cufoncanvas style="height: 29px; top: -5px; left: -2px"><cvml:shape style="width: 107px; height: 29px" path=" m39,-257 l75,-257,75,0,39,0,39,-257 x e m-41,-394 l2097,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-41,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m61,-157 c67,-174,93,-193,115,-192,142,-192,167,-170,166,-142 l166,0,131,0,131,-137 c134,-180,68,-170,61,-142 l61,0,27,0,26,-189,61,-189,61,-157 x e m-144,-394 l1994,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-144,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m68,-172 l68,0,33,0,33,-172,3,-172 c32,-188,54,-208,68,-232 l68,-189,108,-189,108,-168 c100,-173,82,-172,68,-172 x e m-326,-394 l1812,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-326,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m63,-154 c69,-182,103,-204,127,-183 l127,-152 c107,-185,64,-157,63,-128 l63,0,29,0,29,-189,63,-189,63,-154 x e m-427,-394 l1711,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-427,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m11,-94 c11,-144,47,-192,94,-192,146,-192,177,-149,177,-94,177,-44,141,2,94,2,41,1,11,-39,11,-94 x m93,-178 c29,-172,34,-21,93,-14,155,-20,157,-172,93,-178 x e m-548,-394 l1590,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-548,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m15,-86 c15,-142,46,-192,95,-192,114,-192,127,-185,133,-174 l133,-257,168,-257,168,-34 c154,-10,130,2,95,2,52,2,15,-42,15,-86 x m134,-153 c128,-167,117,-177,98,-178,68,-178,54,-147,54,-86,54,-24,94,2,133,-24 x e m-728,-394 l1410,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-728,-394" coordsize="2138,577"></cvml:shape><cvml:shape 
style="width: 107px; height: 29px" path=" m62,-51 c61,-9,125,-16,132,-47 l132,-189,166,-189,166,0,132,0,132,-28 c125,-10,106,1,81,2,-2,5,36,-116,28,-189 l62,-189,62,-51 x e m-909,-394 l1229,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-909,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m150,-6 c86,20,16,-20,16,-89,16,-163,78,-215,150,-182 l150,-158 c112,-211,48,-154,55,-94,49,-36,110,12,149,-31 x e m-1093,-394 l1045,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-1093,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m67,0 l32,0,32,-190,67,-190,67,0 x m69,-241 c69,-229,60,-221,49,-221,38,-221,29,-230,29,-241,29,-252,38,-261,49,-261,60,-261,69,-253,69,-241 x e m-1248,-394 l890,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-1248,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m61,-157 c67,-174,93,-193,115,-192,142,-192,167,-170,166,-142 l166,0,131,0,131,-137 c134,-180,68,-170,61,-142 l61,0,27,0,26,-189,61,-189,61,-157 x e m-1336,-394 l802,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-1336,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m15,-88 c6,-164,80,-221,134,-175 l134,-189,168,-189 c159,-82,207,86,87,80,64,79,45,75,31,66 l31,39 c68,87,150,59,134,-18,94,37,6,-25,15,-88 x m96,-178 c35,-178,36,-9,106,-14,119,-15,128,-21,133,-31 l133,-156 c121,-171,108,-178,96,-178 x e m-1518,-394 l620,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-1518,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m-1693,-394 l445,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-1693,-394" coordsize="2138,577"></cvml:shape></cufoncanvas><cufontext>Lorem </cufontext></cufon><cufon style="width: 45px; height: 18px" class="cufon cufon-vml" alt="Lorem "><cufoncanvas style="height: 29px; top: -5px; left: -2px"><cvml:shape 
style="width: 65px; height: 29px" path=" m60,-58 c71,-14,125,3,152,-36 l151,-13 c94,27,15,-19,15,-91,15,-143,44,-192,94,-192,124,-192,152,-174,152,-147,152,-99,82,-86,60,-58 x m120,-149 c121,-167,109,-179,94,-179,62,-179,47,-115,55,-75,78,-95,120,-108,120,-149 x e m-41,-394 l1256,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-41,-394" coordsize="1297,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m44,-175 c74,-204,147,-194,147,-143 l147,0,112,0,112,-23 c96,12,7,13,14,-45,18,-86,44,-97,87,-114,126,-130,124,-173,84,-175,66,-175,53,-166,44,-149 l44,-175 x m112,-116 c94,-97,41,-84,47,-43,52,-8,96,-9,112,-31 l112,-116 x e m-201,-394 l1096,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-201,-394" coordsize="1297,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m13,-36 c28,-4,92,-11,88,-53,84,-91,9,-100,12,-147,15,-193,77,-204,110,-176 l110,-152 c100,-176,50,-189,45,-156,50,-111,121,-110,121,-59,121,-6,50,20,13,-12 l13,-36 x e m-360,-394 l937,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-360,-394" coordsize="1297,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m67,0 l32,0,32,-190,67,-190,67,0 x m69,-241 c69,-229,60,-221,49,-221,38,-221,29,-230,29,-241,29,-252,38,-261,49,-261,60,-261,69,-253,69,-241 x e m-483,-394 l814,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-483,-394" coordsize="1297,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m60,-58 c71,-14,125,3,152,-36 l151,-13 c94,27,15,-19,15,-91,15,-143,44,-192,94,-192,124,-192,152,-174,152,-147,152,-99,82,-86,60,-58 x m120,-149 c121,-167,109,-179,94,-179,62,-179,47,-115,55,-75,78,-95,120,-108,120,-149 x e m-571,-394 l726,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-571,-394" coordsize="1297,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m63,-154 c69,-182,103,-204,127,-183 l127,-152 c107,-185,64,-157,63,-128 l63,0,29,0,29,-189,63,-189,63,-154 x e m-731,-394 
l566,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-731,-394" coordsize="1297,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m-852,-394 l445,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-852,-394" coordsize="1297,577"></cvml:shape></cufoncanvas><cufontext>Lorem </cufontext></cufon><cufon style="width: 45px; height: 18px" class="cufon cufon-vml" alt="Lorem "><cufoncanvas style="height: 29px; top: -5px; left: -2px"><cvml:shape style="width: 65px; height: 29px" path=" m85,-32 c85,-74,123,-134,113,-189 l148,-189,187,-33 c191,-84,214,-142,226,-189 l254,-189 c238,-128,211,-64,202,0 l160,0,131,-118 c121,-77,104,-37,100,0 l59,0,7,-189,42,-189 x e m-41,-394 l1251,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-41,-394" coordsize="1292,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m11,-94 c11,-144,47,-192,94,-192,146,-192,177,-149,177,-94,177,-44,141,2,94,2,41,1,11,-39,11,-94 x m93,-178 c29,-172,34,-21,93,-14,155,-20,157,-172,93,-178 x e m-292,-394 l1000,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-292,-394" coordsize="1292,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m63,-154 c69,-182,103,-204,127,-183 l127,-152 c107,-185,64,-157,63,-128 l63,0,29,0,29,-189,63,-189,63,-154 x e m-472,-394 l820,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-472,-394" coordsize="1292,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m60,0 l27,0,27,-257,60,-257,60,0 x e m-593,-394 l699,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-593,-394" coordsize="1292,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m15,-86 c15,-142,46,-192,95,-192,114,-192,127,-185,133,-174 l133,-257,168,-257,168,-34 c154,-10,130,2,95,2,52,2,15,-42,15,-86 x m134,-153 c128,-167,117,-177,98,-178,68,-178,54,-147,54,-86,54,-24,94,2,133,-24 x e m-666,-394 l626,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-666,-394" 
coordsize="1292,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m-847,-394 l445,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-847,-394" coordsize="1292,577"></cvml:shape></cufoncanvas><cufontext>Lorem </cufontext></cufon><cufon style="width: 44px; height: 18px" class="cufon cufon-vml" alt="Lorem "><cufoncanvas style="height: 29px; top: -5px; left: -2px"><cvml:shape style="width: 64px; height: 29px" path=" m68,-172 l68,0,33,0,33,-172,3,-172 c32,-188,54,-208,68,-232 l68,-189,108,-189,108,-168 c100,-173,82,-172,68,-172 x e m-41,-394 l1226,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-41,-394" coordsize="1267,577"></cvml:shape><cvml:shape style="width: 64px; height: 29px" path=" m63,-154 c69,-182,103,-204,127,-183 l127,-152 c107,-185,64,-157,63,-128 l63,0,29,0,29,-189,63,-189,63,-154 x e m-142,-394 l1125,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-142,-394" coordsize="1267,577"></cvml:shape><cvml:shape style="width: 64px; height: 29px" path=" m44,-175 c74,-204,147,-194,147,-143 l147,0,112,0,112,-23 c96,12,7,13,14,-45,18,-86,44,-97,87,-114,126,-130,124,-173,84,-175,66,-175,53,-166,44,-149 l44,-175 x m112,-116 c94,-97,41,-84,47,-43,52,-8,96,-9,112,-31 l112,-116 x e m-263,-394 l1004,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-263,-394" coordsize="1267,577"></cvml:shape><cvml:shape style="width: 64px; height: 29px" path=" m95,-27 c108,-97,122,-119,145,-189 l174,-189 c155,-128,125,-66,113,0 l70,0,7,-189,41,-189 x e m-422,-394 l845,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-422,-394" coordsize="1267,577"></cvml:shape><cvml:shape style="width: 64px; height: 29px" path=" m60,-58 c71,-14,125,3,152,-36 l151,-13 c94,27,15,-19,15,-91,15,-143,44,-192,94,-192,124,-192,152,-174,152,-147,152,-99,82,-86,60,-58 x m120,-149 c121,-167,109,-179,94,-179,62,-179,47,-115,55,-75,78,-95,120,-108,120,-149 x e m-589,-394 l678,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-589,-394" 
coordsize="1267,577"></cvml:shape><cvml:shape style="width: 64px; height: 29px" path=" m60,0 l27,0,27,-257,60,-257,60,0 x e m-749,-394 l518,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-749,-394" coordsize="1267,577"></cvml:shape><cvml:shape style="width: 64px; height: 29px" path=" m-822,-394 l445,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-822,-394" coordsize="1267,577"></cvml:shape></cufoncanvas><cufontext>Lorem </cufontext></cufon><cufon style="width: 36px; height: 18px" class="cufon cufon-vml" alt="Lorem "><cufoncanvas style="height: 29px; top: -5px; left: -2px"><cvml:shape style="width: 55px; height: 29px" path=" m85,-32 c85,-74,123,-134,113,-189 l148,-189,187,-33 c191,-84,214,-142,226,-189 l254,-189 c238,-128,211,-64,202,0 l160,0,131,-118 c121,-77,104,-37,100,0 l59,0,7,-189,42,-189 x e m-41,-394 l1063,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-41,-394" coordsize="1104,577"></cvml:shape><cvml:shape style="width: 55px; height: 29px" path=" m67,0 l32,0,32,-190,67,-190,67,0 x m69,-241 c69,-229,60,-221,49,-221,38,-221,29,-230,29,-241,29,-252,38,-261,49,-261,60,-261,69,-253,69,-241 x e m-292,-394 l812,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-292,-394" coordsize="1104,577"></cvml:shape><cvml:shape style="width: 55px; height: 29px" path=" m68,-172 l68,0,33,0,33,-172,3,-172 c32,-188,54,-208,68,-232 l68,-189,108,-189,108,-168 c100,-173,82,-172,68,-172 x e m-376,-394 l728,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-376,-394" coordsize="1104,577"></cvml:shape><cvml:shape style="width: 55px; height: 29px" path=" m61,-160 c90,-207,171,-202,171,-141 l171,2,136,2,136,-140 c133,-179,79,-176,61,-145 l61,2,26,2,26,-257,61,-257,61,-160 x e m-477,-394 l627,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-477,-394" coordsize="1104,577"></cvml:shape><cvml:shape style="width: 55px; height: 29px" path=" m-659,-394 l445,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-659,-394" 
coordsize="1104,577"></cvml:shape></cufoncanvas><cufontext>Lorem </cufontext></cufon><cufon style="width: 60px; height: 18px" class="cufon cufon-vml" alt="Lorem ipsum"><cufoncanvas style="height: 29px; top: -5px; left: -2px"><cvml:shape style="width: 78px; height: 29px" path=" m185,-35 c185,-12,172,0,146,0 l27,0,27,-257 c84,-255,173,-268,179,-221,169,-240,103,-239,63,-238 l63,-163 c102,-164,138,-164,148,-136,141,-144,90,-143,63,-143 l63,-20 c103,-22,170,-12,185,-35 x e m-41,-394 l1514,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-41,-394" coordsize="1555,577"></cvml:shape><cvml:shape style="width: 78px; height: 29px" path=" m160,-158 c176,-210,259,-200,259,-141 l259,0,225,0,225,-138 c224,-175,174,-179,161,-142 l161,0,126,0,126,-139 c127,-180,62,-176,62,-141 l62,0,27,0,27,-189,62,-189,62,-158 c73,-198,152,-205,160,-158 x e m-217,-394 l1338,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-217,-394" coordsize="1555,577"></cvml:shape><cvml:shape style="width: 78px; height: 29px" path=" m67,0 l32,0,32,-190,67,-190,67,0 x m69,-241 c69,-229,60,-221,49,-221,38,-221,29,-230,29,-241,29,-252,38,-261,49,-261,60,-261,69,-253,69,-241 x e m-492,-394 l1063,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-492,-394" coordsize="1555,577"></cvml:shape><cvml:shape style="width: 78px; height: 29px" path=" m63,-154 c69,-182,103,-204,127,-183 l127,-152 c107,-185,64,-157,63,-128 l63,0,29,0,29,-189,63,-189,63,-154 x e m-569,-394 l986,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-569,-394" coordsize="1555,577"></cvml:shape><cvml:shape style="width: 78px; height: 29px" path=" m44,-175 c74,-204,147,-194,147,-143 l147,0,112,0,112,-23 c96,12,7,13,14,-45,18,-86,44,-97,87,-114,126,-130,124,-173,84,-175,66,-175,53,-166,44,-149 l44,-175 x m112,-116 c94,-97,41,-84,47,-43,52,-8,96,-9,112,-31 l112,-116 x e m-690,-394 l865,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-690,-394" coordsize="1555,577"></cvml:shape><cvml:shape style="width: 78px; height: 29px" 
path=" m68,-172 l68,0,33,0,33,-172,3,-172 c32,-188,54,-208,68,-232 l68,-189,108,-189,108,-168 c100,-173,82,-172,68,-172 x e m-849,-394 l706,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-849,-394" coordsize="1555,577"></cvml:shape><cvml:shape style="width: 78px; height: 29px" path=" m60,-58 c71,-14,125,3,152,-36 l151,-13 c94,27,15,-19,15,-91,15,-143,44,-192,94,-192,124,-192,152,-174,152,-147,152,-99,82,-86,60,-58 x m120,-149 c121,-167,109,-179,94,-179,62,-179,47,-115,55,-75,78,-95,120,-108,120,-149 x e m-950,-394 l605,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-950,-394" coordsize="1555,577"></cvml:shape><cvml:shape style="width: 78px; height: 29px" path=" m13,-36 c28,-4,92,-11,88,-53,84,-91,9,-100,12,-147,15,-193,77,-204,110,-176 l110,-152 c100,-176,50,-189,45,-156,50,-111,121,-110,121,-59,121,-6,50,20,13,-12 l13,-36 x e m-1110,-394 l445,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-1110,-394" coordsize="1555,577"></cvml:shape></cufoncanvas><cufontext>Lorem ipsum</cufontext><cvml:shape coordsize="1000,1000"></cvml:shape></cufon></h2>
<div class="contentContainer">
<p>Lorem ipsum dolor sit amet, Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet</p>
<p>Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet</p>
</div>
Expected filtered output:
<h2>Lorem Lorem Lorem Lorem Lorem Lorem ipsum</h2>
<div>
<p>Lorem ipsum dolor sit amet, Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet</p>
<p>Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet</p>
</div>
One more example, to make it clearer:
Input:
<abc id="test">new tag and known attribute</abc>
<a id="test" href="http://www.google.com/" xyz="testattr">known tag, attribute and one unknown attr</a>
Output:
<a id="test" href="http://www.google.com/">known tag, attribute and one unknown attr</a>
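Thanks for your help.
Answer 0 (score: 2)
Here is a Perl solution using a PCRE-compatible regular expression. It does not know about comments, doctype, CDATA, etc.; those should be added for a more complete solution.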
# allowed tag and attribute names
my $allowed_tags_open = 'p|body|b|u|em|strong|ul|ol|li|h1|h2|h3|h4|h5|h6|a|tr|td|table|tbody|label|div|sup|sub|caption';
my $allowed_tags_self_closing = 'img|br|hr';
my $allowed_attributes = 'alt|href|tcmuri|title|height|width|align|valign|rowspan|colspan|src|summary|class|id|name|title|target|nowrap|scope|axis|cellpadding|cellspacing|dir|lang|rel';
$allowed_attributes .= '|style'; # for testing
# definitions for matching allowed tag and attribute names
my $re_tags = qr~(?(DEFINE)
    (?<tags_open>
        /?+
        (?>
            (?: $allowed_tags_open )
            (?! [^\s>/] ) # from (?&tagname)
        )
    )
    (?<tags_self_closing>
        (?>
            (?: $allowed_tags_self_closing )
            (?! [^\s>/] ) # from (?&tagname)
        )
    )
    (?<tags> (?> (?&tags_open) | (?&tags_self_closing) ) )
    (?<attribs>
        (?>
            (?: $allowed_attributes )
            (?! [^\s=/>] ) # from (?&attname)
        )
    )
)~xi;
# definitions for matching the tags
# trying to follow compatible tokenization characteristics of modern browsers
my $re_defs = qr~(?(DEFINE)
    (?<tagname> [a-z/][^\s>/]*+ ) # will match the leading / in closing tags
    (?<attname> [^\s>/][^\s=/>]*+ ) # first char can be pretty much anything, including =
    (?<attval> (?>
            "[^"]*+" |
            \'[^\']*+\' |
            [^\s>]*+ # unquoted values can contain quotes, = and /
        )
    )
    (?<attrib> (?&attname)
        (?: \s*+
            = \s*+
            (?&attval)
        )?+
    )
    (?<crap> (?!/>)[^\s>] ) # most crap inside tag is ignored, but don't eat the last / in self closing tags
    (?<tag> <(?&tagname)
        (?: \s*+ # spaces between attributes not required: <b/foo=">"style=color:red>bold red text</b>
            (?>
                (?&attrib) | # order matters
                (?&crap) # if not an attribute, eat the crap
            )
        )*+
        \s*+ /?+
        >
    )
)~xi;
sub sanitize_html{
    my $str = shift;
    $str =~ s/(?&tag) $re_defs/ sanitize_tag($&) /gexo;
    return $str;
}
sub sanitize_tag{
    my $tag = shift;
    my ($name, $attr, $end) =
        $tag =~ /^ < ((?&tags)) (.*?) ( \/?+ > ) $ $re_tags/xo
        or return ''; # return empty string if not allowed tag
    # return a new clean closing tag if it's a closing tag
    return "<$name>" if substr($name, 0, 1) eq '/';
    # clean attributes
    return "<$name" . sanitize_attributes($attr) . $end;
}
sub sanitize_attributes{
    my $attr = shift;
    my $new = '';
    $attr =~ s{
        \G
        \s*+ # spaces between attributes not required
        (?>
            ( (?&attrib) ) | # order matters
            (?&crap) # if not an attribute, eat the crap
        )
        $re_defs
    }{
        my $att = $1;
        $new .= " $att" if $att && $att =~ /^(?&attribs) $re_tags/xo;
        '';
    }gexo;
    return $new;
}
my $test = <<'_TEST_';
<b>simple</b>
self <img>closing</img>
<abc id="test">new tag and known attribute</abc>
<a id="test" xyz="testattr" href="/foo">one unknown attr</a>
<a id="foo">attr in closing tag</a id="foo">
<b/#ñ%&/()!¢º`=">="">crap be gone</b> not bold<br/x"/>
<b/style=color:red;background:url("x.gif");/*="still.CSS*/ id="x"zz"<script class="x">tricky</b/ x=">"//> not bold
_TEST_
print $test, "\n";
print '-' x 70, "\n";
print sanitize_html $test;
Output:
<b>simple</b>
self <img>closing</img>
<abc id="test">new tag and known attribute</abc>
<a id="test" xyz="testattr" href="/foo">one unknown attr</a>
<a id="foo">attr in closing tag</a id="foo">
<b/#ñ%&/()!¢º`=">="">crap be gone</b> not bold<br/x"/>
<b/style=color:red;background:url("x.gif");/*="still.CSS*/ id="x"zz"<script class="x">tricky</b/ x=">"//> not bold
----------------------------------------------------------------------
<b>simple</b>
self <img>closing
new tag and known attribute
<a id="test" href="/foo">one unknown attr</a>
<a id="foo">attr in closing tag</a>
<b>crap be gone</b> not bold<br/>
<b style=color:red;background:url("x.gif");/*="still.CSS*/ id="x" class="x">tricky</b> not bold
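For readers who do not use Perl, the same two-stage idea (match each whole tag first, then filter its attributes in a callback) can be sketched in Python. This is a simplified illustration under stated assumptions, not a port of the code above: the names `TAG_RE`, `ATTR_RE`, `sanitize_tag` and `sanitize_html` are invented here, the tokenization is much cruder than the Perl definitions, and a tag with an unterminated quote is left untouched, so it should not be treated as a security-grade sanitizer.

```python
import re

# Allowed names copied from the question.
ALLOWED_TAGS = set(
    "p|body|b|u|em|strong|ul|ol|li|h1|h2|h3|h4|h5|h6|hr|a|br|img|tr|td|table"
    "|tbody|label|div|sup|sub|caption".split("|")
)
ALLOWED_ATTRS = set(
    "alt|href|tcmuri|title|height|width|align|valign|rowspan|colspan|src"
    "|summary|class|id|name|target|nowrap|scope|axis|cellpadding|cellspacing"
    "|dir|lang|rel".split("|")
)

# Stage 1: one whole tag. Quoted attribute values may contain ">".
TAG_RE = re.compile(r"""<(/?)([a-zA-Z][^\s>/]*)((?:"[^"]*"|'[^']*'|[^>"'])*)>""")
# Stage 2: one attribute, with an optional quoted or unquoted value.
ATTR_RE = re.compile(r"""([^\s=/>]+)(?:\s*=\s*("[^"]*"|'[^']*'|[^\s>]*))?""")


def sanitize_tag(match):
    """Rebuild an allowed tag with only allowed attributes, or drop the tag."""
    closing, name, rest = match.groups()
    if name.lower() not in ALLOWED_TAGS:
        return ""  # unknown tag: remove it, the surrounding text stays
    if closing:
        return f"</{name}>"  # closing tags never keep attributes
    kept = [
        attr.group(0)
        for attr in ATTR_RE.finditer(rest)
        if attr.group(1).lower() in ALLOWED_ATTRS
    ]
    end = "/>" if rest.rstrip().endswith("/") else ">"
    return "<" + " ".join([name] + kept) + end


def sanitize_html(html):
    """Apply sanitize_tag to every tag found in the input string."""
    return TAG_RE.sub(sanitize_tag, html)


print(sanitize_html('<abc id="test">new tag and known attribute</abc>'))
print(sanitize_html('<a id="test" xyz="testattr" href="/foo">one unknown attr</a>'))
```

Keeping tag matching and attribute filtering as separate steps means each regex stays small, and the allow lists live in plain sets rather than inside the patterns.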
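See how your browser parses the tricky markup: jsFiddle
Possibly related:
Answer 1 (score: 1)
This looks very similar to something I posted a while ago:
Answer 2 (score: 0)
You can't parse HTML with regex (which is why that is one of the most popular posts on Stack Overflow).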
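If a real parser is an option, the same allow-list filtering can be done without matching tags by regex at all. The sketch below uses `html.parser.HTMLParser` from the Python standard library; the class name `WhitelistFilter` and the function `sanitize` are hypothetical names chosen for this example, and the allow lists are copied from the question. Comments and doctype declarations are simply dropped because no handler is defined for them.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = set(
    "p|body|b|u|em|strong|ul|ol|li|h1|h2|h3|h4|h5|h6|hr|a|br|img|tr|td|table"
    "|tbody|label|div|sup|sub|caption".split("|")
)
ALLOWED_ATTRS = set(
    "alt|href|tcmuri|title|height|width|align|valign|rowspan|colspan|src"
    "|summary|class|id|name|target|nowrap|scope|axis|cellpadding|cellspacing"
    "|dir|lang|rel".split("|")
)


class WhitelistFilter(HTMLParser):
    """Re-emit only allowed tags and attributes; text is passed through."""

    def __init__(self):
        # Keep entity references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _attrs(self, attrs):
        parts = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS:
                continue
            if value is None:  # attribute without a value, e.g. nowrap
                parts.append(f" {name}")
            else:
                parts.append(f' {name}="{escape(value, quote=True)}"')
        return "".join(parts)

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.out.append(f"<{tag}{self._attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.out.append(f"<{tag}{self._attrs(attrs)}/>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def sanitize(html):
    parser = WhitelistFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(sanitize('<a id="test" href="http://www.google.com/" xyz="testattr">known tag</a>'))
```

Note that the text inside a removed element is kept, which is what the question asks for; if elements such as `script` should lose their content as well, that would need extra state in the handlers.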
Answer 3 (score: -1)
In the end, I did this in two steps:
//Allowed list of HTML Tags
<(?!/?(p|body|b|u|em|strong|ul|ol|li|h1|h2|h3|h4|h5|h6|hr|a|br|img|tr|td|table|tbody|label|div|sup|sub|caption)(>|\s))[^<]+?>
//Allowed list of HTML Attributes
\s(?!(alt|href|tcmuri|title|height|width|align|valign|rowspan|colspan|src|summary|class|id|name|title|target|nowrap|scope|axis|cellpadding|cellspacing|dir|lang|rel))\w+(\s*=\s*["|']?[/.,#?\w\s:;-]+["|']?)
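Using the two regexes above, I filtered the whole HTML.
I have now reduced this to a single regex that matches every tag and attribute outside the allowed lists, so removing its matches leaves only the required HTML tags and attributes: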
(<(?!/?(p|body|b|u|em|strong|ul|ol|li|h1|h2|h3|h4|h5|h6|hr|a|br|img|tr|td|table|tbody|label|div|sup|sub|caption)(>|\s))[^<]+?>)|(\s(?!(alt|href|tcmuri|title|height|width|align|valign|rowspan|colspan|src|summary|class|id|name|title|target|nowrap|scope|axis|cellpadding|cellspacing|dir|lang|rel)\b)[\w:]+(\s*=\s*["|']?[/.,#?\w\s:;-]+["|']?))
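As a quick check, the combined pattern can be applied with a replace-by-empty-string call. The Python sketch below assembles the same regex from the two name lists and runs it on the second example from the question; the variable names `TAGS`, `ATTRS`, `PATTERN` and the function `strip_disallowed` are invented for this illustration.

```python
import re

TAGS = (
    "p|body|b|u|em|strong|ul|ol|li|h1|h2|h3|h4|h5|h6|hr|a|br|img|tr|td|table"
    "|tbody|label|div|sup|sub|caption"
)
ATTRS = (
    "alt|href|tcmuri|title|height|width|align|valign|rowspan|colspan|src"
    "|summary|class|id|name|title|target|nowrap|scope|axis|cellpadding"
    "|cellspacing|dir|lang|rel"
)

# Same pattern as the single regex above, assembled from the two lists.
PATTERN = re.compile(
    r"(<(?!/?(" + TAGS + r")(>|\s))[^<]+?>)"
    r"|"
    r"(\s(?!(" + ATTRS + r")\b)[\w:]+(\s*=\s*[\"|']?[/.,#?\w\s:;-]+[\"|']?))"
)


def strip_disallowed(html):
    """Delete every match, i.e. every tag or attribute not on the lists."""
    return PATTERN.sub("", html)


sample = (
    '<abc id="test">new tag and known attribute</abc>\n'
    '<a id="test" href="http://www.google.com/" xyz="testattr">'
    "known tag, attribute and one unknown attr</a>"
)
print(strip_disallowed(sample))
```

Two limits are worth knowing before relying on it. The attribute branch only covers values made of the characters in its class, so an unknown attribute whose value contains, say, `=` is only partly removed. It also runs over text outside tags, so plain text of the form ` word = value` would be deleted too.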