Question

I need a regular expression that removes all form-field tags. For example, if the HTML text contains:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Title appears in the browser's title bar...</title>    
<style type="text/css">
body {background-color:ffffff;background-image:url(http://);background-repeat:no-repeat;background-position:top left;background-attachment:fixed;}
h1{font-family:Cursive;color:000000;}
 p {font-family:Cursive;font-size:14px;font-style:normal;font-weight:normal;color:000000;}    
</style>    
</head>
<body>
<form name="fr">
<input name="ss" id="sss" value="as1">
</form>
<h1>Heading goes here...</h1>
<p>Enter your paragraph text here...</p>
</html>

I need to remove all input tags to get:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Title appears in the browser's title bar...</title>    
<style type="text/css">
body {background-color:ffffff;background-image:url(http://);background-repeat:no-repeat;background-position:top left;background-attachment:fixed;}
h1{font-family:Cursive;color:000000;}
 p {font-family:Cursive;font-size:14px;font-style:normal;font-weight:normal;color:000000;}    
</style>    
</head>
<body>
<form name="fr">
</form>
<h1>Heading goes here...</h1>
<p>Enter your paragraph text here...</p>
</html>

Answer 1

Regular expressions cannot handle context-free grammars, so they cannot be used to process arbitrary HTML.

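You can use one to remove certain simple tags, i.e. tags that have no child tags. However, a regular expression fails quickly once it meets HTML that contains nested tags.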
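A small sketch of that failure mode, using a made-up `<div>` example rather than anything from the question: a naive "open tag, shortest content, close tag" pattern works on flat markup but stops at the first closing tag it sees when the element is nested.

```javascript
// Naive pattern: an opening <div>, the shortest possible content, a closing </div>.
// The pattern and both inputs are illustrative assumptions, not from the question.
const naiveDivPattern = /<div>[\s\S]*?<\/div>/g;

const flat = "<p>a</p><div>x</div><p>b</p>";
const nested = "<p>a</p><div>outer<div>inner</div>tail</div><p>b</p>";

// Flat markup: the whole element is removed.
const flatResult = flat.replace(naiveDivPattern, "");

// Nested markup: the match ends at the inner </div>, so part of the outer
// element's content and its closing tag are left behind.
const nestedResult = nested.replace(naiveDivPattern, "");

console.log(flatResult);   // <p>a</p><p>b</p>
console.log(nestedResult); // <p>a</p>tail</div><p>b</p>
```

The regular expression has no way to count how many elements are currently open, which is why the nested case leaves a dangling `</div>`.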

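Two of the three tags you identified (input, select, textarea) normally have no nested tags, and select should have only one level of child tags. Even so, you can never guarantee that you won't run into malformed HTML that has tags nested inside them.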

The short answer is: don't use a regular expression for this task unless you are completely sure the input is well-formed.

For well-formed input (which also means the tags must not contain "<" or ">" characters inside quotes):

<input(\s+[^>]*)?>|
<textarea(\s+[^>]*)?>.*?</textarea(\s+[^>]*)?>|
<select(\s+[^>]*)?>(<option(\s+[^>]*)?>.*?</option(\s+[^>]*)?>)*</select(\s+[^>]*)?>
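As a rough sketch, the pattern above could be applied in JavaScript like this. The helper name `stripFormFields`, the sample markup, the `gi` flags, the use of `[\s\S]` in place of `.` (so that contents may span several lines), and the optional whitespace around `<option>` elements are all assumptions added for illustration; they are not part of the answer's pattern.

```javascript
// Sketch only: the answer's three alternatives joined into one RegExp.
// Assumptions (not in the original): [\s\S] instead of "." so contents may
// span lines, optional whitespace around <option> elements, and the "gi" flags.
const attrs = "(\\s+[^>]*)?";
const formFieldPattern = new RegExp(
  [
    `<input${attrs}>`,
    `<textarea${attrs}>[\\s\\S]*?</textarea${attrs}>`,
    `<select${attrs}>(\\s*<option${attrs}>[\\s\\S]*?</option${attrs}>)*\\s*</select${attrs}>`,
  ].join("|"),
  "gi"
);

// Hypothetical helper: removes every match from the given HTML string.
function stripFormFields(html) {
  return html.replace(formFieldPattern, "");
}

const sample = [
  '<form name="fr">',
  '<input name="ss" id="sss" value="as1">',
  '<select name="c"><option value="1">One</option>',
  "<option>Two</option></select>",
  '<textarea rows="2">some',
  "text</textarea>",
  "</form>",
].join("\n");

// Only the <form> wrapper and the line breaks between the fields remain.
const cleaned = stripFormFields(sample);
console.log(JSON.stringify(cleaned));

// The stated limitation: a ">" inside a quoted value ends the match early,
// so the tail of the tag is left in the output.
const broken = stripFormFields('<input value="a>b">');
console.log(JSON.stringify(broken));
```

The second call shows why the well-formedness caveat matters: the match stops at the first `>`, even though that character sits inside a quoted attribute value.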

Answer 2

I'm not sure a regular expression is your best option. Consider the following JavaScript:

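// The form in the question has name="fr" (no id), so look it up by name.
var container = document.forms["fr"];

if ( container )
{
    // Copy the live collection first; removing nodes would otherwise shift it.
    var inputs = Array.prototype.slice.call( container.getElementsByTagName("input") );

    for ( var i = 0; i < inputs.length; i++ )
    {
        // Remove each input from its own parent, which may be nested in the form.
        inputs[i].parentNode.removeChild( inputs[i] );
    }
}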

Answer 3

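Assuming that: 1.) the HTML passes the W3C validator (HTML 4.01 or XHTML 1.0, strict or transitional), 2.) there are no <![CDATA[ sections, HTML comments, scripts, tag attributes or styles containing the sequences <FORM or </FORM, and 3.) there are no short tags, then the following PHP script should do the trick. (Note that the regex is heavily commented - as all good non-trivial regexes should be!)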

<?php // test.php 20110312_0000
$data = file_get_contents('valid_markup.html');

$re = '%# Match an HTML FORM element.
(                    # $1: Opening tag.
  <FORM\b            # Opening tag opening delimiter and element name.
  (?:                # Non-capture group for optional attribute(s).
    \s+              # Attributes must be separated by whitespace.
    [\w\-.:]+        # Attribute name is required for attr=value pair.
    (?:              # Non-capture group for optional attribute value.
      \s*=\s*        # Name and value separated by "=" and optional ws.
      (?:            # Non-capture group for attrib value alternatives.
        "[^"]*"      # Double quoted string.
      | \'[^\']*\'   # Single quoted string.
      | [\w\-.:]+\b  # Non-quoted attrib value can be A-Z0-9-._:
      )              # End of attribute value alternatives.
    )?               # Attribute value is optional.
  )*                 # Allow zero or more attribute=value pairs
  \s*                # Whitespace is allowed before closing delimiter.
  >                  # Opening tag closing ">" delimiter.
)                    # End $1: Opening tag.
(                    # $2: Tag contents.
  [^<]*              # Everything up to next tag. (normal*)
  (?:                # We found a tag (open or close).
    (?!</?FORM\b) <  # Not us? Match the "<". (special)
    [^<]*            # More of everything up to next tag. (normal*)
  )*                 # Unroll-the-loop. (special normal*)*
)                    # End $2. Tag contents.
(</FORM\s*>)         # $3: Closing tag.
        %ix';
$data = preg_replace($re, '$1$3', $data);
echo($data);
?>
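For readers who do not use PHP, the same unrolled-loop idea can be sketched in JavaScript. This is an approximate port under the same assumptions as above, not the author's code: JavaScript has no free-spacing `x` flag, so the pattern is assembled from commented string pieces, and the names `attrList`, `formPattern` and `emptyForms` are made up for illustration. Like the PHP version, it empties each FORM element entirely rather than removing only the input tags.

```javascript
// Attribute list: whitespace, a name, then an optional quoted or bare value.
const attrList = String.raw`(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+\b))?)*`;

const formPattern = new RegExp(
  String.raw`(<form\b${attrList}\s*>)` +        // $1: opening tag
  String.raw`[^<]*(?:(?!</?form\b)<[^<]*)*` +   // contents: any tag except a FORM tag
  String.raw`(</form\s*>)`,                     // $2: closing tag
  "gi"
);

// Hypothetical helper: keeps the FORM tags and drops everything between them.
function emptyForms(html) {
  return html.replace(formPattern, "$1$2");
}

const page = [
  "<body>",
  '<form name="fr">',
  '<input name="ss" id="sss" value="as1">',
  "</form>",
  "<h1>Heading goes here...</h1>",
].join("\n");

const emptied = emptyForms(page);
console.log(emptied);
```

Because the contents are not captured, the whitespace inside the form is dropped along with the input tag, so the form collapses to `<form name="fr"></form>` while the surrounding markup is left untouched.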

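P.S. Before any regexes-are-not-for-parsing purist judges this solution unsuitable, please provide an example (one that meets the stated assumptions) that demonstrates where it fails. Or show me any other method (regex or otherwise) that is faster. (And please don't rip me a new one - I'm new here and don't know any better!)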

Regex to remove all input/textarea/select from html
