如何使用正则表达式匹配HTML中的charset字符串?

时间:2010-08-11 12:32:03

标签: html regex

HTML代码示例:

<meta http-equiv="Content-type" content="text/html;charset=utf-8" />

我想使用RegEx来提取字符集信息(即这里,它是“utf-8”)

(我正在使用C#)

9 个答案:

答案 0 :(得分:16)

我的回答提供了一个更强大的@ Floyd版本,并且在可能的情况下,解决了@ You的破损测试案例,其中使用负向前瞻来避免它。实际上只有一个我能想到的相关案例(@ You的例子的变体)会给出误报,但我认为这种情况非常罕见。表达式应使用不区分大小写的标志运行,并使用java.util.regexJRegex进行测试。

自动修剪捕获组并且不会包含引号,也不会包含其他标记字符,例如“/”或“&gt;”。在第二个表达式中,有2个捕获组;第一个是内容类型值,可能是空的(即,当使用字符集属性时),第二个是字符集值,它始终是非空的(除非字符集值由于某些奇怪的原因而实际上是空的)。

仅限匹配/分组字符集值的正则表达式 - 修剪,跳过引号

<meta(?!\s*(?:name|value)\s*=)[^>]*?charset\s*=[\s"']*([^\s"'/>]*)

与上面相同,但也匹配/分组内容类型(可选)和charset(必需)值,修剪,跳过引号。次要警告 - 错过匹配独立内容类型值,即“text / html”

<meta(?!\s*(?:name|value)\s*=)(?:[^>]*?content\s*=[\s"']*)?([^>]*?)[\s"';]*charset\s*=[\s"']*([^\s"'/>]*)

测试用例(除最后一个之外全部通过)......

<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1"/>
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1" />
<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1'/>
<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1' />
<meta http-equiv=Content-Type content=text/html;charset=iso-8859-1/>
<meta http-equiv=Content-Type content=text/html;charset=iso-8859-1 />
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1">
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1" >
<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1'>
<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1' >
<meta http-equiv=Content-Type content=text/html;charset=iso-8859-1>
<meta http-equiv=Content-Type content=text/html;charset=iso-8859-1 >

<meta http-equiv="Content-Type" content="text/html;charset='iso-8859-1'">
<meta http-equiv="Content-Type" content="'text/html;charset=iso-8859-1'">
<meta http-equiv="Content-Type" content="'text/html';charset='iso-8859-1'">
<meta http-equiv='Content-Type' content='text/html;charset="iso-8859-1"'>
<meta http-equiv='Content-Type' content='"text/html;charset=iso-8859-1"'>
<meta http-equiv='Content-Type' content='"text/html";charset="iso-8859-1"'>

<meta http-equiv="Content-Type" content="text/html;;;charset=iso-8859-1">
<meta http-equiv="Content-Type" content="text/html;;;charset='iso-8859-1'">
<meta http-equiv="Content-Type" content="'text/html;;;charset=iso-8859-1'">
<meta http-equiv="Content-Type" content="'text/html';;;charset='iso-8859-1'">
<meta http-equiv='Content-Type' content='text/html;;;charset=iso-8859-1'>
<meta http-equiv='Content-Type' content='text/html;;;charset="iso-8859-1"'>
<meta http-equiv='Content-Type' content='"text/html;;;charset=iso-8859-1"'>
<meta http-equiv='Content-Type' content='"text/html";;;charset="iso-8859-1"'>

<meta  http-equiv  =  "  Content-Type  "  content  =  "  '  text/html  '  ;  ;;  '  ;  '  '  ;  '  ;  ' ;;  ;  charset  =  '  iso-8859-1  '  "  >
<meta  content  =  "  '  text/html  '  ;  ;;  '  ;  '  '  ;  '  ;  ' ;;  ;  charset  =  '  iso-8859-1  '  "  http-equiv  =  "  Content-Type  "  >
<meta  http-equiv  =  Content-Type  content  =  text/html;charset=iso-8859-1  >
<meta  content  =  text/html;charset=iso-8859-1  http-equiv  =  Content-Type  >
<meta  http-equiv  =  Content-Type  content  =  text/html  ;  charset  =  iso-8859-1  >
<meta  content  =  text/html  ;  charset  =  iso-8859-1  http-equiv  =  Content-Type  >
<meta  http-equiv  =  Content-Type  content  =  text/html  ;;;  charset  =  iso-8859-1  >
<meta  content  =  text/html  ;;;  charset  =  iso-8859-1  http-equiv  =  Content-Type  >
<meta  http-equiv  =  Content-Type  content  =  text/html  ;  ;  ;  charset  =  iso-8859-1  >
<meta  content  =  text/html  ;  ;  ;  charset  =  iso-8859-1  http-equiv  =  Content-Type  >

<meta charset="utf-8"/>
<meta charset="utf-8" />
<meta charset='utf-8'/>
<meta charset='utf-8' />
<meta charset=utf-8/>
<meta charset=utf-8 />
<meta charset="utf-8">
<meta charset="utf-8" >
<meta charset='utf-8'>
<meta charset='utf-8' >
<meta charset=utf-8>
<meta charset=utf-8 >

<meta  charset  =  "  utf-8  "  >
<meta  charset  =  '  utf-8  '  >
<meta  charset  =  "  utf-8  '  >
<meta  charset  =  '  utf-8  "  >
<meta  charset  =  "  utf-8     >
<meta  charset  =  '  utf-8     >
<meta  charset  =     utf-8  '  >
<meta  charset  =     utf-8  "  >
<meta  charset  =     utf-8     >
<meta  charset  =     utf-8    />

<meta name="title" value="charset=utf-8 — is it really useful (yep)?">
<meta value="charset=utf-8 — is it really useful (yep)?" name="title">
<meta name="title" content="charset=utf-8 — is it really useful (yep)?">
<meta name="charset=utf-8" content="charset=utf-8 — is it really useful (yep)?">

<meta content="charset=utf-8 — is it really useful (nope, not here, but gotta admit pretty robust otherwise)?" name="title">

答案 1 :(得分:8)

这个正则表达式:

<meta.*?charset=([^"']+)

应该有效。使用XML解析器提取此过度杀伤。

答案 2 :(得分:0)

我试着用javascript将你的字符串放在变量中并进行匹配:

var x = '<meta http-equiv="Content-type" content="text/html;charset=utf-8" />';
var result = x.match(/charset=([a-zA-Z0-9-]+)/);
alert(result[1]);

答案 3 :(得分:0)

对于PHP:

$charset = preg_match('/charset=([a-zA-Z0-9-]+)/', $line);
$charset = $charset[1];

答案 4 :(得分:0)

我倾向于同意@You,但我会给你你要求的答案以及其他一些解决方案。

        String meta = "<meta http-equiv=\"Content-type\" content=\"text/html;charset=utf-8\" />";
        String charSet = System.Text.RegularExpressions.Regex.Replace(meta,"<meta.*charset=([^\\s'\"]+).*","$1");

        // if meta tag has attributes encapsulated by double quotes
        String charSet = ((meta.Split(new String[] { "charset=" }, StringSplitOptions.None))[1].Split('"'))[0];
        // if meta tag has attributes encapsulated by single quotes
        String charSet = ((meta.Split(new String[] { "charset=" }, StringSplitOptions.None))[1].Split('\''))[0];

无论哪种方式都可以使用,但是如果没有首先检查数组是否有数据,那么String.Split命令肯定是危险的,所以可能想要突破上面的内容,否则你会得到NullException。 / p>

答案 5 :(得分:0)

我的正则表达式:

<meta[^>]*?charset=([^"'>]*)

我的测试用例:

<meta http-equiv="Content-type" content="text/html;charset=utf-8" />
<meta name="author" value="me"><!-- Maybe we should have a charset=something meta element? --><meta charset="utf-8">

C#-Code:

using System.Text.RegularExpressions;
string resultString = Regex.Match(sourceString, "<meta[^>]*?charset=([^\"'>]*)").Groups[1].Value;

正则表达式-描述:

// <meta[^>]*?charset=([^"'>]*)
// 
// Match the characters "<meta" literally «<meta»
// Match any character that is not a ">" «[^>]*?»
//    Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
// Match the characters "charset=" literally «charset=»
// Match the regular expression below and capture its match into backreference number 1 «([^"'>]*)»
//    Match a single character NOT present in the list ""'>" «[^"'>]*»
//       Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»

答案 6 :(得分:0)

此正则表达式将从任何元标记中捕获字符集

(?<=([<META|<meta])(.*)charset=)([^"'>]*)

示例输入:

<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta http-equiv=Content-Type content=text/html; charset=windows-1252>
<meta http-equiv=Content-Type content='text/html; charset=windows-1252'>
<meta http-equiv="Content-type" content="text/html;charset=utf-8" /> 
<meta http-equiv="Content-type" content="text/html;charset=iso-8859-1" /> 

像这样使用:

Regex regexObj = new Regex("(?<=<meta(.*)charset=)([^\"'>]*)", RegexOptions.IgnoreCase);
Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success) {
    for (int i = 1; i < matchResults.Groups.Count; i++) {
        Group groupObj = matchResults.Groups[i];
        if (groupObj.Success) {
            // matched text: groupObj.Value
            // match start: groupObj.Index
            // match length: groupObj.Length
        } 
    }
    matchResults = matchResults.NextMatch();
} 

会找到这些值:

windows-1252

windows-1252

windows-1252

utf-8

iso-8859-1

答案 7 :(得分:0)

另请尝试:

<meta(?!\s*(?:name|value)\s*=)[^>]*?charset\s*=[\s"']*([a-zA-Z0-9-]+)[\s"'\/]*>

答案 8 :(得分:-1)

Don't use regular expressions to parse (X)HTML!使用适当的工具,即SGML或XML解析器。您的代码看起来像XHTML,所以我尝试使用XML解析器。然而,从meta元素获取属性之后;正则表达式会更合适。虽然,只是在;分割的字符串肯定会起作用(也会更快)。