HTML code sample:
<meta http-equiv="Content-type" content="text/html;charset=utf-8" />
I want to use a regex to extract the charset information (here, that is "utf-8").
(I am using C#.)
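Answer 0 (score: 16)
My answer provides a more robust version of @Floyd's and, where possible, also handles @You's breaking test case by using a negative lookahead to avoid it. I can think of only one relevant case (a variant of @You's example) that still gives a false positive, and I expect it to be very rare. The expressions should be run with the case-insensitive flag; they were tested with java.util.regex and JRegex.
The capture groups are trimmed automatically and contain neither quotes nor other markup characters such as "/" or ">". The second expression has 2 capture groups: the first is the content-type value, which may be empty (i.e., when the charset attribute is used), and the second is the charset value, which is always non-empty (unless the charset value really is empty for some odd reason).
Regex that matches/groups only the charset value (trimmed, quotes skipped):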
<meta(?!\s*(?:name|value)\s*=)[^>]*?charset\s*=[\s"']*([^\s"'/>]*)
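A minimal sketch of how the charset-only expression might be applied. It is written in JavaScript rather than C# purely for illustration; `extractCharset` is a hypothetical helper name, and the third input is a shortened variant of the thread's test data.

```javascript
// Charset-only expression from this answer, run with the case-insensitive flag.
// The "/" inside the character class is escaped for the JavaScript literal.
const charsetOnly = /<meta(?!\s*(?:name|value)\s*=)[^>]*?charset\s*=[\s"']*([^\s"'\/>]*)/i;

// Hypothetical helper: returns the captured charset, or null when nothing matches.
function extractCharset(html) {
  const match = charsetOnly.exec(html);
  return match ? match[1] : null;
}

// Legacy http-equiv form: the value sits inside the content attribute.
console.log(extractCharset('<meta http-equiv="Content-type" content="text/html;charset=utf-8" />'));

// HTML5 form with stray whitespace and quotes: the group is still trimmed.
console.log(extractCharset("<meta charset = ' utf-8 ' >"));

// Tag starting with name=: the negative lookahead rejects it, so the result is null.
console.log(extractCharset('<meta name="title" value="charset=utf-8">'));
```

The lookahead only inspects the first attribute after `<meta`, which is why a tag whose `content` attribute comes first can still slip through.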
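Like the charset-only expression, but this one also matches/groups the content-type (optional) and charset (required) values, trimmed, with quotes skipped. Minor caveat: it does not match a standalone content-type value, i.e. "text/html" with no charset.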
<meta(?!\s*(?:name|value)\s*=)(?:[^>]*?content\s*=[\s"']*)?([^>]*?)[\s"';]*charset\s*=[\s"']*([^\s"'/>]*)
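To make the two capture groups concrete, here is a small JavaScript sketch (not C#; `extractTypeAndCharset` is a hypothetical name used only for this illustration). Group 1 holds the content type and may be empty; group 2 holds the charset.

```javascript
// Two-group expression from this answer, with the case-insensitive flag.
const typeAndCharset = /<meta(?!\s*(?:name|value)\s*=)(?:[^>]*?content\s*=[\s"']*)?([^>]*?)[\s"';]*charset\s*=[\s"']*([^\s"'\/>]*)/i;

// Hypothetical helper: returns both groups, or null when the tag does not match.
function extractTypeAndCharset(html) {
  const match = typeAndCharset.exec(html);
  return match ? { contentType: match[1], charset: match[2] } : null;
}

// Legacy form: both groups are filled.
const legacy = extractTypeAndCharset('<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1"/>');
console.log(legacy.contentType, legacy.charset); // text/html iso-8859-1

// HTML5 form: there is no content attribute, so group 1 is the empty string.
const html5 = extractTypeAndCharset('<meta charset="utf-8">');
console.log(JSON.stringify(html5.contentType), html5.charset); // "" utf-8
```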
Test cases (all pass except the last one):
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1"/>
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1" />
<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1'/>
<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1' />
<meta http-equiv=Content-Type content=text/html;charset=iso-8859-1/>
<meta http-equiv=Content-Type content=text/html;charset=iso-8859-1 />
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1">
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1" >
<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1'>
<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1' >
<meta http-equiv=Content-Type content=text/html;charset=iso-8859-1>
<meta http-equiv=Content-Type content=text/html;charset=iso-8859-1 >
<meta http-equiv="Content-Type" content="text/html;charset='iso-8859-1'">
<meta http-equiv="Content-Type" content="'text/html;charset=iso-8859-1'">
<meta http-equiv="Content-Type" content="'text/html';charset='iso-8859-1'">
<meta http-equiv='Content-Type' content='text/html;charset="iso-8859-1"'>
<meta http-equiv='Content-Type' content='"text/html;charset=iso-8859-1"'>
<meta http-equiv='Content-Type' content='"text/html";charset="iso-8859-1"'>
<meta http-equiv="Content-Type" content="text/html;;;charset=iso-8859-1">
<meta http-equiv="Content-Type" content="text/html;;;charset='iso-8859-1'">
<meta http-equiv="Content-Type" content="'text/html;;;charset=iso-8859-1'">
<meta http-equiv="Content-Type" content="'text/html';;;charset='iso-8859-1'">
<meta http-equiv='Content-Type' content='text/html;;;charset=iso-8859-1'>
<meta http-equiv='Content-Type' content='text/html;;;charset="iso-8859-1"'>
<meta http-equiv='Content-Type' content='"text/html;;;charset=iso-8859-1"'>
<meta http-equiv='Content-Type' content='"text/html";;;charset="iso-8859-1"'>
<meta http-equiv = " Content-Type " content = " ' text/html ' ; ;; ' ; ' ' ; ' ; ' ;; ; charset = ' iso-8859-1 ' " >
<meta content = " ' text/html ' ; ;; ' ; ' ' ; ' ; ' ;; ; charset = ' iso-8859-1 ' " http-equiv = " Content-Type " >
<meta http-equiv = Content-Type content = text/html;charset=iso-8859-1 >
<meta content = text/html;charset=iso-8859-1 http-equiv = Content-Type >
<meta http-equiv = Content-Type content = text/html ; charset = iso-8859-1 >
<meta content = text/html ; charset = iso-8859-1 http-equiv = Content-Type >
<meta http-equiv = Content-Type content = text/html ;;; charset = iso-8859-1 >
<meta content = text/html ;;; charset = iso-8859-1 http-equiv = Content-Type >
<meta http-equiv = Content-Type content = text/html ; ; ; charset = iso-8859-1 >
<meta content = text/html ; ; ; charset = iso-8859-1 http-equiv = Content-Type >
<meta charset="utf-8"/>
<meta charset="utf-8" />
<meta charset='utf-8'/>
<meta charset='utf-8' />
<meta charset=utf-8/>
<meta charset=utf-8 />
<meta charset="utf-8">
<meta charset="utf-8" >
<meta charset='utf-8'>
<meta charset='utf-8' >
<meta charset=utf-8>
<meta charset=utf-8 >
<meta charset = " utf-8 " >
<meta charset = ' utf-8 ' >
<meta charset = " utf-8 ' >
<meta charset = ' utf-8 " >
<meta charset = " utf-8 >
<meta charset = ' utf-8 >
<meta charset = utf-8 ' >
<meta charset = utf-8 " >
<meta charset = utf-8 >
<meta charset = utf-8 />
<meta name="title" value="charset=utf-8 — is it really useful (yep)?">
<meta value="charset=utf-8 — is it really useful (yep)?" name="title">
<meta name="title" content="charset=utf-8 — is it really useful (yep)?">
<meta name="charset=utf-8" content="charset=utf-8 — is it really useful (yep)?">
<meta content="charset=utf-8 — is it really useful (nope, not here, but gotta admit pretty robust otherwise)?" name="title">
Answer 1 (score: 8)
This regex:
<meta.*?charset=([^"']+)
should work. Using an XML parser to extract this is overkill.
Answer 2 (score: 0)
I tried it in JavaScript, putting your string in a variable and matching against it:
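A quick JavaScript check of that expression (a sketch only; `firstCharset` is a hypothetical helper name). It handles the sample from the question, but because the capture group requires at least one non-quote character right after `charset=`, it finds nothing in the HTML5 form where the value is quoted.

```javascript
// The expression from this answer, unchanged.
const simplePattern = /<meta.*?charset=([^"']+)/;

// Hypothetical helper: returns group 1, or null when there is no match.
function firstCharset(html) {
  const match = simplePattern.exec(html);
  return match ? match[1] : null;
}

// Value embedded in the content attribute: matched.
console.log(firstCharset('<meta http-equiv="Content-type" content="text/html;charset=utf-8" />')); // utf-8

// Quoted HTML5 attribute: the quote directly follows "=", so there is no match.
console.log(firstCharset('<meta charset="utf-8">')); // null
```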
var x = '<meta http-equiv="Content-type" content="text/html;charset=utf-8" />';
var result = x.match(/charset=([a-zA-Z0-9-]+)/);
alert(result[1]);
Answer 3 (score: 0)
For PHP:
if (preg_match('/charset=([a-zA-Z0-9-]+)/', $line, $matches)) { $charset = $matches[1]; }
Answer 4 (score: 0)
I tend to agree with @You, but I'll give you the answer you asked for along with a couple of other solutions.
// Sample input
String meta = "<meta http-equiv=\"Content-type\" content=\"text/html;charset=utf-8\" />";
// Option 1: regex replace that keeps only the captured charset value
String charSetRegex = System.Text.RegularExpressions.Regex.Replace(meta, "<meta.*charset=([^\\s'\"]+).*", "$1");
// Option 2: Split, if the meta tag has attributes enclosed in double quotes
String charSetDouble = ((meta.Split(new String[] { "charset=" }, StringSplitOptions.None))[1].Split('"'))[0];
// Option 3: Split, if the meta tag has attributes enclosed in single quotes
String charSetSingle = ((meta.Split(new String[] { "charset=" }, StringSplitOptions.None))[1].Split('\''))[0];
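Either approach works, but calling String.Split without first checking that the resulting array has the expected elements is definitely dangerous. You will probably want to break the one-liners above into separate steps; otherwise you get an IndexOutOfRangeException whenever "charset=" is missing from the input.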
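The guarded version of the split approach could look roughly like this. The sketch is in JavaScript rather than C#, and `charsetBySplit` is a hypothetical name; the point is only the length check before indexing into the split result.

```javascript
// Hypothetical helper: split on "charset=" and check the result before indexing.
function charsetBySplit(meta) {
  const parts = meta.split("charset=");
  if (parts.length < 2) {
    return null; // "charset=" does not occur, so there is nothing to extract
  }
  // Drop a leading quote or whitespace, then cut at the first delimiter.
  const value = parts[1].replace(/^[\s"']+/, "").split(/[\s"'\/>]/)[0];
  return value.length > 0 ? value : null;
}

console.log(charsetBySplit('<meta http-equiv="Content-type" content="text/html;charset=utf-8" />')); // utf-8
console.log(charsetBySplit("<meta charset='utf-8'>")); // utf-8
console.log(charsetBySplit('<meta name="author" content="me">')); // null
```

Splitting on a small set of delimiters instead of a single quote character covers both quoting styles in one pass.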
Answer 5 (score: 0)
My regex:
<meta[^>]*?charset=([^"'>]*)
My test cases:
<meta http-equiv="Content-type" content="text/html;charset=utf-8" />
<meta name="author" value="me"><!-- Maybe we should have a charset=something meta element? --><meta charset="utf-8">
C# code:
using System.Text.RegularExpressions;
string resultString = Regex.Match(sourceString, "<meta[^>]*?charset=([^\"'>]*)").Groups[1].Value;
Regex description:
// <meta[^>]*?charset=([^"'>]*)
//
// Match the characters "<meta" literally «<meta»
// Match any character that is not a ">" «[^>]*?»
// Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
// Match the characters "charset=" literally «charset=»
// Match the regular expression below and capture its match into backreference number 1 «([^"'>]*)»
// Match a single character NOT present in the list ""'>" «[^"'>]*»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
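Running the expression against the two test cases, sketched in JavaScript (not C#). The names `answerPattern`, `quotedVariant`, and `firstGroup` are made up for this illustration, and `quotedVariant` is a hypothetical tweak, not part of the answer. The second test case shows that the comment text "something" is not picked up, but also that a quoted `charset="utf-8"` yields an empty capture, since the group stops at the opening quote.

```javascript
// The expression from this answer, unchanged.
const answerPattern = /<meta[^>]*?charset=([^"'>]*)/;
// Hypothetical variant: allow one optional quote before the captured value.
const quotedVariant = /<meta[^>]*?charset=["']?([^"'>]*)/;

const legacyTag = '<meta http-equiv="Content-type" content="text/html;charset=utf-8" />';
const withComment = '<meta name="author" value="me"><!-- Maybe we should have a charset=something meta element? --><meta charset="utf-8">';

// Returns capture group 1, or null when the pattern does not match at all.
function firstGroup(pattern, html) {
  const match = pattern.exec(html);
  return match ? match[1] : null;
}

console.log(firstGroup(answerPattern, legacyTag));                   // utf-8
console.log(JSON.stringify(firstGroup(answerPattern, withComment))); // ""
console.log(firstGroup(quotedVariant, withComment));                 // utf-8
```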
Answer 6 (score: 0)
This regex captures the charset value from any meta tag:
(?<=(<META|<meta)(.*)charset=)([^"'>]*)
Sample input:
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta http-equiv=Content-Type content=text/html; charset=windows-1252>
<meta http-equiv=Content-Type content='text/html; charset=windows-1252'>
<meta http-equiv="Content-type" content="text/html;charset=utf-8" />
<meta http-equiv="Content-type" content="text/html;charset=iso-8859-1" />
Use it like this:
Regex regexObj = new Regex("(?<=<meta(.*)charset=)([^\"'>]*)", RegexOptions.IgnoreCase);
Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success) {
    for (int i = 1; i < matchResults.Groups.Count; i++) {
        Group groupObj = matchResults.Groups[i];
        if (groupObj.Success) {
            // matched text: groupObj.Value
            // match start: groupObj.Index
            // match length: groupObj.Length
        }
    }
    matchResults = matchResults.NextMatch();
}
It will find these values:
windows-1252
windows-1252
windows-1252
utf-8
iso-8859-1
Answer 7 (score: 0)
Also try:
<meta(?!\s*(?:name|value)\s*=)[^>]*?charset\s*=[\s"']*([a-zA-Z0-9-]+)[\s"'\/]*>
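A short JavaScript sketch of how this stricter expression behaves (`strictCharset` is a hypothetical helper name; the case-insensitive flag is an assumption carried over from the top answer). Because the pattern requires the closing `>` right after the value, apart from whitespace, quotes, or a slash, it only matches when the charset is the last thing in the tag.

```javascript
// The expression from this answer; the case-insensitive flag is an assumption.
const strictPattern = /<meta(?!\s*(?:name|value)\s*=)[^>]*?charset\s*=[\s"']*([a-zA-Z0-9-]+)[\s"'\/]*>/i;

// Hypothetical helper: returns the captured charset, or null when nothing matches.
function strictCharset(html) {
  const match = strictPattern.exec(html);
  return match ? match[1] : null;
}

console.log(strictCharset('<meta charset="utf-8">')); // utf-8
console.log(strictCharset('<meta http-equiv="Content-type" content="text/html;charset=utf-8" />')); // utf-8

// Another attribute follows the charset value, so the trailing ">" cannot match.
console.log(strictCharset('<meta content = text/html;charset=iso-8859-1 http-equiv = Content-Type >')); // null
```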
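Answer 8 (score: -1)
Don't use regular expressions to parse (X)HTML! Use a proper tool, i.e. an SGML or XML parser. Your code looks like XHTML, so I would try an XML parser. Once you have the attribute from the meta element, though, a regex is more appropriate. That said, simply splitting the string on ; would certainly work (and be faster).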
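Assuming a parser has already handed you the value of the content attribute, the split on `;` might be sketched like this in JavaScript (`charsetFromContent` is a hypothetical name, and the parsing step itself is not shown):

```javascript
// Hypothetical helper: takes the content attribute value, e.g. "text/html;charset=utf-8".
function charsetFromContent(content) {
  for (const part of content.split(";")) {
    // Each parameter looks like "key=value"; the media type itself has no "=".
    const [key, value] = part.split("=").map((s) => s.trim());
    if (key.toLowerCase() === "charset" && value) {
      return value;
    }
  }
  return null; // no charset parameter present
}

console.log(charsetFromContent("text/html;charset=utf-8"));         // utf-8
console.log(charsetFromContent("text/html; charset=windows-1252")); // windows-1252
console.log(charsetFromContent("text/html"));                       // null
```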