My HTML code looks like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
Or it could look like this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">
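I want to extract the doctype name: "XHTML 1.0 Strict" from the first example and "HTML 4.0" from the second. What regular expression would do this?
I would like to use it with PHP's preg_match() function.
Any help with this would be appreciated.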
Answer 0 (score: 3)
If the doctypes always take the form shown, you can use
'#(?<=<!DOCTYPE HTML PUBLIC "-//W3C//DTD )[^/]+#i'
So:
preg_match('#(?<=<!DOCTYPE HTML PUBLIC "-//W3C//DTD )[^/]+#i', $html, $match);
echo $match[0];
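As a minimal sketch of how the pattern above behaves on the two doctypes from the question, the snippet below wraps it in a helper. The function name extractDoctypeName and the variables $xhtmlStrict and $html40 are hypothetical names chosen for illustration; the pattern itself is the one from this answer. It assumes the doctype uses exactly one space between tokens, since a lookbehind has to be fixed-length.

```php
<?php
// Hypothetical helper (not part of the original answer): applies the
// lookbehind pattern and returns the doctype name, or null if the
// document has no doctype in the expected form.
function extractDoctypeName(string $html): ?string
{
    $pattern = '#(?<=<!DOCTYPE HTML PUBLIC "-//W3C//DTD )[^/]+#i';
    if (preg_match($pattern, $html, $match) === 1) {
        return trim($match[0]);
    }
    return null;
}

// The two doctypes from the question.
$xhtmlStrict = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">';
$html40 = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">';

echo extractDoctypeName($xhtmlStrict), "\n"; // XHTML 1.0 Strict
echo extractDoctypeName($html40), "\n";      // HTML 4.0
```

A short doctype such as <!DOCTYPE html> has no public identifier, so the helper returns null for it.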
Answer 1 (score: 3)
How about using DOMDocument and DOMDocumentType?
$xml = new DOMDocument();
$xml->loadHTMLFile($url);
$name = $xml->doctype->publicId; // -//W3C//DTD XHTML 1.0 Strict//EN
$xml->doctype now contains the following values:
DOMDocumentType Object
(
[name] => html
[entities] => (object value omitted)
[notations] => (object value omitted)
[publicId] => -//W3C//DTD XHTML 1.0 Strict//EN
[systemId] => http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
[internalSubset] =>
[nodeName] => html
[nodeValue] =>
[nodeType] => 10
[parentNode] => (object value omitted)
[childNodes] =>
[firstChild] =>
[lastChild] =>
[previousSibling] =>
[nextSibling] => (object value omitted)
[attributes] =>
[ownerDocument] => (object value omitted)
[namespaceURI] =>
[prefix] =>
[localName] =>
[baseURI] =>
[textContent] =>
)
So you can now easily extract the type:
$name = $xml->doctype->publicId;
$name = preg_replace('~.*//DTD\s*(.*?)//.*~', '$1', $name); // \s* drops the space after "DTD"
echo $name;
This yields XHTML 1.0 Strict. There is a phpfiddle example here.
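The DOM approach can also be sketched for HTML held in a string rather than loaded from a URL. The function names doctypeNameFromPublicId and doctypeNameFromHtml are hypothetical, and the second one assumes the DOM extension is enabled and that the input string is not empty. This is one possible arrangement, not the answer author's code.

```php
<?php
// Hypothetical helper: pulls the name between "//DTD " and the next "//"
// out of a public identifier such as "-//W3C//DTD XHTML 1.0 Strict//EN".
function doctypeNameFromPublicId(string $publicId): ?string
{
    if (preg_match('~//DTD\s+(.*?)//~', $publicId, $m) === 1) {
        return trim($m[1]);
    }
    return null; // e.g. an empty public identifier
}

// Hypothetical wrapper around the DOMDocument approach shown above.
// Requires the DOM extension; $html must be a non-empty string.
function doctypeNameFromHtml(string $html): ?string
{
    $doc = new DOMDocument();
    // Suppress parser warnings caused by malformed real-world markup.
    @$doc->loadHTML($html);
    if ($doc->doctype === null) {
        return null; // the parser did not produce a doctype node
    }
    return doctypeNameFromPublicId($doc->doctype->publicId);
}

echo doctypeNameFromPublicId('-//W3C//DTD XHTML 1.0 Strict//EN'), "\n"; // XHTML 1.0 Strict
```

Keeping the string parsing in its own function means it can be checked without a parser, and the null checks avoid reading properties from a missing doctype node.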
Answer 2 (score: 1)
function contains($haystack, $needle){
// Case-insensitive, because public identifiers are written as "XHTML", "Strict", etc.
return stripos($haystack, $needle) !== false;
}
$theDocType = "";
$stringWithHTML = ""; // load some HTML in here from somewhere
// Create DOM from HTML
$doc = new DOMDocument();
//@$doc->loadHTMLFile("just_a_file.html");
@$doc->loadHTML($stringWithHTML);
// Grab document type
$dtName = $doc->doctype->name;
$dtPublic = $doc->doctype->publicId;
if( strcasecmp((string)$dtName, "html") === 0 && $dtPublic != "" ){
// HTML or XHTML?
if(contains($dtPublic,"xhtml")){
$theDocType = "XHTML 1.0";
}else{
$theDocType = "HTML 4.01";
}
// Which type?
if(contains($dtPublic,"strict")){
$theDocType .= " (Strict)";
}elseif(contains($dtPublic,"transitional")){
$theDocType .= " (Transitional)";
}elseif(contains($dtPublic,"frameset")){
$theDocType .= " (Frameset)";
}else{
$theDocType = "XHTML 1.1"; // XHTML 1.1
}
}else{
$theDocType = "HTML 5";
}
// Result
echo $theDocType;
This will output something like:
XHTML 1.1
HTML 5
HTML 4.01 (Strict)
Answer 3 (score: 0)
Try this:
<?php
$html = file_get_contents("http://google.com");
// [^>]* also matches newlines, so a doctype split across lines is still found.
preg_match("/<!DOCTYPE[^>]*>/i", $html, $matches);
$doctype = $matches[0] ?? ""; // empty string if the page has no doctype
?>
Answer 4 (score: 0)
'~<!doctype.*?//dtd\s+([^/]*)//EN~i'
This should work as a pattern for both of your examples; the doctype name ends up in capture group 1.
Answer 5 (score: 0)
This regular expression extracts everything between "DTD" and the next "/", without any syntax checking:
.*DTD\s+([^/]+)
This regular expression extracts the document type and also checks some of the syntax of the string:
<!DOCTYPE\s+\w*\s*\w*\s*"[-//\w\d]*DTD\s+([\w\d\s.]*)[^"]*[^>]*>
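To see how the two patterns compare, here is a small sketch that runs both against the doctypes from the question. The helper firstGroup and the variable names are hypothetical; the patterns are copied from this answer and wrapped in "~" delimiters because both contain "/".

```php
<?php
// Hypothetical helper: returns capture group 1 of $pattern, or null when
// the subject does not match.
function firstGroup(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? trim($m[1]) : null;
}

// The two patterns from this answer.
$loose   = '~.*DTD\s+([^/]+)~';
$checked = '~<!DOCTYPE\s+\w*\s*\w*\s*"[-//\w\d]*DTD\s+([\w\d\s.]*)[^"]*[^>]*>~';

// The two doctypes from the question.
$xhtmlStrict = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">';
$html40 = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">';

echo firstGroup($loose, $xhtmlStrict), "\n";   // XHTML 1.0 Strict
echo firstGroup($checked, $xhtmlStrict), "\n"; // XHTML 1.0 Strict
echo firstGroup($loose, $html40), "\n";        // HTML 4.0
echo firstGroup($checked, $html40), "\n";      // HTML 4.0
```

Both patterns are case-sensitive as written, so a lowercase <!doctype ...> would need the i modifier, and neither matches a short <!DOCTYPE html> declaration.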
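Answer 6 (score: 0)
I have used this thread in the past, but during testing I found problems with some large documents. Sometimes developers split the doctype across 2 or 3 lines. In that case, a regular expression is not the best approach.
Here is my approach, which handles doctypes written on one line or on several: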
<?php
class Doctype {
var $html;
var $doctype;
var $version;
function __construct($html){
$this->html = $html;
$this->extractDoctype();
$this->processDoctype();
}
private function extractDoctype(){
$preDoctype = "";
$preDoctypeValid = false;
$lines = explode(PHP_EOL, $this->html);
foreach ($lines as $line) {
// Join with a space so tokens on separate lines are not fused together.
$preDoctype = $preDoctype . " " . $line;
if(
(strpos(strtolower($preDoctype), "<!doctype") !== false) &&
(strpos(strtolower($preDoctype), ">") !== false)){
$preDoctypeValid = true;
break;
}
}
if($preDoctypeValid){
//Store only the pattern: <! doctype >
$pos1 = strpos(strtolower($preDoctype), "<!doctype");
$pos2 = strpos($preDoctype, ">", $pos1) + 1;
$preDoctype = substr($preDoctype, $pos1, $pos2 - $pos1); // third argument is a length, not an end position
}else{
$preDoctype = "";
}
$this->doctype = $preDoctype;
}
private function processDoctype(){
$version = "";
$pattern_html5 = "/<!doctype\s+?html\s?>/i";
if (preg_match($pattern_html5, strtolower($this->doctype))) {
$version = "HTML5";
}else if(strpos(strtolower($this->doctype), "xhtml") !== false){
$version = "XHTML";
}else if(strpos(strtolower($this->doctype), "html") !== false){
if(strpos(strtolower($this->doctype), "3.2") !== false){
$version = "HTML 3.2";
}
if(strpos(strtolower($this->doctype), "4.01") !== false){
$version = "HTML 4.01";
}
if(strpos(strtolower($this->doctype), "2.0") !== false){
$version = "HTML 2.0";
}
}else{
$version = "OTHER";
}
$this->version = $version;
}
public function getDoctype(){
return $this->doctype;
}
public function getDoctypeVersion(){
return $this->version;
}
}
?>
https://github.com/jabrena/WTAnalyzer/blob/master/r_php/document/Doctype.class.php
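For comparison with the class above, a much smaller sketch can handle the simple multi-line case as well, because the character class [^>] also matches line breaks. The function name extractDoctypeTag and the sample input are hypothetical, and the sketch assumes the doctype has no internal subset containing ">". It only returns the normalized declaration and does not classify the version.

```php
<?php
// Hypothetical helper: returns the doctype declaration on a single line,
// or null if the document has none.
function extractDoctypeTag(string $html): ?string
{
    if (preg_match('/<!DOCTYPE[^>]*>/i', $html, $m) === 1) {
        // Collapse runs of whitespace, including line breaks, into one space.
        return preg_replace('/\s+/', ' ', $m[0]);
    }
    return null;
}

// A doctype split across three lines, as described in this answer.
$multiLine = "<!DOCTYPE html\n"
    . "  PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"\n"
    . "  \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n"
    . "<html></html>";

echo extractDoctypeTag($multiLine), "\n";
```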