Regular expression to get the HTML doctype

Date: 2013-04-24 14:25:03

Tags: php regex

My HTML code looks like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

Or it can look like this:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">

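I want to get the doc type "XHTML 1.0 Strict" (from the first) or "HTML 4.0" (from the second). What regular expression would do this? I would like to use it with PHP's preg_match() function.

Please help me with this.

7 answers:

Answer 0 (score: 3)

If the doctypes always take the form shown, you can use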

'#(?<=<!DOCTYPE HTML PUBLIC "-//W3C//DTD )[^/]+#i'

So:

preg_match('#(?<=<!DOCTYPE HTML PUBLIC "-//W3C//DTD )[^/]+#i', $html, $match);
echo $match[0];
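
A minimal sketch of the same idea wrapped in a function, assuming the doctype follows the W3C public-identifier form shown in the question. The name doctypeLabel is hypothetical. A capture group replaces the lookbehind so that \s+ can tolerate extra whitespace, since PCRE lookbehinds have traditionally been limited to fixed-length text.

```php
<?php
// Hypothetical helper: returns the label between "//DTD " and the next "//",
// or null when the input has no W3C-style public identifier.
function doctypeLabel(string $html): ?string
{
    $pattern = '#<!DOCTYPE\s+html\s+PUBLIC\s+"-//W3C//DTD\s+([^/]+)//#i';
    if (preg_match($pattern, $html, $match) === 1) {
        return trim($match[1]);
    }
    return null;
}

// The two doctypes from the question.
$strict = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">';
$html40 = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">';

echo doctypeLabel($strict), "\n"; // XHTML 1.0 Strict
echo doctypeLabel($html40), "\n"; // HTML 4.0
```

A short doctype such as <!DOCTYPE html> has no public identifier, so this sketch returns null for it.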

Answer 1 (score: 3)

How about using DOMDocument and DOMDocumentType?

$xml = new DOMDocument(); 
$xml->loadHTMLFile($url);

$name = $xml->doctype->publicId; // -//W3C//DTD XHTML 1.0 Strict//EN

$xml->doctype now contains the following values:

DOMDocumentType Object
(
    [name] => html
    [entities] => (object value omitted)
    [notations] => (object value omitted)
    [publicId] => -//W3C//DTD XHTML 1.0 Strict//EN
    [systemId] => http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
    [internalSubset] => 
    [nodeName] => html
    [nodeValue] => 
    [nodeType] => 10
    [parentNode] => (object value omitted)
    [childNodes] => 
    [firstChild] => 
    [lastChild] => 
    [previousSibling] => 
    [nextSibling] => (object value omitted)
    [attributes] => 
    [ownerDocument] => (object value omitted)
    [namespaceURI] => 
    [prefix] => 
    [localName] => 
    [baseURI] => 
    [textContent] => 
)

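So now you can easily extract the type:

$name = $xml->doctype->publicId;
// \s+ keeps the leading space out of the captured label
$name = preg_replace('~.*//DTD\s+(.*?)//.*~', '$1', $name);
echo $name;

This results in XHTML 1.0 Strict. The original answer also linked a phpfiddle example.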
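
A minimal sketch of the DOM approach applied to an in-memory string rather than a URL. The helper name doctypeLabelFromDom is hypothetical, and the null handling is an assumption about how a caller would want doctypes without a public ID (such as the HTML5 one) reported.

```php
<?php
// Hypothetical helper: parse the markup and read the label from the
// doctype node's publicId; returns null when there is no public ID.
function doctypeLabelFromDom(string $html): ?string
{
    $doc = new DOMDocument();
    // Collect libxml warnings about imperfect markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // LIBXML_HTML_NODEFDTD asks libxml not to add its own default doctype.
    $doc->loadHTML($html, LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $doctype = $doc->doctype;
    if ($doctype === null || $doctype->publicId === '') {
        return null;
    }
    if (preg_match('~//DTD\s+(.*?)//~', $doctype->publicId, $m) === 1) {
        return $m[1];
    }
    return null;
}

$xhtml = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
       . '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
       . '<html><head><title>t</title></head><body><p>x</p></body></html>';

echo doctypeLabelFromDom($xhtml), "\n"; // XHTML 1.0 Strict
```

Letting the parser locate the doctype means the regex only has to deal with the short publicId string, not the whole page.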

Answer 2 (score: 1)

function contains($haystack, $needle){
    // Case-insensitive: public IDs spell "XHTML", "Strict", etc. with capitals
    return stripos($haystack, $needle) !== false;
}

$theDocType = "";
$stringWithHTML = ""; // load some HTML in here from somewhere (must not be empty)

// Create DOM from HTML
$doc = new DOMDocument();
//@$doc->loadHTMLFile("just_a_file.html");
@$doc->loadHTML($stringWithHTML);

// Grab document type, guarding against a null doctype node
$dt = $doc->doctype;
$dtName = $dt ? $dt->name : "";
$dtPublic = $dt ? $dt->publicId : "";
if ($dtName == "html" && $dtPublic != "") {   // comparison, not assignment
    // HTML or XHTML?
    if (contains($dtPublic, "xhtml")) {
        $theDocType = "XHTML 1.0";
    } else {
        $theDocType = "HTML 4.01";
    }
    // Which type?
    if (contains($dtPublic, "strict")) {
        $theDocType .= " (Strict)";
    } elseif (contains($dtPublic, "transitional")) {
        $theDocType .= " (Transitional)";
    } elseif (contains($dtPublic, "frameset")) {
        $theDocType .= " (Frameset)";
    } elseif (contains($dtPublic, "xhtml 1.1")) {
        $theDocType = "XHTML 1.1"; // XHTML 1.1
    }
} else {
    // No public ID, as with <!DOCTYPE html>
    $theDocType = "HTML 5";
}

// Result
echo $theDocType;

This will output something like:
XHTML 1.1
HTML 5
HTML 4.01 (Strict)

Answer 3 (score: 0)

Try this:

<?php
   $html = file_get_contents("http://google.com");
   // [^>]* also spans newlines, so the newlines need not be stripped first
   $doctype = "";
   if (preg_match("/<!DOCTYPE[^>]*>/i", $html, $matches)) {
       $doctype = $matches[0];   // the whole declaration, e.g. <!doctype html>
   }
?>

Answer 4 (score: 0)

'~<!doctype.*?//dtd\s+([^/]*)//EN~i'

This should work as a pattern for both of your examples; the type is in capture group 1.

Answer 5 (score: 0)

This regex extracts everything between "DTD" and the next "/", without any syntax checking:

.*DTD\s+([^/]+)

This regex extracts the document type and also checks some of the syntax in the string:

<!DOCTYPE\s+\w*\s*\w*\s*"[-//\w\d]*DTD\s+([\w\d\s.]*)[^"]*[^>]*>
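
As a quick check, both patterns from this answer can be run against the two doctypes from the question. This is a sketch under assumptions: the "~" delimiters and the helper name firstGroup are not part of the original answer, and both patterns are case-sensitive as written.

```php
<?php
// Patterns copied from the answer; only the "~" delimiters are added here.
$loosePattern  = '~.*DTD\s+([^/]+)~';
$strictPattern = '~<!DOCTYPE\s+\w*\s*\w*\s*"[-//\w\d]*DTD\s+([\w\d\s.]*)[^"]*[^>]*>~';

// Hypothetical helper: returns capture group 1, or null when nothing matches.
function firstGroup(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[1] : null;
}

$samples = [
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">',
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">',
];

foreach ($samples as $doctype) {
    echo firstGroup($loosePattern, $doctype), ' | ',
         firstGroup($strictPattern, $doctype), "\n";
}
// XHTML 1.0 Strict | XHTML 1.0 Strict
// HTML 4.0 | HTML 4.0
```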

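Answer 6 (score: 0)

I have used this thread in the past, but during testing I found some problems with large documents. Sometimes developers split the doctype across 2 or 3 lines. In that case, a regular expression is not the best approach.

Here is my approach, which handles doctypes written on one line or on several: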

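<?php
class Doctype {
    private $html;
    private $doctype;
    private $version;
    // PHP 8 no longer treats a method named after the class as the constructor
    public function __construct($html){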
       $this->html = $html;
       $this->extractDoctype();
       $this->processDoctype();
    }
    private function extractDoctype(){
        $preDoctype = "";
        $preDoctypeValid = false;
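        // Split on any line-break style, not only the platform's PHP_EOL
        $lines = preg_split('/\R/', $this->html);
        foreach ($lines as $line) {
            // Join with a space so tokens on adjacent lines do not run together
            $preDoctype = $preDoctype . $line . " ";
            $start = stripos($preDoctype, "<!doctype");
            // Only accept a ">" that comes after the "<!doctype" opener
            if ($start !== false && strpos($preDoctype, ">", $start) !== false) {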
                $preDoctypeValid = true;
                break;
            }
        }
        if($preDoctypeValid){
            //Store only the pattern: <! doctype >
            $pos1 = stripos($preDoctype, "<!doctype");
            $pos2 = strpos($preDoctype, ">", $pos1) + 1;
            // substr() takes a length as its third argument, not an end offset
            $preDoctype = substr($preDoctype, $pos1, $pos2 - $pos1);
        }else{
            $preDoctype = "";
        }
        $this->doctype = $preDoctype;
    }
    private function processDoctype(){
        $version = "";

        $pattern_html5 = "/<!doctype\s+?html\s?>/i";
        if (preg_match($pattern_html5, strtolower($this->doctype))) {
            $version = "HTML5";
        }else if(strpos(strtolower($this->doctype), "xhtml") !== false){
            $version = "XHTML";     
        }else if(strpos(strtolower($this->doctype), "html") !== false){
            if(strpos(strtolower($this->doctype), "3.2") !== false){
                $version = "HTML 3.2";  
            }
            if(strpos(strtolower($this->doctype), "4.01") !== false){
                $version = "HTML 4.01"; 
            }
            if(strpos(strtolower($this->doctype), "2.0") !== false){
                $version = "HTML 2.0";  
            }
        }else{
            $version = "OTHER";
        }
        $this->version = $version;
    }
    public function getDoctype(){
        return $this->doctype;
    }
    public function getDoctypeVersion(){
        return $this->version;
    }
}
?>

https://github.com/jabrena/WTAnalyzer/blob/master/r_php/document/Doctype.class.php
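
The multi-line case described in this answer can also be sketched more compactly. This is an alternative under assumptions, not the author's method: the function name extractDoctypeDeclaration is hypothetical, and it relies on [^>] matching line breaks so that a declaration split over several lines is captured in one piece.

```php
<?php
// Hypothetical helper: returns the doctype declaration collapsed onto a
// single line, or an empty string when none is found.
function extractDoctypeDeclaration(string $html): string
{
    // [^>] also matches newlines, so a split declaration is captured whole.
    if (preg_match('/<!doctype\b[^>]*>/i', $html, $m) !== 1) {
        return '';
    }
    // Collapse each whitespace run (including line breaks) into one space.
    return preg_replace('/\s+/', ' ', $m[0]);
}

$multiLine = "<!DOCTYPE html\n"
           . "    PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"\n"
           . "    \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n"
           . "<html></html>";

echo extractDoctypeDeclaration($multiLine), "\n";
```

The normalized string can then be passed to any of the single-line patterns from the other answers.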