Question

我可以使用Mojo::DOM及其CSS3选择器来确定HTML文档的DOCTYPE吗？与我的其他问题How should I process HTML META tags with Mojo::UserAgent?相关，我想要设置文档的字符集，我需要知道要查看什么，doctype sniffing似乎是要做的事情。当文档设置覆盖服务器设置（或非设置）时，HTML和HTML 5对HTML中的字符集具有不同的元标记。

我没有完成任务的问题，因为我可以抓住原始响应并使用正则表达式来查看DOCTYPE。 Since the browser DOMs seem to be able to get the DOCTYPE，我感染了我应该能够得到它的想法。然而，由于缺乏实例，我认为没有人按照我认为应该这样做的方式来做。

我尝试了许多愚蠢的方法，但我的CSS功夫很弱：

use v5.20;

use feature qw(signatures);
no warnings qw(experimental::signatures);

use Mojo::DOM;

my $html = do { local $/; <DATA> };

my $dom = Mojo::DOM->new( $html );

say "<title> is => ", $dom->find( 'head title' )->map( 'text' )->each;

say "Doctype with find is => ", $dom->find( '!doctype' )->map( 'text' )->each;

say "Doctype with nodes is => ", $dom->[0];

__DATA__

<!DOCTYPE html>
<head>
<title>This is a title</title>
</head>
<body>
<h1>Level 1</h1>
</body>
</html>

当我转储$dom对象时，我在树中看到了DOCTYPE：

$VAR1 = bless( do{\(my $o = bless( {
                      'tree' => [
                                  'root',
                                  [
                                    'text',
                                    '',
                                    ${$VAR1}->{'tree'}
                                  ],
                                  [
                                    'doctype',
                                    ' html',
                                    ${$VAR1}->{'tree'}
                                  ],

现在我该怎么做？

Answer 1

确定HTML5文档的编码非常complex。我担心Mojo::DOM只是一个片段解析器，因此我们已经确定编码嗅探算法的完整实现将超出范围。谢天谢地，大多数网络都是UTF-8编码的，我想这就是为什么这个问题不经常出现的原因。

Answer 2

我仍然认为有更好的方法可以做到这一点，但也许我对Mojo::UserAgent承担了太多的责任。我可以构建一个事务并在响应中添加finish事件。在那种情况下，我使用正则表达式嗅探内容并添加带有doc类型的X-标头。我可能会以其他方式传递信息，但这不是重点（尽管仍在采取建议！）

use v5.14;

use Mojo::UserAgent;

@ARGV = qw(http://blogs.perl.org);

my $ua = Mojo::UserAgent->new;

my $tx = $ua->build_tx( GET => $ARGV[0] );
$tx->res->on( finish => sub {
    my $res = shift;
    my( $doctype ) = $res->body =~ m/\A \s* (<!DOCTYPE.*?>)/isx;
    if( $doctype ) {
        say "Found doctype => $doctype";
        $res->headers->header( 'X-doctype', $doctype );
        }
    });
$tx = $ua->start($tx);

say "-----Headers-----";
say $tx->res->headers->to_string =~ s/\R+/\n/rg;

这是输出：

Found doctype => <!DOCTYPE html>
-----Headers-----
Connection: Keep-Alive
Server: Apache/2.2.12 (Ubuntu)
Content-Type: text/html
Content-Length: 20624
Accept-Ranges: bytes
X-doctype: <!DOCTYPE html>
Last-Modified: Wed, 16 Sep 2015 13:08:26 GMT
ETag: "26d42e8-5090-51fdcfe768680"
Date: Wed, 16 Sep 2015 13:40:02 GMT
Keep-Alive: timeout=15, max=100
Vary: Accept-Encoding

现在我必须考虑解析DOCTYPE值的各种事情，并根据内容做出决定。

Doctype使用CSS3进行嗅探，特别是使用Mojo :: DOM

2 个答案: