Question

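I am writing a Perl script that fetches various HTML documents from many different websites and tries to extract data from them. I am having trouble decoding these files.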

I know how to read the charset from a meta tag (if there is one) and how to read it from the HTTP header.

The result might be:

UTF-8
ISO-8859-1
SHIFT_JIS
windows-1252

and many more.

With that knowledge, I want to decode the document in my Perl script:

#!/usr/bin/perl -w

use strict;

use LWP::UserAgent;
use Encode;
use Encode::JP;

# Maybe also use other extensions for Encode

my $ua = LWP::UserAgent->new;
my $response = $ua->get($url); # $url is the document's URL

if ( $response->is_success ) {

    my $charset = getcharset($response);
    # getcharset is a self-written subroutine that reads the charset
    # from a meta tag or from the HTTP header (not shown in this example)

    # Now I know the document's charset and want to find its encoding:

    my $encoding = 'utf-8'; # default

    if ($charset eq 'utf-8') {
        $encoding = 'utf-8'; # Here $encoding and $charset are equal

    }
    elsif ( $charset eq 'Shift_JIS' ) {
        $encoding = 'shiftjis'; #here $encoding and $charset are not equal
    }
    elsif ( $charset eq 'windows-1252' ) {
        # Here I have no idea what $encoding should be, since there is no
        # encoding in the documentation that contains the string "windows"

    }
    elsif ( $charset eq 'any other character set' ) {
        # $encoding = ???;  (no idea which Encode name belongs here)
    }

    my $content = decode($encoding, $response->content);

    # Extract data from $content
}

But I could not find the right encoding for some of the character sets that exist in the wild.

Answer 1

For HTML documents, all you need is

my $content = $response->decoded_content();

It uses the charset attribute from the HTTP header and the META element, as appropriate.
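As a rough offline sketch of what that call does, the response below is built by hand instead of being fetched. The header value and the sample bytes are assumptions made up for illustration; default_charset and raise_error are documented options of decoded_content.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hand-built response standing in for $ua->get($url); the header and
# the body bytes are invented for this example.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    "caf\xC3\xA9",    # "cafe" with an acute accent, as UTF-8 bytes
);

# decoded_content takes the charset from the Content-Type header here.
# raise_error makes it die on failure instead of returning undef.
my $content = eval {
    $response->decoded_content( default_charset => 'UTF-8', raise_error => 1 );
};
die "Could not decode: $@" unless defined $content;

# $content is now a character string rather than a byte string.
printf "%d bytes in, %d characters out\n",
    length $response->content, length $content;
```

Wrapping the call in eval is a design choice for scripts that crawl many sites: one undecodable page is reported without stopping the whole run.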

"But I could not find the right encoding for some of the character sets that exist in the wild."

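Encode does not support every encoding in existence, but I am surprised that you have come across an HTML page you cannot decode. It may simply be a matter of creating an alias, but you have not given us any details to help you with.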
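If a page announces a charset label that Encode does not recognise, an alias can point that label at an encoding that does exist. A minimal sketch using Encode::Alias; the label 'x-example-latin1' is invented purely for illustration and is not a real charset name.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);
use Encode::Alias;    # exports define_alias by default

# 'x-example-latin1' is a hypothetical label made up for this example.
define_alias( 'x-example-latin1' => 'iso-8859-1' );

# The alias now resolves to a real encoding object.
my $enc = find_encoding('x-example-latin1')
    or die "alias did not resolve";

# decode accepts the alias just like a built-in encoding name.
my $text = decode( 'x-example-latin1', "caf\xE9" );
printf "%s: %d characters\n", $enc->name, length $text;
```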

Answer 2

See Encode::Supported. Basically, most encodings should Just Work™.

binmode STDIN, ':encoding(Shift_JIS)';
binmode STDIN, ':encoding(windows-1252)';

Neither gives an error.
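The lookup can also be checked directly. Encode's find_encoding accepts a charset label and returns an encoding object, or undef if the label is unknown, so a hand-written if/elsif mapping should not be needed. A small sketch; the list of labels, the %perl_name hash and the 'no-such-charset' label are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Map each charset label to the name Encode uses internally
# (undef when Encode does not know the label).
my %perl_name;
for my $charset ( 'UTF-8', 'ISO-8859-1', 'Shift_JIS', 'windows-1252', 'no-such-charset' ) {
    my $enc = find_encoding($charset);
    $perl_name{$charset} = $enc ? $enc->name : undef;

    # With a valid object, $enc->decode($bytes) turns bytes into characters.
    printf "%-16s => %s\n", $charset, $perl_name{$charset} // '(not known to Encode)';
}
```

windows-1252 is typically reported under the name cp1252, which would explain why no encoding containing the string "windows" shows up in the documentation.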

Which Perl encoding for which HTML charset?
