Polish characters in Perl [HTML::TreeBuilder and UTF-8 input files]

Date: 2014-07-04 14:22:29

Tags: html perl encoding utf-8 special-characters


I have tried everything I could find on Stack Overflow and other forums, but for some reason what worked for others does not work for me. I have used:

use open qw(:std :utf8);
use HTML::TreeBuilder qw( );
use Object::Destroyer qw( );
#and many others;


use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;
use File::Find;
use Encode;

my $location="C:\\MyLocation";
open (MYFILE, '>>data.txt');

sub find_txt {
    my $F = $File::Find::name;

    if ($F =~ /index.html$/) {
        my $tr = HTML::TreeBuilder->new->parse_file('index.html');

        for my $div ($tr->look_down(_tag => 'h2', 'class' => 'featured')) {
            say $div->as_text;
            print (MYFILE $div->as_text);
        }

        for my $div ($tr->look_down(_tag => 'div', 'class' => 'post-content')) {
            for my $t ($div->look_down(_tag => 'p')) {
                say $t->as_text;
                print (MYFILE $t->as_text);
            }
        }

        for my $div ($tr->look_down(_tag => 'h4', 'class' => 'related-posts')) {
            for my $t ($div->look_down(_tag => 'a')) {
                say $t->as_text;
                print (MYFILE $t->as_text);
            }
        }
    }
}


find(\&find_txt, $location);
close (MYFILE);


<div class="post-content">
  <p>(łac. abacus)</p>
  <p>1. płyta będąca najwyższą częścią kolumny</p>
  <p>2. w starożytności &#8211; deska do liczenia, pierwowzór liczydła</p>

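I am not sure whether your browser will display the Polish characters, but the Unicode code points (hexadecimal) of the characters involved are 104, 106, 118, 141, 143, D3, 15A, 179, 17B, 105, 107, 119, 142, 144, F3, 15B, 17A, 17C.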
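
For readers who want to see which letters the code points above stand for, here is a minimal sketch that converts them with chr(). It uses only core Perl. The variable names @hex and $letters are illustrative and do not come from the question.

```perl
use strict;
use warnings;

# Hexadecimal code points exactly as listed in the question.
my @hex = qw(104 106 118 141 143 D3 15A 179 17B 105 107 119 142 144 F3 15B 17A 17C);

# hex() parses the hexadecimal string, chr() turns the number into a character.
my $letters = join '', map { chr(hex($_)) } @hex;

# Encode on output so the terminal receives UTF-8 bytes, not wide characters.
binmode STDOUT, ':encoding(UTF-8)';
print "$letters\n";
```

On a UTF-8 terminal this should print the nine uppercase Polish letters followed by their lowercase forms.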

1 answer:

Answer 0 (score: 3)

HTML::TreeBuilder parse_file: charset autodetection


open(my $fh, '<:encoding(UTF-8)', 'index.html')   # open for reading and decode the input as UTF-8
    or die "Cannot open index.html: $!";
my $tr = HTML::TreeBuilder->new->parse_file($fh);
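
The effect of opening a file with an explicit encoding layer can be checked without HTML::TreeBuilder. The sketch below uses only core modules. It writes a short Polish phrase from the question to a temporary file, then reads it back once as raw bytes and once through the UTF-8 layer. The temporary file and all variable names are assumptions made for this illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Sample text from the question, written with escapes so the script stays ASCII.
my $text = "p\x{142}yta b\x{119}d\x{105}ca";

# Write the text to a throwaway file as UTF-8.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out, ':encoding(UTF-8)';
print {$out} $text;
close $out or die "close failed: $!";

# Read it back as raw bytes: no decoding takes place.
open my $raw, '<:raw', $path or die "open failed: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

# Read it back through the UTF-8 layer: bytes are decoded into characters.
open my $in, '<:encoding(UTF-8)', $path or die "open failed: $!";
my $chars = do { local $/; <$in> };
close $in;

printf "raw: %d bytes, decoded: %d characters\n", length($bytes), length($chars);
```

This prints "raw: 15 bytes, decoded: 12 characters". A handle opened the way $in is opened here delivers already-decoded text, which is the kind of handle the answer passes to parse_file.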

Alternatively, use IO::HTML, which detects the character set automatically when it opens the file.

use IO::HTML;                 # exports html_file by default
my $tr = HTML::TreeBuilder->new->parse_file(html_file('index.html'));

From man HTML::TreeBuilder:

   When you pass a filename to "parse_file", HTML::Parser opens it in binary mode,
   which means it's interpreted as Latin-1 (ISO-8859-1).  If the file
   is in another encoding, like UTF-8 or UTF-16, this will not do the right thing.
   For opening a HTML file with automatic charset detection: IO::HTML.
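
To make the Latin-1 remark in the documentation excerpt concrete, the following sketch uses the core Encode module to decode the same two bytes both ways. It is an illustration of the failure mode, not a trace of what HTML::Parser does internally, and the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes for the single character U+0142 (Polish l with stroke).
my $bytes = "\xC5\x82";

# Correct interpretation: the two bytes decode to one character.
my $as_utf8 = decode('UTF-8', $bytes);

# Latin-1 interpretation: every byte becomes a character of its own.
my $as_latin1 = decode('ISO-8859-1', $bytes);

# Writing the misread string out as UTF-8 encodes each wrong character again.
my $reencoded = encode('UTF-8', $as_latin1);

printf "UTF-8: %d char (U+%04X)\n", length($as_utf8), ord($as_utf8);
printf "Latin-1: %d chars, %d bytes once re-encoded\n",
    length($as_latin1), length($reencoded);
```

The second line of output reports 2 characters and 4 bytes. That doubling is consistent with what the question's script would produce when it parses the file by name and then prints through the :utf8 output layer set by use open.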