How to parse Chinese characters with Simple_Html_Dom

Date: 2014-06-15 06:53:29

Tags: php

I am trying to scrape data from the Taobao website.

<!DOCTYPE html>
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <title></title>
    </head>
    <body>
        <?php
        include_once('simple_html_dom.php');
        $target_url = "http://item.taobao.com/item.htm?spm=a2106.m893.1000384.54.61Q4Fp&id=37676614376&_u=fm86qe4d813&scm=1029.newlist-0.1.50006843&ppath=&sku=&ug=#detail";
        $html = new simple_html_dom();
        $html->load_file($target_url);
        foreach ($html->find('h3[class=tb-main-title]') as $post) {
            echo html_entity_decode($post, ENT_QUOTES, "ISO-8859-1") . "<br />";
        }
        ?>
    </body>
</html>

But it displays the product title like this:

2014��ЬŮʿ�������¿��ϸ��ƽ���ļ��¿����ϴ���ƽ����Ь��  

1 answer:

Answer 0 (score: 0)

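To avoid this, you need to use the iconv function: the text comes back in a GB2312-family encoding, while your page declares UTF-8, so it has to be converted before you echo it. Consider this example: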

include 'simple_html_dom.php';
$target_url = "http://item.taobao.com/item.htm?spm=a2106.m893.1000384.54.61Q4Fp&id=37676614376&_u=fm86qe4d813&scm=1029.newlist-0.1.50006843&ppath=&sku=&ug=#detail";
// Fetch the raw page bytes (not UTF-8).
$contents = file_get_contents($target_url);

$html = str_get_html($contents);
foreach ($html->find('h3[class=tb-main-title]') as $post) {
    $text = $post->innertext;
    // Convert the title from GB2312 to UTF-8 before printing it.
    $text = iconv('gb2312', 'utf-8', $text);
    echo $text;
    // Output: 2014拖鞋女士人字拖新款豹纹细带平底夏季新款凉拖大码平底拖鞋潮
}
}
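
If you want the conversion to be a bit more defensive, here is a minimal sketch. It makes a few assumptions that are not part of the answer above: `to_utf8` is a hypothetical helper name, `GBK` is used as the default source charset because it is a superset of GB2312, and the UTF-8 check is only a heuristic (GBK bytes can occasionally also be valid UTF-8). It relies only on the `iconv` extension and PCRE.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): return $bytes as UTF-8.
// Assumption: non-UTF-8 input is encoded in $from (GBK by default).
function to_utf8(string $bytes, string $from = 'GBK'): string
{
    // With the /u modifier, preg_match() returns false for invalid UTF-8,
    // so a result of 1 means the input is already valid UTF-8.
    if (preg_match('//u', $bytes) === 1) {
        return $bytes;
    }

    // iconv() returns false when it hits bytes that are illegal in $from.
    // The @ only hides the accompanying notice; the failure is handled below.
    $converted = @iconv($from, 'UTF-8', $bytes);
    if ($converted === false) {
        throw new RuntimeException("Could not convert input from $from to UTF-8");
    }
    return $converted;
}

// "\xD6\xD0\xCE\xC4" is the GBK/GB2312 byte sequence for 中文.
echo to_utf8("\xD6\xD0\xCE\xC4"), PHP_EOL; // prints 中文
```

Inside the loop from the answer you would then write `echo to_utf8($post->innertext);` instead of calling `iconv` directly. Titles that are already UTF-8 pass through unchanged, and a failed conversion raises an exception instead of silently printing nothing.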