Strange characters on web page after HTML Tidy

Time: 2011-03-06 10:22:57

Tags: php codeigniter htmltidy html-entities

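I'm fetching content (product descriptions, for example) through Amazon Web Services. Because content from Amazon is often very poorly marked up, it ends up breaking my page layout. So I wrote a function that uses HTML Tidy to "clean up" the content.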

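Strangely, when I test it separately from my application, everything seems to work fine. But inside my application (which runs on CodeIgniter), the function returns odd characters.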

The code below is my test script. It outputs what I believe I need.

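In my application, I fetch the description from the database, clean it, and then display it on my web page. After cleaning, for example, document’s (you can see this word in the sample below) becomes document&acirc;&euro;&trade;s (again, only in the actual application, not in the test code; the function is identical in both).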

Any idea why? Here is my test function:

    $amazon_content = <<<AMAZON
JavaScript is the brains of your Web page—it enables you to modify a document’s structure, styling, and content in response to user actions without requesting new pages from the server. Scriptin' with JavaScript and Ajax teaches you how to master this powerful and elegant language so you can develop intuitive user interactions that take the user experience to new levels of sophistication and responsiveness.<br><br>Today’s application-like Web experiences (such as Salesforce.com and Google Maps) and Web 2.0 sites (such as Flickr.com and Twitter) are powered by JavaScript and Ajax. Using the techniques shown in this book, you will be able to start creating similar experiences in the sites you design.<br><br>Scriptin' with JavaScript and Ajax will teach you how to:<br><ul><li>Start developing with JavaScript fast!</li></ul><ul><li>Write lightweight but powerful object-oriented code </li></ul><ul><li>Modify the Document Object Model </li></ul><ul><li>“Progressively enhance” your pages with JavaScript to provide the highest levels of accessibility to all users</li></ul><ul><li>Learn sophisticated techniques for making your pages respond to user actions</li></ul><ul><li>Use the downloadable Scriptin’ library of helper functions to speed development and ensure cross-browser compatibility</li></ul><ul><li>Use Ajax scripting techniques to update specific areas of the page with data from the server</li></ul><ul><li>Create powerful interface interactions, such as sliding panels and tree menus</li></ul><ul><li>Evaluate frameworks such as jQuery and Prototype to find the best one for your needs</li></ul><ul><li>Build an online application that looks and responds like a regular desktop application</li></ul><ul><li>Easily adapt the Scriptin’ code examples for use in your own projects—download them at www.scriptinwithajax.com</li></ul><br>
AMAZON;

    echo '<textarea cols="150" rows="12">' . $amazon_content . '</textarea>';
    echo '<textarea cols="150" rows="12">' . get_sanitized_amazon_content($amazon_content) . '</textarea>';
    echo  get_sanitized_amazon_content($amazon_content);

    function get_sanitized_amazon_content($amazon_content)
    {
        $tidy_config             = array(
            'bare' => TRUE,
            'clean' => TRUE,
            'drop-empty-paras' => TRUE,
            'drop-font-tags' => TRUE,
            'drop-proprietary-attributes' => TRUE,
            'enclose-text' => TRUE,
            'fix-backslash' => TRUE,
            'fix-bad-comments' => TRUE,
            'fix-uri' => TRUE,
            'hide-comments' => TRUE,
            'hide-endtags' => TRUE,
            'logical-emphasis' => TRUE,
            'lower-literals' => TRUE,
            'merge-divs' => TRUE,
            'output-xhtml' => TRUE,
            'quote-ampersand' => TRUE,
            'quote-marks' => TRUE,
            'show-body-only' => TRUE,
            'word-2000' => TRUE
        );
        $tidy                    = new tidy();
        $sanitized_amazon_markup = $tidy->repairString($amazon_content, $tidy_config);

        // Replace carriage returns, line feeds, tabs with single space
        $sanitized_amazon_markup = preg_replace('/\r|\n|\t/', ' ', $sanitized_amazon_markup);

        // Removes unnecessary tags
        // TODO: get complete list; put in an array
        $sanitized_amazon_markup = strip_tag($sanitized_amazon_markup, 'div');
        $sanitized_amazon_markup = strip_tag($sanitized_amazon_markup, 'span');

        // Replace double spaces with single space
        $sanitized_amazon_markup = preg_replace('/ {2,}/i', ' ', $sanitized_amazon_markup);

        // Remove leading and trailing space
        $sanitized_amazon_markup = trim($sanitized_amazon_markup);

        return $sanitized_amazon_markup;
    }

    function strip_tag($tagged_content, $tag_name)
    {
        return preg_replace('%<[ \t\r\n]*/?[ \t\r\n]*' . $tag_name . '.*?>%i', '', $tagged_content);
    }

Update

This is what I get inside my application:

<p>JavaScript is the brains of your Web page&acirc;&euro;&quot;it enables you to modify a document&acirc;&euro;&trade;s structure, styling, and content in response to user actions without requesting new pages from the server. Scriptin&#39; with JavaScript and Ajax teaches you how to master this powerful and elegant language so you can develop intuitive user interactions that take the user experience to new levels of sophistication and responsiveness.<br /> <br /> Today&acirc;&euro;&trade;s application-like Web experiences (such as Salesforce.com and Google Maps) and Web 2.0 sites (such as Flickr.com and Twitter) are powered by JavaScript and Ajax. Using the techniques shown in this book, you will be able to start creating similar experiences in the sites you design.<br /> <br /> Scriptin&#39; with JavaScript and Ajax will teach you how to:<br /></p> <ul> <li>Start developing with JavaScript fast!</li> </ul> <ul> <li>Write lightweight but powerful object-oriented code</li> </ul> <ul> <li>Modify the Document Object Model</li> </ul> <ul> <li>&acirc;&euro;&oelig;Progressively enhance&acirc;&euro; your pages with JavaScript to provide the highest levels of accessibility to all users</li> </ul> <ul> <li>Learn sophisticated techniques for making your pages respond to user actions</li> </ul> <ul> <li>Use the downloadable Scriptin&acirc;&euro;&trade; library of helper functions to speed development and ensure cross-browser compatibility</li> </ul> <ul> <li>Use Ajax scripting techniques to update specific areas of the page with data from the server</li> </ul> <ul> <li>Create powerful interface interactions, such as sliding panels and tree menus</li> </ul> <ul> <li>Evaluate frameworks such as jQuery and Prototype to find the best one for your needs</li> </ul> <ul> <li>Build an online application that looks and responds like a regular desktop application</li> </ul> <ul> <li>Easily adapt the Scriptin&acirc;&euro;&trade; code examples for use in your own 
projects&acirc;&euro;&quot;download them at www.scriptinwithajax.com</li> </ul> <p><br /></p>

This is what I get outside of the application:

<p>JavaScript is the brains of your Web page-it enables you to modify a document's structure, styling, and content in response to user actions without requesting new pages from the server. Scriptin' with JavaScript and Ajax teaches you how to master this powerful and elegant language so you can develop intuitive user interactions that take the user experience to new levels of sophistication and responsiveness.<br /> <br /> Today's application-like Web experiences (such as Salesforce.com and Google Maps) and Web 2.0 sites (such as Flickr.com and Twitter) are powered by JavaScript and Ajax. Using the techniques shown in this book, you will be able to start creating similar experiences in the sites you design.<br /> <br /> Scriptin' with JavaScript and Ajax will teach you how to:<br /></p> <ul> <li>Start developing with JavaScript fast!</li> </ul> <ul> <li>Write lightweight but powerful object-oriented code</li> </ul> <ul> <li>Modify the Document Object Model</li> </ul> <ul> <li>"Progressively enhance" your pages with JavaScript to provide the highest levels of accessibility to all users</li> </ul> <ul> <li>Learn sophisticated techniques for making your pages respond to user actions</li> </ul> <ul> <li>Use the downloadable Scriptin' library of helper functions to speed development and ensure cross-browser compatibility</li> </ul> <ul> <li>Use Ajax scripting techniques to update specific areas of the page with data from the server</li> </ul> <ul> <li>Create powerful interface interactions, such as sliding panels and tree menus</li> </ul> <ul> <li>Evaluate frameworks such as jQuery and Prototype to find the best one for your needs</li> </ul> <ul> <li>Build an online application that looks and responds like a regular desktop application</li> </ul> <ul> <li>Easily adapt the Scriptin' code examples for use in your own projects-download them at www.scriptinwithajax.com</li> </ul> <p><br /></p>

1 Answer:

Answer 0: (score: 3)

The character between "page" and "it" is not a simple minus sign (ASCII 0x2d) but a long dash (specifically, U+2014 EM DASH). Encoded as UTF-8, it is a three-byte sequence: 0xe2 0x80 0x94.

If you interpret that sequence in the Windows-1252 encoding, you get:

0xe2 => â => &acirc;
0x80 => € => &euro;
0x94 => ” (right double quotation mark, U+201D) => &quot;
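The byte-to-entity mapping above can be reproduced in a few lines. This is only a sketch: `misread_as_cp1252()` is a hypothetical helper name, and it assumes the mbstring extension is available. It deliberately misreads UTF-8 bytes as Windows-1252 and then converts the result to named entities with `htmlentities()`.

```php
<?php
// Hypothetical helper: take UTF-8 bytes, pretend they are Windows-1252,
// and show the named HTML entities that result from the misreading.
function misread_as_cp1252($utf8_bytes)
{
    // Each byte is treated as one Windows-1252 character and re-encoded as UTF-8.
    $misread = mb_convert_encoding($utf8_bytes, 'UTF-8', 'Windows-1252');

    // Replace every character that has a named HTML 4.01 entity.
    return htmlentities($misread, ENT_QUOTES | ENT_HTML401, 'UTF-8');
}

$em_dash    = "\xE2\x80\x94"; // U+2014 EM DASH encoded as UTF-8
$apostrophe = "\xE2\x80\x99"; // U+2019 RIGHT SINGLE QUOTATION MARK as UTF-8

echo bin2hex($em_dash), ' => ', misread_as_cp1252($em_dash), "\n";
echo bin2hex($apostrophe), ' => ', misread_as_cp1252($apostrophe), "\n";
```

The second line of output ends in `&trade;`, matching the `document&acirc;&euro;&trade;s` in the question. For the dash, `htmlentities()` produces `&rdquo;` as the last entity where the question's output shows `&quot;`; that difference presumably comes from how Tidy is configured to handle quotation marks.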

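So you have an encoding problem. You are receiving UTF-8 as input, but it is being interpreted as Windows-1252. Tidy is then converting the characters outside 7-bit ASCII into HTML entities, just as it should.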
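If the data really is UTF-8, one thing worth trying is to tell Tidy so explicitly rather than relying on its default. The sketch below passes `'utf8'` as the third argument of `repairString()`; the helper name `tidy_repair_utf8()` and the reduced configuration array are assumptions made for illustration, not the function from the question.

```php
<?php
// Hypothetical helper: repair an HTML fragment while telling Tidy that
// both its input and its output are UTF-8.
function tidy_repair_utf8($html)
{
    // Reduced config for illustration; the question uses a longer list.
    $config = array(
        'output-xhtml'   => true,
        'show-body-only' => true,
    );

    $tidy = new tidy();

    // The third argument uses Tidy's own spelling of the encoding name:
    // "utf8", not "UTF-8".
    return $tidy->repairString($html, $config, 'utf8');
}

// Only run the demo when the tidy extension is actually loaded.
if (extension_loaded('tidy')) {
    $fragment = "<p>a document\xE2\x80\x99s structure</p>";
    echo tidy_repair_utf8($fragment), "\n";
}
```

Whether this resolves the problem depends on the stored descriptions actually being UTF-8 and on the page being served with a matching charset.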

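As for why this happens inside your app but not outside it, there are a few possibilities. One is that the locale/encoding configuration differs between the two environments. Another is that when you test outside the application you are not working with exactly the same data as what comes from the network, i.e. the bytes you get are in a different encoding (possibly because they were converted along the way).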
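To tell these two possibilities apart, it may help to dump the same value in both environments and compare the raw bytes. The following is a minimal diagnostic sketch; `describe_encoding()` is a hypothetical helper, and it assumes the mbstring extension is available.

```php
<?php
// Hypothetical diagnostic helper: summarise a string's bytes so the same
// value can be compared inside and outside the application.
function describe_encoding($text)
{
    return array(
        // True when the bytes form a well-formed UTF-8 sequence.
        'valid_utf8'    => mb_check_encoding($text, 'UTF-8'),
        // True when at least one byte is outside 7-bit ASCII.
        'has_non_ascii' => (bool) preg_match('/[\x80-\xFF]/', $text),
        // Raw bytes as hex, easy to compare between environments.
        'hex'           => bin2hex($text),
    );
}

$as_utf8   = "page\xE2\x80\x94it"; // em dash encoded as UTF-8
$as_cp1252 = "page\x97it";         // em dash encoded as Windows-1252

print_r(describe_encoding($as_utf8));
print_r(describe_encoding($as_cp1252));

// Configuration values that can differ between the two environments.
echo 'default_charset: ', ini_get('default_charset'), "\n";
echo 'mb_internal_encoding: ', mb_internal_encoding(), "\n";
```

If the hex dumps match in both environments, the data is the same and the difference is more likely in configuration; if they differ, the bytes were changed somewhere between the source and the call to Tidy.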