Converting accented characters and HTML entities to UTF-8?

Asked: 2014-12-29 05:04:55

Tags: php html encoding utf-8 html-entities

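I'm working on a project that lets me download stories from Portkey.org to read on my Kindle, and I can't for the life of me figure out how to correctly encode/parse the HTML scraped from the site. I'm using simple_html_dom to scrape it, and I pass the innertext of the main element that holds the story on for parsing.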

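Here is what I'm trying to accomplish:

  1. Grab the HTML from a Portkey.org story
  2. Convert all HTML entities on the page to regular characters for reading (for example, the entities for ”, “ and … become those literal characters)
  3. Leave any accented characters or characters from other languages (Korean, Japanese, Chinese, and so on) as they are
  4. Repair the HTML with tidy and save it to a file

Everything I have tried so far produces one of the following:

  • Diamonds with question marks where the accented characters should be
  • Corrupted UTF-8 characters where the quotes and ellipses should be, although the accented characters display correctly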
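Both symptoms are typical of a mismatch between Windows-1252 (or ISO-8859-1) and UTF-8. The sketch below reproduces them in isolation. The sample bytes are hypothetical, not taken from the Portkey.org page, and it assumes the mbstring extension is available.

```php
<?php
// Hypothetical sample bytes, not taken from the Portkey.org page.
$cp1252_e_acute = "\xE9";          // "é" stored as a single Windows-1252 byte
$utf8_ellipsis  = "\xE2\x80\xA6";  // "…" stored as three UTF-8 bytes

// Symptom 1: a lone 0xE9 byte is not valid UTF-8, so a page served as
// UTF-8 shows the replacement character (the question-mark diamond).
$is_valid_utf8 = mb_check_encoding($cp1252_e_acute, 'UTF-8');
var_dump($is_valid_utf8); // bool(false)

// Symptom 2: UTF-8 bytes that are read as Windows-1252 turn into
// three unrelated characters instead of one ellipsis.
$misread = mb_convert_encoding($utf8_ellipsis, 'UTF-8', 'Windows-1252');
echo $misread, "\n"; // â€¦
```

If the first check fails on the scraped bytes, the page is probably not UTF-8 to begin with. If the second pattern appears, UTF-8 text is being interpreted as a single-byte charset somewhere along the way.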

Example from the story HTML:

    <p> Wel [snip] your emotions&hellip;but most impor [snip] ng fiancé </p>

Edit:

html_entity_decode produces the following output:

    Wel [snip] your emotions…but most impor [snip] ng fiancé

As you can see, the accented character is correct, but the character decoded from &hellip; now displays incorrectly.

Edit 2:

Result of get_html_translation_table(HTML_ENTITIES) (excerpt):

    array(252) {
      ["""]=> string(6) "&quot;"
      ["&"]=> string(5) "&amp;"
      ["<"]=> string(4) "&lt;"
      [">"]=> string(4) "&gt;"
      [snip]
      ["é"]=> string(8) "&eacute;"
      [snip]
      ["…"]=> string(8) "&hellip;"
      [snip]
      ["♣"]=> string(7) "&clubs;"
      ["♥"]=> string(8) "&hearts;"
      ["♦"]=> string(7) "&diams;"
    }

Edit 3:

Just for full disclosure, I have set up a test file to work on this problem. In it, all entities currently display correctly, but the accented characters do not.

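2 answers:

Answer 0 (score: 1)

You probably want html_entity_decode. From the docs: "Convert all HTML entities to their applicable characters." Depending on your PHP version and settings, you may have to specify the encoding manually. Something like: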

html_entity_decode($raw_text, ENT_QUOTES, 'UTF-8');

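Tidy may be re-encoding your entities. I'm not sure how complex your input string is, but if you don't need the formatting to match exactly, you could consider removing the HTML tags with something like strip_tags.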
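As a minimal sketch of that call, assuming the script file itself is saved as UTF-8: the sample string below is hypothetical, and shows that characters which are already UTF-8 pass through untouched while the entities are decoded.

```php
<?php
// Hypothetical input: entities mixed with characters that are already UTF-8.
$raw_text = '&ldquo;fiancé&rdquo; &hellip; 日本語 한국어';

// Naming UTF-8 explicitly avoids relying on the PHP version's default charset.
// ENT_QUOTES also decodes single-quote entities such as &#039;.
$decoded = html_entity_decode($raw_text, ENT_QUOTES, 'UTF-8');

echo $decoded, "\n";
```

Only the entity references change here. The accented and CJK characters are left byte-for-byte as they were, which matches the third goal in the question.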

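Answer 1 (score: 0)

I reached my goal by changing the tidy encoding from

$tidy->parseString($html, $config, 'utf8');

to

$tidy->parseString($html, $config, 'win1252');

This converts the accented characters to HTML entities. I then use html_entity_decode to convert all of the entities to UTF-8 characters.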
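For comparison, a different technique reaches a similar result without tidy: re-encode the raw bytes first, then decode the entities. This is only a sketch. It assumes the scraped bytes really are Windows-1252, which the thread does not confirm, and that mbstring is installed. The helper name to_utf8_text is hypothetical.

```php
<?php
// Hypothetical helper: neither the name nor the Windows-1252 default
// comes from the answer above.
function to_utf8_text(string $bytes, string $source_charset = 'Windows-1252'): string
{
    // Step 1: re-encode the raw bytes so accented characters become valid UTF-8.
    $utf8 = mb_convert_encoding($bytes, 'UTF-8', $source_charset);

    // Step 2: turn named and numeric entities into UTF-8 characters.
    return html_entity_decode($utf8, ENT_QUOTES, 'UTF-8');
}

// "\xE9" is "é" as a Windows-1252 byte; &hellip; is still an entity.
$sample = "your emotions&hellip;but most [snip] fianc\xE9";
$result = to_utf8_text($sample);

echo $result, "\n";
```

If the input were already UTF-8, step 1 would corrupt it, so the source charset should be checked before using anything like this.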

New test file (works!)

<?php

// Tell the browser that the output is UTF-8.
header('Content-Type: text/html; charset=UTF-8');

require_once('_RESOURCES/simple_html_dom.php');

$url = 'http://fanfiction.portkey.org/index.php?act=read&storyid=1585&chapterid=&agree=1';

// Repair the markup with tidy, reading the input as Windows-1252.
function tidyHTML($html) {
    $tidy = new tidy;
    $config = array('indent' => true, 'output-xhtml' => false, 'wrap' => 200, 'clean' => false, 'show-body-only' => true);
    $tidy->parseString($html, $config, 'win1252');
    $tidy->cleanRepair();
    // Return the repaired markup as a string rather than the tidy object.
    return tidy_get_output($tidy);
}

// Collapse whitespace between tags, merge adjacent <b> and <i> runs, and drop <br> tags.
function filter($html) {
    $html = preg_replace('~>\s+<~', '><', $html);
    $html = preg_replace('/<\/b>\s?<b>/', '', $html);
    $html = preg_replace('/<\/i>\s?<i>/', '', $html);
    $html = str_replace('<br>', '', $html);
    return $html;
}

$page_html = file_get_html($url);
$chapter_html = $page_html->find('td[class="story"]', 0);
// Blank out the <center> blocks inside the story cell.
foreach ($chapter_html->find('center') as $node) { $node->outertext = ''; }

// Decode entities last, naming UTF-8 explicitly so the result does not depend on the PHP default charset.
echo filter(html_entity_decode(tidyHTML($chapter_html->innertext), ENT_QUOTES, 'UTF-8'));

?>

Couldn't have done it without you, Skunkwaffle!