Question

I have a database in which some values contain HTML special characters:

| Universidad Tecnol&amp;#243;gica Nacional - UTN                                                  |
| Instituto Tecnol&amp;#243;gico de Buenos Aires                                                   |
| Instituto Superior del Profesorado &amp;quot;Dr. Joaqu&amp;#237;n V. Gonz&amp;#225;lez&amp;quot; |
| Escuela Nacional de N&amp;#225;utica &amp;quot;Manuel Belgrano&amp;quot;                         |
| Conservatorio Nacional de M&amp;#250;sica &amp;quot;Carlos L&amp;#243;pez Buchardo&amp;quot;     |
| Instituto Argentino de Computacion - IAC                                                         |
| Conservatorio de Superior de M&amp;#250;sica &amp;quot;Manuel de Falla&amp;quot;                 |

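I need to convert these to the correct UTF format. Is there anything better I can do than iterating over the database and mapping each code to its equivalent symbol?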

&amp;#225; -> 'á'
&amp;quot; -> '"'
...

Answer 1

As mentioned in my comment above, it is very unclear what you really want to do in your own case.

Is there anything better I can do than iterating over the database and mapping each code to its equivalent symbol?

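Well, yes. Numeric character references (such as &#123; and &#x1AB;) can be replaced with the character they denote without looking the code up in a "mapping". Named entities (such as &quot;), however, always need a lookup.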
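To illustrate the distinction, here is a minimal Python sketch (Python is used only for illustration, and the function name `decode_reference` is hypothetical). A numeric reference carries its own code point, while a named one needs a lookup table, here the standard library's `html.entities.name2codepoint`.

```python
from html.entities import name2codepoint

def decode_reference(ref: str) -> str:
    """Decode one reference such as '&#243;', '&#xF3;' or '&quot;'."""
    body = ref[1:-1]  # strip the leading '&' and the trailing ';'
    if body.startswith(("#x", "#X")):
        return chr(int(body[2:], 16))   # hexadecimal: no table needed
    if body.startswith("#"):
        return chr(int(body[1:]))       # decimal: no table needed
    return chr(name2codepoint[body])    # named: must be looked up

print(decode_reference("&#243;"))
print(decode_reference("&quot;"))
```

The sketch assumes a well-formed reference that ends in a semicolon; an unknown name raises `KeyError` rather than being passed through.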

Here is my attempt at solving the general case:

Create a table to store the named character entities defined in HTML:

CREATE TABLE ents (
  ref VARCHAR(8) NOT NULL COLLATE utf8_bin,
  rep CHAR(1)    NOT NULL,
  PRIMARY KEY (ref)
);

Populate this table - I suggest using a script, for example from PHP:

$dbh = new PDO("mysql:dbname=$dbname", $username, $password);
$dbh->setAttribute(PDO::ATTR_EMULATE_PREPARES, FALSE);
$ins = $dbh->prepare('INSERT INTO ents (ref, rep) VALUES (?, ?)');
$t = get_html_translation_table(HTML_ENTITIES);
foreach ($t as $k => $v) $ins->execute([substr($v, 1, -1), $k]);

Define an SQL function to perform the entity replacement (using this table where applicable, or the character code otherwise):

DELIMITER ;;

CREATE FUNCTION dhe(s TEXT) RETURNS TEXT
BEGIN
  DECLARE n, p, i, t INT DEFAULT 0;
  DECLARE r VARCHAR(12);
  entity_search: LOOP
    SET n := LOCATE('&', s, n+1);
    IF (!n) THEN
      LEAVE entity_search;
    END IF;

    IF (SUBSTRING(s, n+1, 1) = '#') THEN
      CASE
        WHEN SUBSTRING(s, n+2, 1) RLIKE '[[:digit:]]' THEN
          SET t := 2, p := n+2, r := '[[:digit:]]';
        WHEN SUBSTRING(s, n+2, 1) = 'x' THEN
          SET t := 3, p := n+3, r := '[[:xdigit:]]';
        ELSE ITERATE entity_search;
      END CASE;
    ELSE
      SET t := 1, p := n+1, r := '[[:alnum:]_]';
    END IF;

    SET i := 0;
    reference: LOOP
      IF SUBSTRING(s, p+i, 1) NOT RLIKE r THEN
        IF SUBSTRING(s, p+i, 1) RLIKE '[[:alnum:]_]' THEN
          ITERATE entity_search;
        END IF;
        LEAVE reference;
      END IF;
      IF i = 8 THEN ITERATE entity_search; END IF;
      SET i := i + 1;
    END LOOP reference;

    SET s := CONCAT(
      LEFT(s, n-1),
      CASE t
        WHEN 1 THEN COALESCE(
          (SELECT rep FROM ents WHERE ref = SUBSTRING(s, p, i))
        , SUBSTRING(s, n, i + IF(SUBSTRING(s, p+i, 1)=';',1,0))
        )
        WHEN 2 THEN CHAR(SUBSTRING(s, p, i))
        WHEN 3 THEN CHAR(CONV(SUBSTRING(s, p, i), 16, 10))
      END,
      SUBSTRING(s, p + i + IF(SUBSTRING(s, p+i, 1)=';',1,0))
    );
  END LOOP entity_search;
  RETURN s;
END;;

DELIMITER ;

Apply this function twice to decode your (apparently) doubly-encoded table:
```
UPDATE my_table SET my_column = dhe(dhe(my_column));
```
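Why two passes are needed can be seen in a small Python sketch. It uses the standard library's `html.unescape` as a stand-in for the `dhe` function, which is an assumption made for illustration only: the first pass turns `&amp;` into `&`, and only the second pass sees a decodable `&#243;`.

```python
import html

raw = "Tecnol&amp;#243;gica"
once = html.unescape(raw)    # '&amp;' becomes '&', leaving '&#243;' behind
twice = html.unescape(once)  # '&#243;' becomes 'ó'
print(once)
print(twice)
```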

Answer 2

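MySQL does not provide any function for decoding HTML entities. MySQL does not care about HTML and offers no features specifically for it.

If you do not want these in your database, you need to decode the strings before inserting them. If you are using PHP, the html_entity_decode() function is probably what you are looking for.

For the values already in the database, you will need to write a PHP script that goes through and reads each row, processes it, and then replaces the old row with the newly decoded one.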
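The read, decode, and write-back loop described above might look roughly like the following. This is a sketch in Python rather than PHP, and it uses an in-memory SQLite database as a stand-in for MySQL so that it runs on its own; the table and column names (`my_table`, `my_column`, `id`) are assumptions. Decoding twice is also an assumption, based on the double-encoded sample data in the question.

```python
import html
import sqlite3

def decode_twice(text: str) -> str:
    """Undo two layers of HTML entity encoding."""
    return html.unescape(html.unescape(text))

# Stand-in database with one encoded row and one clean row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, my_column TEXT)")
conn.executemany(
    "INSERT INTO my_table (my_column) VALUES (?)",
    [("Instituto Tecnol&amp;#243;gico de Buenos Aires",),
     ("Instituto Argentino de Computacion - IAC",)],
)

# Read every row, decode it, and write back only the rows that changed.
rows = conn.execute("SELECT id, my_column FROM my_table").fetchall()
for row_id, value in rows:
    decoded = decode_twice(value)
    if decoded != value:
        conn.execute(
            "UPDATE my_table SET my_column = ? WHERE id = ?",
            (decoded, row_id),
        )
conn.commit()

result = [r[0] for r in conn.execute("SELECT my_column FROM my_table ORDER BY id")]
print(result)
```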

Answer 3

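This looks like a one-off fix, so I would not think too hard about a generic solution. I would first find all such escaped symbols (replace the union below with a reference to your table):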

SELECT DISTINCT IF(@p:=LOCATE('&amp;', s), 
                   SUBSTR(s, @p, LOCATE(';', s, @p+5)-@p+1),
                   NULL) as e_chars
FROM (
SELECT 'Universidad Tecnol&amp;#243;gica Nacional - UTN' as s
UNION ALL
SELECT 'Instituto Tecnol&amp;#243;gico de Buenos Aires'
UNION ALL
SELECT 'Instituto Superior del Profesorado &amp;quot;Dr. Joaqu&amp;#237;n V. Gonz&amp;#225;lez&amp;quot;'
UNION ALL
SELECT 'Escuela Nacional de N&amp;#225;utica &amp;quot;Manuel Belgrano&amp;quot;'
UNION ALL
SELECT 'Conservatorio Nacional de M&amp;#250;sica &amp;quot;Carlos L&amp;#243;pez Buchardo&amp;quot;'
UNION ALL
SELECT 'Instituto Argentino de Computacion - IAC'
UNION ALL
SELECT 'Conservatorio de Superior de M&amp;#250;sica &amp;quot;Manuel de Falla&amp;quot;'
) as s;

This returns:

'&amp;#243;'
'&amp;quot;'
'&amp;#225;'
'&amp;#250;'

Then I would write a simple update query such as:

UPDATE t1 SET
s = REPLACE(s, '&amp;#243;', 'ó'),
s = REPLACE(s, '&amp;quot;', '"'),
s = REPLACE(s, '&amp;#225;', 'á'),
s = REPLACE(s, '&amp;#250;', 'ú')
WHERE LOCATE('&amp;', s);

Then you can repeat the first query to see what is left.
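The discovery step can also be sketched outside the database. The Python snippet below is only an illustration under assumptions: the function name `distinct_entities` and the regular expression are not part of the answer. Unlike the query above, which looks only at the first `&amp;` in each row, it collects every double-encoded reference in every value.

```python
import re

# '&amp;' followed by a decimal, hexadecimal, or named reference body and ';'.
DOUBLE_ENCODED = re.compile(
    r"&amp;(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);"
)

def distinct_entities(values):
    """Return the sorted set of double-encoded references found in values."""
    found = set()
    for value in values:
        found.update(DOUBLE_ENCODED.findall(value))
    return sorted(found)

sample = [
    "Universidad Tecnol&amp;#243;gica Nacional - UTN",
    "Escuela Nacional de N&amp;#225;utica &amp;quot;Manuel Belgrano&amp;quot;",
    "Instituto Argentino de Computacion - IAC",
]
print(distinct_entities(sample))
```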

Answer 4

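You can solve this problem in one of the following ways.

Update all the HTML-entity data to the correct encoding

This is the best long-term solution. Constantly converting HTML entities back and forth wastes CPU time. It may not be much for a single query (100 ms or less), but scale that to 1,000+ users doing it dozens of times per second and it quickly adds up to meaningful CPU time.

Use an SQL stored procedure/function to do the conversion.

I have done this a few times in the past, and it is a quick fix. The upside is that you can reuse the function, but you have to add every HTML entity you want to convert by hand, which gets very tedious. Here is the function I wrote.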

CREATE DEFINER = `root`@`localhost` FUNCTION `NewProc`(x longtext)
 RETURNS longtext
    NO SQL
    DETERMINISTIC
BEGIN
DECLARE TextString LONGTEXT;
SET TextString = x ;

#quotation mark
IF INSTR( x , '&quot;' )
THEN SET TextString = REPLACE(TextString, '&quot;','"') ;
END IF ;

#apostrophe 
IF INSTR( x , '&apos;' )
THEN SET TextString = REPLACE(TextString, '&apos;','''') ;
END IF ;

#ampersand
IF INSTR( x , '&amp;' )
THEN SET TextString = REPLACE(TextString, '&amp;','&') ;
END IF ;

#less-than
IF INSTR( x , '&lt;' )
THEN SET TextString = REPLACE(TextString, '&lt;','<') ;
END IF ;

#greater-than
IF INSTR( x , '&gt;' )
THEN SET TextString = REPLACE(TextString, '&gt;','>') ;
END IF ;

#non-breaking space
IF INSTR( x , '&nbsp;' )
THEN SET TextString = REPLACE(TextString, '&nbsp;',' ') ;
END IF ;

#inverted exclamation mark
IF INSTR( x , '&iexcl;' )
THEN SET TextString = REPLACE(TextString, '&iexcl;','¡') ;
END IF ;

#cent
IF INSTR( x , '&cent;' )
THEN SET TextString = REPLACE(TextString, '&cent;','¢') ;
END IF ;

#pound
IF INSTR( x , '&pound;' )
THEN SET TextString = REPLACE(TextString, '&pound;','£') ;
END IF ;

#currency
IF INSTR( x , '&curren;' )
THEN SET TextString = REPLACE(TextString, '&curren;','¤') ;
END IF ;

#yen
IF INSTR( x , '&yen;' )
THEN SET TextString = REPLACE(TextString, '&yen;','¥') ;
END IF ;

#broken vertical bar
IF INSTR( x , '&brvbar;' )
THEN SET TextString = REPLACE(TextString, '&brvbar;','¦') ;
END IF ;

#section
IF INSTR( x , '&sect;' )
THEN SET TextString = REPLACE(TextString, '&sect;','§') ;
END IF ;

#spacing diaeresis
IF INSTR( x , '&uml;' )
THEN SET TextString = REPLACE(TextString, '&uml;','¨') ;
END IF ;

#copyright
IF INSTR( x , '&copy;' )
THEN SET TextString = REPLACE(TextString, '&copy;','©') ;
END IF ;

#feminine ordinal indicator
IF INSTR( x , '&ordf;' )
THEN SET TextString = REPLACE(TextString, '&ordf;','ª') ;
END IF ;

#angle quotation mark (left)
IF INSTR( x , '&laquo;' )
THEN SET TextString = REPLACE(TextString, '&laquo;','«') ;
END IF ;

#negation
IF INSTR( x , '&not;' )
THEN SET TextString = REPLACE(TextString, '&not;','¬') ;
END IF ;

#soft hyphen
IF INSTR( x , '&shy;' )
THEN SET TextString = REPLACE(TextString, '&shy;','') ;
END IF ;

#registered trademark
IF INSTR( x , '&reg;' )
THEN SET TextString = REPLACE(TextString, '&reg;','®') ;
END IF ;

#spacing macron
IF INSTR( x , '&macr;' )
THEN SET TextString = REPLACE(TextString, '&macr;','¯') ;
END IF ;

#degree
IF INSTR( x , '&deg;' )
THEN SET TextString = REPLACE(TextString, '&deg;','°') ;
END IF ;

#plus-or-minus 
IF INSTR( x , '&plusmn;' )
THEN SET TextString = REPLACE(TextString, '&plusmn;','±') ;
END IF ;

#superscript 2
IF INSTR( x , '&sup2;' )
THEN SET TextString = REPLACE(TextString, '&sup2;','²') ;
END IF ;

#superscript 3
IF INSTR( x , '&sup3;' )
THEN SET TextString = REPLACE(TextString, '&sup3;','³') ;
END IF ;

#spacing acute
IF INSTR( x , '&acute;' )
THEN SET TextString = REPLACE(TextString, '&acute;','´') ;
END IF ;

#micro
IF INSTR( x , '&micro;' )
THEN SET TextString = REPLACE(TextString, '&micro;','µ') ;
END IF ;

#paragraph
IF INSTR( x , '&para;' )
THEN SET TextString = REPLACE(TextString, '&para;','¶') ;
END IF ;

#middle dot
IF INSTR( x , '&middot;' )
THEN SET TextString = REPLACE(TextString, '&middot;','·') ;
END IF ;

#spacing cedilla
IF INSTR( x , '&cedil;' )
THEN SET TextString = REPLACE(TextString, '&cedil;','¸') ;
END IF ;

#superscript 1
IF INSTR( x , '&sup1;' )
THEN SET TextString = REPLACE(TextString, '&sup1;','¹') ;
END IF ;

#masculine ordinal indicator
IF INSTR( x , '&ordm;' )
THEN SET TextString = REPLACE(TextString, '&ordm;','º') ;
END IF ;

#angle quotation mark (right)
IF INSTR( x , '&raquo;' )
THEN SET TextString = REPLACE(TextString, '&raquo;','»') ;
END IF ;

#fraction 1/4
IF INSTR( x , '&frac14;' )
THEN SET TextString = REPLACE(TextString, '&frac14;','¼') ;
END IF ;

#fraction 1/2
IF INSTR( x , '&frac12;' )
THEN SET TextString = REPLACE(TextString, '&frac12;','½') ;
END IF ;

#fraction 3/4
IF INSTR( x , '&frac34;' )
THEN SET TextString = REPLACE(TextString, '&frac34;','¾') ;
END IF ;

#inverted question mark
IF INSTR( x , '&iquest;' )
THEN SET TextString = REPLACE(TextString, '&iquest;','¿') ;
END IF ;

#multiplication
IF INSTR( x , '&times;' )
THEN SET TextString = REPLACE(TextString, '&times;','×') ;
END IF ;

#division
IF INSTR( x , '&divide;' )
THEN SET TextString = REPLACE(TextString, '&divide;','÷') ;
END IF ;

#capital a, grave accent
IF INSTR( x , '&Agrave;' )
THEN SET TextString = REPLACE(TextString, '&Agrave;','À') ;
END IF ;

#capital a, acute accent
IF INSTR( x , '&Aacute;' )
THEN SET TextString = REPLACE(TextString, '&Aacute;','Á') ;
END IF ;

#capital a, circumflex accent
IF INSTR( x , '&Acirc;' )
THEN SET TextString = REPLACE(TextString, '&Acirc;','Â') ;
END IF ;

#capital a, tilde
IF INSTR( x , '&Atilde;' )
THEN SET TextString = REPLACE(TextString, '&Atilde;','Ã') ;
END IF ;

#capital a, umlaut mark
IF INSTR( x , '&Auml;' )
THEN SET TextString = REPLACE(TextString, '&Auml;','Ä') ;
END IF ;

#capital a, ring
IF INSTR( x , '&Aring;' )
THEN SET TextString = REPLACE(TextString, '&Aring;','Å') ;
END IF ;

#capital ae
IF INSTR( x , '&AElig;' )
THEN SET TextString = REPLACE(TextString, '&AElig;','Æ') ;
END IF ;

#capital c, cedilla
IF INSTR( x , '&Ccedil;' )
THEN SET TextString = REPLACE(TextString, '&Ccedil;','Ç') ;
END IF ;

#capital e, grave accent
IF INSTR( x , '&Egrave;' )
THEN SET TextString = REPLACE(TextString, '&Egrave;','È') ;
END IF ;

#capital e, acute accent
IF INSTR( x , '&Eacute;' )
THEN SET TextString = REPLACE(TextString, '&Eacute;','É') ;
END IF ;

#capital e, circumflex accent
IF INSTR( x , '&Ecirc;' )
THEN SET TextString = REPLACE(TextString, '&Ecirc;','Ê') ;
END IF ;

#capital e, umlaut mark
IF INSTR( x , '&Euml;' )
THEN SET TextString = REPLACE(TextString, '&Euml;','Ë') ;
END IF ;

#capital i, grave accent
IF INSTR( x , '&Igrave;' )
THEN SET TextString = REPLACE(TextString, '&Igrave;','Ì') ;
END IF ;

#capital i, acute accent
IF INSTR( x , '&Iacute;' )
THEN SET TextString = REPLACE(TextString, '&Iacute;','Í') ;
END IF ;

#capital i, circumflex accent
IF INSTR( x , '&Icirc;' )
THEN SET TextString = REPLACE(TextString, '&Icirc;','Î') ;
END IF ;

#capital i, umlaut mark
IF INSTR( x , '&Iuml;' )
THEN SET TextString = REPLACE(TextString, '&Iuml;','Ï') ;
END IF ;

#capital eth, Icelandic
IF INSTR( x , '&ETH;' )
THEN SET TextString = REPLACE(TextString, '&ETH;','Ð') ;
END IF ;

#capital n, tilde
IF INSTR( x , '&Ntilde;' )
THEN SET TextString = REPLACE(TextString, '&Ntilde;','Ñ') ;
END IF ;

#capital o, grave accent
IF INSTR( x , '&Ograve;' )
THEN SET TextString = REPLACE(TextString, '&Ograve;','Ò') ;
END IF ;

#capital o, acute accent
IF INSTR( x , '&Oacute;' )
THEN SET TextString = REPLACE(TextString, '&Oacute;','Ó') ;
END IF ;

#capital o, circumflex accent
IF INSTR( x , '&Ocirc;' )
THEN SET TextString = REPLACE(TextString, '&Ocirc;','Ô') ;
END IF ;

#capital o, tilde
IF INSTR( x , '&Otilde;' )
THEN SET TextString = REPLACE(TextString, '&Otilde;','Õ') ;
END IF ;

#capital o, umlaut mark
IF INSTR( x , '&Ouml;' )
THEN SET TextString = REPLACE(TextString, '&Ouml;','Ö') ;
END IF ;

#capital o, slash
IF INSTR( x , '&Oslash;' )
THEN SET TextString = REPLACE(TextString, '&Oslash;','Ø') ;
END IF ;

#capital u, grave accent
IF INSTR( x , '&Ugrave;' )
THEN SET TextString = REPLACE(TextString, '&Ugrave;','Ù') ;
END IF ;

#capital u, acute accent
IF INSTR( x , '&Uacute;' )
THEN SET TextString = REPLACE(TextString, '&Uacute;','Ú') ;
END IF ;

#capital u, circumflex accent
IF INSTR( x , '&Ucirc;' )
THEN SET TextString = REPLACE(TextString, '&Ucirc;','Û') ;
END IF ;

#capital u, umlaut mark
IF INSTR( x , '&Uuml;' )
THEN SET TextString = REPLACE(TextString, '&Uuml;','Ü') ;
END IF ;

#capital y, acute accent
IF INSTR( x , '&Yacute;' )
THEN SET TextString = REPLACE(TextString, '&Yacute;','Ý') ;
END IF ;

#capital THORN, Icelandic
IF INSTR( x , '&THORN;' )
THEN SET TextString = REPLACE(TextString, '&THORN;','Þ') ;
END IF ;

#small sharp s, German
IF INSTR( x , '&szlig;' )
THEN SET TextString = REPLACE(TextString, '&szlig;','ß') ;
END IF ;

#small a, grave accent
IF INSTR( x , '&agrave;' )
THEN SET TextString = REPLACE(TextString, '&agrave;','à') ;
END IF ;

#small a, acute accent
IF INSTR( x , '&aacute;' )
THEN SET TextString = REPLACE(TextString, '&aacute;','á') ;
END IF ;

#small a, circumflex accent
IF INSTR( x , '&acirc;' )
THEN SET TextString = REPLACE(TextString, '&acirc;','â') ;
END IF ;

#small a, tilde
IF INSTR( x , '&atilde;' )
THEN SET TextString = REPLACE(TextString, '&atilde;','ã') ;
END IF ;

#small a, umlaut mark
IF INSTR( x , '&auml;' )
THEN SET TextString = REPLACE(TextString, '&auml;','ä') ;
END IF ;

#small a, ring
IF INSTR( x , '&aring;' )
THEN SET TextString = REPLACE(TextString, '&aring;','å') ;
END IF ;

#small ae
IF INSTR( x , '&aelig;' )
THEN SET TextString = REPLACE(TextString, '&aelig;','æ') ;
END IF ;

#small c, cedilla
IF INSTR( x , '&ccedil;' )
THEN SET TextString = REPLACE(TextString, '&ccedil;','ç') ;
END IF ;

#small e, grave accent
IF INSTR( x , '&egrave;' )
THEN SET TextString = REPLACE(TextString, '&egrave;','è') ;
END IF ;

#small e, acute accent
IF INSTR( x , '&eacute;' )
THEN SET TextString = REPLACE(TextString, '&eacute;','é') ;
END IF ;

#small e, circumflex accent
IF INSTR( x , '&ecirc;' )
THEN SET TextString = REPLACE(TextString, '&ecirc;','ê') ;
END IF ;

#small e, umlaut mark
IF INSTR( x , '&euml;' )
THEN SET TextString = REPLACE(TextString, '&euml;','ë') ;
END IF ;

#small i, grave accent
IF INSTR( x , '&igrave;' )
THEN SET TextString = REPLACE(TextString, '&igrave;','ì') ;
END IF ;

#small i, acute accent
IF INSTR( x , '&iacute;' )
THEN SET TextString = REPLACE(TextString, '&iacute;','í') ;
END IF ;

#small i, circumflex accent
IF INSTR( x , '&icirc;' )
THEN SET TextString = REPLACE(TextString, '&icirc;','î') ;
END IF ;

#small i, umlaut mark
IF INSTR( x , '&iuml;' )
THEN SET TextString = REPLACE(TextString, '&iuml;','ï') ;
END IF ;

#small eth, Icelandic
IF INSTR( x , '&eth;' )
THEN SET TextString = REPLACE(TextString, '&eth;','ð') ;
END IF ;

#small n, tilde
IF INSTR( x , '&ntilde;' )
THEN SET TextString = REPLACE(TextString, '&ntilde;','ñ') ;
END IF ;

#small o, grave accent
IF INSTR( x , '&ograve;' )
THEN SET TextString = REPLACE(TextString, '&ograve;','ò') ;
END IF ;

#small o, acute accent
IF INSTR( x , '&oacute;' )
THEN SET TextString = REPLACE(TextString, '&oacute;','ó') ;
END IF ;

#small o, circumflex accent
IF INSTR( x , '&ocirc;' )
THEN SET TextString = REPLACE(TextString, '&ocirc;','ô') ;
END IF ;

#small o, tilde
IF INSTR( x , '&otilde;' )
THEN SET TextString = REPLACE(TextString, '&otilde;','õ') ;
END IF ;

#small o, umlaut mark
IF INSTR( x , '&ouml;' )
THEN SET TextString = REPLACE(TextString, '&ouml;','ö') ;
END IF ;

#small o, slash
IF INSTR( x , '&oslash;' )
THEN SET TextString = REPLACE(TextString, '&oslash;','ø') ;
END IF ;

#small u, grave accent
IF INSTR( x , '&ugrave;' )
THEN SET TextString = REPLACE(TextString, '&ugrave;','ù') ;
END IF ;

#small u, acute accent
IF INSTR( x , '&uacute;' )
THEN SET TextString = REPLACE(TextString, '&uacute;','ú') ;
END IF ;

#small u, circumflex accent
IF INSTR( x , '&ucirc;' )
THEN SET TextString = REPLACE(TextString, '&ucirc;','û') ;
END IF ;

#small u, umlaut mark
IF INSTR( x , '&uuml;' )
THEN SET TextString = REPLACE(TextString, '&uuml;','ü') ;
END IF ;

#small y, acute accent
IF INSTR( x , '&yacute;' )
THEN SET TextString = REPLACE(TextString, '&yacute;','ý') ;
END IF ;

#small thorn, Icelandic
IF INSTR( x , '&thorn;' )
THEN SET TextString = REPLACE(TextString, '&thorn;','þ') ;
END IF ;

#small y, umlaut mark
IF INSTR( x , '&yuml;' )
THEN SET TextString = REPLACE(TextString, '&yuml;','ÿ') ;
END IF ;

RETURN TextString ;
END;

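Convert the entities in your code.

As JMack mentioned, this functionality is built into PHP, as well as various other languages such as Ruby, Java, etc. This is another quick fix.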

Answer 5

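If I were you, I would cross my fingers and take care of the following:

I back up before every operation.
I have a test that tells me whether the replacement succeeded.
I log the queries I have run, so I can do some UNDO operations.
I do not use anything that needs to be applied twice because it is incomplete.

What I would probably do is write a PHP routine that generates the queries, one per row. It checks for rows containing double-encoded values and generates the changes needed.

Each query addresses its row by unique ID (primary key) in the WHERE clause, so you know which rows were changed, not just which data.

Once that is done, you can run this SQL in a batch against the database. You can also run it against a copy of the real database as a kind of dry run, and do some quality checks afterwards.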
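A rough sketch of such a generator, in Python rather than PHP, is shown below. Everything named here is an assumption for illustration: the function `plan_updates`, the table, column, and key names, and the `%s` placeholder style, which matches common MySQL drivers. Decoding twice inside a single generated change is likewise an assumption based on the double-encoded sample data. The sketch emits one parameterized UPDATE per changed row, keyed by primary key, plus a matching undo statement that restores the old value.

```python
import html

def plan_updates(rows, table="my_table", column="my_column", key="id"):
    """Return (redo, undo) lists of (sql, params) pairs for rows that change."""
    sql = f"UPDATE {table} SET {column} = %s WHERE {key} = %s"
    redo, undo = [], []
    for row_id, value in rows:
        decoded = html.unescape(html.unescape(value))
        if decoded != value:  # skip rows that need no change
            redo.append((sql, (decoded, row_id)))
            undo.append((sql, (value, row_id)))  # restores the original value
    return redo, undo

rows = [
    (1, "Conservatorio Nacional de M&amp;#250;sica"),
    (2, "Instituto Argentino de Computacion - IAC"),
]
redo, undo = plan_updates(rows)
for statement, params in redo:
    print(statement, params)
```

The generated pairs can be written to a log before anything is executed, which covers both the dry run and the undo requirement from the checklist.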

Hope this helps.

Answer 6

I had to do the same thing, though my field was UTF-8, so ' ' = '%20'. Here is a simple fix query (it worked on my table... of course you will need to adapt it yourself):

update TABLE set FIELD_NAME=replace(FIELD_NAME,'&amp;#225;','á');

I got this from a similar question: Search and replace part of string in database

Also, if you only have a small number of characters to replace, this is still easy to do.

MySQL: replace HTML special characters with their UTF equivalents
