Question

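I apologize for a topic title like this, but that is simply what the problem is.

I am currently writing a parser for Twitter, and when it comes across these symbols in the text of a tweet, Yii throws an error: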

SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xF0\x9F\x98\x8D\xF0\x9F...' for column 'code' at row 1.

I wrote the following code:

if (preg_match('//si', $texts[$i])) {
 $texts[$i] = str_replace('', '', $texts[$i]); 
}

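But it did not help, because all of these characters have different Unicode code points (they are just displayed as squares)...

I also wrote the following code: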

        if (preg_match('/xF0/si', $texts[$i])) {
            unset($texts[$i]);
        }

But that did not help either...

The symbols are: ✂✃✄✆✇✈✉✌✍✎✏✐✑✒✓✔✕✖✗✘✙✚✛✜✝✞✟✠✡✢✣✤✥✦✧✩✪✫✬✭✮✯✰ ✱✲✳✴✵✶✷✸✹✺✻✼✽✾✿❀❁❂❃❄❅❆❇❈❉❊❋❍❏❐❑❒❖|❙❚❛❜❝❞❡❢❣❤❥❦❧❶❷❸ ❹❺❻❼❽❾❿➀➁➂➃➄➅➆7➇➈➉➊➋➌➍➎➏➐➑➒➓➔➘➙➚➛➜➝➞➟➠➡➢➣➤➥➦➧➨➩➪➫➬ ➭➮➱➱➳➴➵➶➷➸➺➻➼➽➽ and many others...


How can I remove all of these symbols from the parsed text (without using utf8mb4)?

Answer 1

You were so close. Combining your code with Marc B's comment, we have:

if (preg_match('/\xF0/si', $texts[$i])) {
  $texts[$i] = preg_replace('/\xF0/si', '', $texts[$i]); 
}
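
One thing worth checking with the snippet above: `/\xF0/` matches only the lead byte, so the three bytes that follow it (`\x9F\x98\x8D` in the error message) would stay in the string. A minimal sketch of a variant that removes each character above U+FFFF as a whole, assuming the input is valid UTF-8. The helper name `stripSupplementary` is made up for illustration and is not part of the original answer:

```php
<?php
// Hypothetical helper; the name is an assumption, not from the answer above.
function stripSupplementary(string $text): string
{
    // With the /u modifier PCRE works on code points instead of bytes,
    // so every character above U+FFFF (4 bytes in UTF-8) is removed whole.
    $clean = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $text);

    // preg_replace() returns null on failure, e.g. if $text is not valid UTF-8.
    // This sketch simply falls back to the unchanged input in that case.
    return $clean ?? $text;
}

// Made-up sample containing U+1F60D, the first character from the error message.
$tweet = "I love this \xF0\x9F\x98\x8D so much";
echo stripSupplementary($tweet), PHP_EOL;
```

Characters that need three bytes or fewer are left alone, so this should only touch what a MySQL `utf8` column cannot store.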

Answer 2

function replace4byte($string) {
    return preg_replace('%(?:
          \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )%xs', '', $string);    
}
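
A short usage sketch for the function above, with the function repeated so the snippet runs on its own. The sample string is made up: it contains U+1F60D (four bytes, the character from the error message) followed by U+2714 (three bytes, one of the symbols listed in the question):

```php
<?php
// Same pattern as replace4byte() above, repeated so this snippet is self-contained.
function replace4byte($string) {
    return preg_replace('%(?:
          \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )%xs', '', $string);
}

// Made-up sample: "ok ", U+1F60D, a space, then U+2714.
$sample = "ok \xF0\x9F\x98\x8D \xE2\x9C\x94";
$result = replace4byte($sample);

// Print the bytes as hex to see exactly what was removed.
echo bin2hex($result), PHP_EOL; // 6f6b2020e29c94
```

The four-byte character is gone, while the three-byte check mark is kept. That seems to be the intended behavior here, since three-byte characters do fit in a MySQL `utf8` column and should not trigger error 1366.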

Square symbols in parsed text from Twitter