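Parsing Unicode tweet text in PHP

Time: 2013-07-20 17:04:53

Tags: php regex twitter unicode

I'm not sure which part of my script is actually at fault, but I'm having trouble parsing tweet text that contains Unicode characters:

Example tweet: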

Landsliðsmaður með viti. #rafhlaða #hræddur http://t.co/ci03F3vUNM

When I fetch it with twitteroauth and save it to a .txt file, the string ends up in the file like this:

Landsli\u00f0sma\u00f0ur me\u00f0 viti. #rafhla\u00f0a #hr\u00e6ddur http:\/\/t.co\/ci03F3vUNM

I use a few simple preg_replace calls to replace parts of the text with hyperlinks:

function twitterify($ret) {
  $ret = preg_replace("#(^|[\n ])([\w]+?://[\w]+[^ \"\n\r\t< ]*)#", "\\1<a href=\"\\2\" target=\"_blank\">\\2</a>", $ret);
  $ret = preg_replace("#(^|[\n ])((www|ftp)\.[^ \"\t\n\r< ]*)#", "\\1<a href=\"http://\\2\" target=\"_blank\">\\2</a>", $ret);
  $ret = preg_replace("/@(\w+)/", "<a href=\"http://www.twitter.com/\\1\" target=\"_blank\">@\\1</a>", $ret);
  $ret = preg_replace("/#(\w+)/", "<a href=\"http://search.twitter.com/search?q=\\1\" target=\"_blank\">#\\1</a>", $ret);
  return $ret;
}

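But it fails as soon as it hits a Unicode character:
#rafhlaða becomes <a href="#">#rafhla</a>ða
#hræddur becomes <a href="#">#hr</a>æddur
and so on.

Where am I going wrong? Is it in saving/opening my text file with PHP, or in parsing the Unicode-encoded string?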

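1 Answer:

Answer 0: (score: 1)

I added the u modifier to the end of every regex, and it works. The u modifier makes PCRE treat the pattern and the subject as UTF-8, so \w also matches non-ASCII letters such as ð and æ. Save the file as UTF-8. If you have a JSON-encoded string, you can decode it with the solution from this question: Php/json: decode utf8?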
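To see the effect of the u modifier in isolation, here is a minimal sketch. It assumes the source file is saved as UTF-8, and first_hashtag() is a hypothetical helper introduced only for this illustration; it is not part of the original script.

```php
<?php
// Hypothetical helper (illustration only): returns the text of the first
// hashtag in $text, or null if there is none.
function first_hashtag($text, $unicode) {
    // The only difference between the two patterns is the trailing "u".
    $pattern = $unicode ? '/#(\w+)/u' : '/#(\w+)/';
    return preg_match($pattern, $text, $m) === 1 ? $m[1] : null;
}

// Assumption: this file is saved as UTF-8, so the literal holds UTF-8 bytes.
$tweet = 'Landsliðsmaður með viti. #rafhlaða #hræddur';

// Byte-oriented matching: the bytes encoding "ð" are not word characters
// in the default "C" locale, so the match is cut short.
var_dump(first_hashtag($tweet, false));

// UTF-8 matching: "ð" counts as a letter, so the whole hashtag is captured.
var_dump(first_hashtag($tweet, true));
```

Without u the match should stop at rafhla, which is the truncation described in the question; with u it should return rafhlaða. The complete script follows.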

<?php
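// Convert one \uXXXX escape (captured as four hex digits) to UTF-8.
function ewchar_to_utf8($matches) {
    $ewchar = $matches[1];
    $binwchar = hexdec($ewchar);
    // Pack the value as one big-endian 16-bit code unit (two bytes).
    $wchar = chr(($binwchar >> 8) & 0xFF) . chr(($binwchar) & 0xFF);
    return iconv("unicodebig", "utf-8", $wchar);
}

// Replace every \uXXXX escape in $str. Surrogate pairs are not handled.
function special_unicode_to_utf8($str) {
    return preg_replace_callback("/\\\u([[:xdigit:]]{4})/i", "ewchar_to_utf8", $str);
}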

$text = 'Landsli\u00f0sma\u00f0ur me\u00f0 viti. #rafhla\u00f0a #hr\u00e6ddur http:\/\/t.co\/ci03F3vUNM';
$text = special_unicode_to_utf8($text);

function twitterify($ret) {
  $ret = preg_replace("#(^|[\n ])([\w]+?://[\w]+[^ \"\n\r\t< ]*)#u", "\\1<a href=\"\\2\" target=\"_blank\">\\2</a>", $ret);
  $ret = preg_replace("#(^|[\n ])((www|ftp)\.[^ \"\t\n\r< ]*)#u", "\\1<a href=\"http://\\2\" target=\"_blank\">\\2</a>", $ret);
  $ret = preg_replace("/@(\w+)/u", "<a href=\"http://www.twitter.com/\\1\" target=\"_blank\">@\\1</a>", $ret);
  $ret = preg_replace("/#(\w+)/u", "<a href=\"http://search.twitter.com/search?q=\\1\" target=\"_blank\">#\\1</a>", $ret);
  return $ret;
}

$text = twitterify($text);
print $text;

Output:

Landsliðsmaður með viti. <a href="http://search.twitter.com/search?q=rafhlaða" target="_blank">#rafhlaða</a> <a href="http://search.twitter.com/search?q=hræddur" target="_blank">#hræddur</a> http:\/\/t.co\/ci03F3vUNM
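The URL in the output above is still unlinked because its slashes remain escaped as \/, so the first pattern never sees "://". One possible alternative, sketched here under the assumption that the saved text is exactly the body of a JSON string (no unescaped double quotes or control characters), is to let json_decode() undo all the escapes at once. The variable names are illustrative.

```php
<?php
// Assumption: $raw is the JSON-escaped text exactly as it was written to the file.
// Single quotes keep the backslashes literal in PHP.
$raw = 'Landsli\u00f0sma\u00f0ur me\u00f0 viti. #rafhla\u00f0a #hr\u00e6ddur http:\/\/t.co\/ci03F3vUNM';

// Wrapping the text in double quotes makes it a JSON string literal,
// so json_decode() resolves both \uXXXX and \/ in a single step.
$text = json_decode('"' . $raw . '"');

if ($text === null) {
    // json_decode() returns null for invalid JSON; fall back to the raw text.
    $text = $raw;
}

echo $text, "\n";
```

Passing the decoded $text to twitterify() should then let the first pattern match the URL as well, since it now contains "://".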