PHP Telegram bot: extracting URLs when offsets are given in UTF-16 code units

Date: 2018-02-28 17:25:55

Tags: utf-16 php-telegram-bot

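I have a problem with the Telegram Bot API. I am trying to extract URLs from a message. Each URL is described by a MessageEntity, whose offset and length are specified in UTF-16 code units. I have tried many ways of getting the substring out of the text (mb_convert_encoding, iconv, json_encode, etc.), but I never get the correct link. It works for plain text without emoji, but not for text that contains them.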
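To make the mismatch concrete, here is a small sketch of how the three ways of measuring a string diverge once an emoji is present. The sample text is made up for illustration, and the snippet assumes PHP 7 or later with the mbstring extension enabled.

```php
<?php
// Illustrative text, not taken from a real Telegram update.
// "\u{1F600}" is a single emoji that lies outside the Basic Multilingual Plane.
$text = "\u{1F600} https://example.com";

$bytes      = strlen($text);                // UTF-8 bytes
$codePoints = mb_strlen($text, 'UTF-8');    // Unicode code points
// In UTF-16LE every code unit is exactly 2 bytes, so bytes / 2 gives code units.
$utf16Units = strlen(mb_convert_encoding($text, 'UTF-16LE', 'UTF-8')) / 2;

echo $bytes, "\n";       // 24
echo $codePoints, "\n";  // 21
echo $utf16Units, "\n";  // 22
```

The emoji counts as 4 bytes, 1 code point, and 2 UTF-16 code units, so an offset expressed in UTF-16 code units matches neither substr() nor mb_substr() on the UTF-8 text.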

1 answer:

Answer 0 (score: 0)

$output = json_decode(file_get_contents('php://input'), true);
$message = $output['message']['text'] ?? '';
$entities = $output['message']['entities'] ?? [];

// Collect every URL in a message, using entity offsets given in UTF-16 code units.
function getURLs($message, $entities) {

    $URLs = [];

    // Explicit UTF-16LE: no BOM, exactly 2 bytes per code unit.
    //$message_encode = iconv('UTF-8', 'UTF-16LE', $message);
    $message_encode = mb_convert_encoding($message, 'UTF-16LE', 'UTF-8');

    foreach ($entities as $entity) {

        // Some entities carry the link in a separate "url" field.
        if (!empty($entity['url'])) {
            $URLs[] = $entity['url'];
        }

        // For "url" entities, the link is a slice of the message text.
        if ($entity['type'] == 'url') {
            // Offset and length count code units, so multiply by 2 to get bytes.
            $URL16 = substr($message_encode, $entity['offset'] * 2, $entity['length'] * 2);

            //$URLs[] = iconv('UTF-16LE', 'UTF-8', $URL16);
            $URLs[] = mb_convert_encoding($URL16, 'UTF-8', 'UTF-16LE');
        }

    }

    return $URLs;

}

$URLs = getURLs($message, $entities);

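You can use either iconv or mb_convert_encoding. The byte order does not matter as long as the same encoding is used in both directions, but it is safer to name it explicitly (UTF-16LE or UTF-16BE): depending on the iconv implementation, plain UTF-16 can prepend a byte order mark, which shifts every offset by one code unit. See also "PHP - length of string containing emojis/special chars".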
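As a quick self-contained check of the idea, here is a minimal sketch that slices a string by UTF-16 offset and length. The helper name utf16Substr and the sample message and entity are hypothetical, made up for illustration; the snippet assumes PHP 7 or later with the mbstring extension.

```php
<?php
// Hypothetical helper: slice a UTF-8 string using an offset and length
// expressed in UTF-16 code units.
function utf16Substr(string $text, int $offset, int $length): string
{
    // UTF-16LE has no BOM and uses exactly 2 bytes per code unit.
    $utf16 = mb_convert_encoding($text, 'UTF-16LE', 'UTF-8');
    $slice = substr($utf16, $offset * 2, $length * 2);

    return mb_convert_encoding($slice, 'UTF-8', 'UTF-16LE');
}

// Made-up message: the emoji occupies two UTF-16 code units and the space one,
// so the URL starts at offset 3 and is 19 code units long.
$text   = "\u{1F600} https://example.com";
$entity = ['type' => 'url', 'offset' => 3, 'length' => 19];

$url = utf16Substr($text, $entity['offset'], $entity['length']);
echo $url, "\n"; // prints "https://example.com"
```

Using mb_substr($text, 3, 19) on the same input would start one character too late, because mb_substr counts the emoji as a single character.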