Splitting SMS messages that contain "escaped" characters

Time: 2013-11-24 08:56:05

Tags: php regex sms

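I need to split SMS messages longer than 160 characters into multiple parts so that I can send long messages.

Some SMS APIs will split messages for you (they support multipart messages), but I'm working with several providers, so I have to split the messages myself.

Splitting a message is easy. My problem is what to do when the SMS message contains characters that are "escaped" and "use up" 2 characters.

For those who don't know what I'm talking about: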

  

Even in the 7-bit encoding, some characters are "escaped", which means they "use up" 2 characters. In the default 7-bit encoding, these are: {}[]\|^~€

Source: https://stackoverflow.com/a/7061794/158126

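For example, this string is 35 characters long:

  

The amount of this payment is €100.

However, when it is sent through an SMS provider it is actually 36 characters long, because the euro sign is "escaped" and takes up two characters.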
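
The arithmetic above can be checked with a short sketch. `gsm_septet_length` is a hypothetical helper name, the sample string is an assumed 35-character sentence containing one euro sign, and the helper assumes the text contains only characters from the GSM 7-bit alphabet:

```php
<?php
// Hypothetical helper (not part of the function below): returns how many
// 7-bit characters a message occupies, assuming it only contains GSM 7-bit
// characters. Each "escaped" character counts twice.
function gsm_septet_length($text) {
    // The escaped characters listed in the quote above: {}[]\|^~€
    $escaped = preg_match_all('/[{}\[\]\\\\|^~€]/u', $text, $matches);
    return mb_strlen($text, 'UTF-8') + $escaped;
}

$text = 'The amount of this payment is €100.';
echo mb_strlen($text, 'UTF-8'), "\n"; // 35
echo gsm_septet_length($text), "\n";  // 36
```

A single `preg_match_all` call counts every escaped character at once, so no per-character loop is needed for this check.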

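There are many questions about splitting SMS messages, but none of them take into account the problems these "escaped" characters can cause.

So I created a function to deal with this. I've tested it and it works, so hopefully it will help others.

Back to my question: I feel my code is very inefficient. I'm running preg_match several times inside loops, and I'm not sure whether there is a better solution.

Does anyone have suggestions for making this code more efficient?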

function sms_message_parts($message) {

    // Message parts
    $parts = array();

    // The default encoding is utf16 (unicode) until proven otherwise
    $encoding = 'utf16';

    // Characters that are allowed in 7bit messages
    $gsm_7bit_chars = '@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !"#¤%&\'\(\)\*+,-\.\/0123456789:;<=>?¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑܧ¿abcdefghijklmnopqrstuvwxyzäöñüà';

    // Characters that are allowed in 7bit_ex messages
    $gsm_7bit_ex_chars = '\^{}\\\\\[~\]|€';

    // Message lengths
    $message_lengths = array(
        '7bit' => 160,
        '7bit_ex' => 160,
        'utf16' => 70
    );

    // Detect encoding of message
    if (preg_match("/^[" . $gsm_7bit_chars . "]*$/u", $message) == 1)
        $encoding = '7bit';
    elseif (preg_match("/^[" . $gsm_7bit_chars . $gsm_7bit_ex_chars . "]*$/u", $message) == 1)
        $encoding = '7bit_ex';

    // Determine how long each part of the message can be
    $max_parts_length = $message_lengths[$encoding];

    // Length of the message
    $message_length = mb_strlen($message, 'UTF-8');

    // 7bit_ex message
    // Escaped characters found so we need to find the REAL length
    // and split the message differently
    if ($encoding == '7bit_ex') {

        // Count how many extra characters are required as a result of
        // the 7bit_ex characters
        $extra_chars = 0;
        for($i=0;$i<$message_length;$i++) {
            if (preg_match("/^[" . $gsm_7bit_ex_chars . "]*$/u", mb_substr($message, $i, 1, 'UTF-8')) == 1)
                $extra_chars++;
        }

        // New message length
        $new_message_length = $message_length + $extra_chars;

        // Is this going to be a multipart message?
        if ($new_message_length > $max_parts_length) {

            // Split the message
            $start = 0;
            while(true) {

                // Determine the length of the split
                $chars_left = $message_length - $start;
                if ($chars_left < $max_parts_length) {
                    $split_length = $chars_left;
                } else {
                    $split_length = $max_parts_length;
                }

                // Extract the message part
                $part = mb_substr($message, $start, $split_length, 'UTF-8');

                // Count the escaped characters in this part (each one uses an extra character)
                $part_extra_chars = 0;
                for($i=0;$i<$split_length;$i++) {
                    if (preg_match("/^[" . $gsm_7bit_ex_chars . "]*$/u", mb_substr($part, $i, 1, 'UTF-8')) == 1)
                        $part_extra_chars++;
                }

                // The part only needs shortening if the escaped characters push it over the
                // limit (this can also happen on the last part)
                if ($split_length + $part_extra_chars <= $max_parts_length)
                    $part_extra_chars = 0;

                // If it has escaped characters, deduct from the amount of characters in this part
                // before adding to the parts array
                if ($part_extra_chars > 0) {

                    $part = mb_substr($message, $start, ($split_length - $part_extra_chars), 'UTF-8');
                    $parts[] = trim($part);
                    $start = $start + ($split_length - $part_extra_chars);

                // No escaped characters, add part to parts array
                } else {

                    $parts[] = trim($part);
                    $start = $start + $split_length;

                }

                // We've reached the end of the message
                if ($start >= $message_length)
                    break;

            }

        // It's a single message
        } else {
            $parts[] = $message;
        }

    // 7bit and utf16 (unicode) messages don't have escaped characters
    } else {

        // Is this going to be a multipart message? Split this part before adding to the
        // parts array
        if ($message_length > $max_parts_length) {

            // Split the message into parts
            $total_messages = ceil($message_length / $max_parts_length);
            $start = 0;
            for($i=0;$i<$total_messages;$i++) {
                $parts[] = trim(mb_substr($message, $start, $max_parts_length, 'UTF-8'));
                $start = $start + $max_parts_length;
            }

        // It's a single message
        } else {
            $parts[] = $message;
        }

    }

    return array('parts' => $parts, 'encoding' => $encoding);

}
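
For comparison, here is a minimal sketch of a single-pass split that avoids calling preg_match once per character. It is an illustration and not the function above: `sms_split_septets` is a hypothetical name, it assumes the message has already been detected as 7-bit compatible, and unlike the function above it does not trim the parts.

```php
<?php
// Hypothetical alternative: split a GSM 7-bit message into parts of at most
// $limit 7-bit characters, counting escaped characters as 2, in one pass.
function sms_split_septets($message, $limit = 160) {
    // Escaped characters from the question, used as a lookup table
    $escaped = array_flip(array('^', '{', '}', '\\', '[', '~', ']', '|', '€'));

    // Break the UTF-8 string into individual characters once
    $chars = preg_split('//u', $message, -1, PREG_SPLIT_NO_EMPTY);

    $parts = array();
    $current = '';
    $used = 0; // 7-bit characters used by $current so far

    foreach ($chars as $char) {
        $cost = isset($escaped[$char]) ? 2 : 1;

        // Start a new part when this character would not fit
        if ($used + $cost > $limit) {
            $parts[] = $current;
            $current = '';
            $used = 0;
        }

        $current .= $char;
        $used += $cost;
    }

    if ($current !== '') {
        $parts[] = $current;
    }

    return $parts;
}

$parts = sms_split_septets(str_repeat('a', 159) . '€b');
echo count($parts), "\n"; // 2
```

Because the cost is added per character, an escaped character is never split across two parts, and each part is filled as far as the limit allows.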

1 answer:

Answer 0: (score: 0)

If you just want to skip specific characters, you can use a regular expression.

Something like: (?#comment)

The regex engine ignores everything between (?# and ).
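
A small sketch of what that construct does in PHP's PCRE functions. Note that the comment is ignored inside the pattern itself; it does not make the engine skip characters in the subject string. The variable names are only for illustration:

```php
<?php
// "(?#...)" is an inline comment inside the pattern
$with_comment = preg_match('/a(?# this text is ignored by the engine )b/', 'ab');
$without_comment = preg_match('/ab/', 'ab');

// Both patterns behave identically
var_dump($with_comment === $without_comment); // bool(true)
```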

Or you can use a regular expression with replace_all, where you replace the characters you want with the empty string ''.
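
A sketch of the replacement idea in PHP, assuming "replace_all" is read as `preg_replace()` (PHP has no function called replace_all). The sample message and variable names are made up for illustration:

```php
<?php
// Character class with the escaped characters from the question
$pattern = '/[{}\[\]\\\\|^~€]/u';
$message = 'Total: €100 [approx]';

// Replace each escaped character with the empty string
$stripped = preg_replace($pattern, '', $message);
echo $stripped, "\n"; // Total: 100 approx

// The optional fifth argument reports how many replacements were made,
// which also gives the number of escaped characters in one call
preg_replace($pattern, '', $message, -1, $count);
echo $count, "\n"; // 3
```

Removing the characters changes the message text, so this only fits cases where losing them is acceptable; the count alone is enough to work out the real length.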