将多字节字符串截断为n个字符

时间:2010-01-28 11:51:49

标签: php string truncate multibyte

我正在尝试在字符串过滤器中使用此方法:

public function truncate($string, $chars = 50, $terminator = ' …');

我希望这个

$in  = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWYXZ1234567890";
$out = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …";

还有这个

$in  = "âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝ";
$out = "âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđ …";

$chars减去$terminator字符串的字符。

此外,过滤器应该在$chars限制以下的第一个字边界切割,例如

$in  = "Answer to the Ultimate Question of Life, the Universe, and Everything.";
$out = "Answer to the Ultimate Question of Life, the …";

我很确定这应该适用于这些步骤

  • 从最大字符中减去终结符中的字符数量
  • 验证字符串是否长于计算出的限制,或者不加改变地返回
  • 找到字符串中最后一个空格字符,低于计算限制以获取字边界
  • 在最后一个空格处剪切字符串,如果没有找到最后一个空格,则计算限制
  • 将终结符附加到字符串
  • 返回字符串

但是,我现在尝试了str*mb_*函数的各种组合,但都产生了错误的结果。这不是那么困难,所以我显然缺少一些东西。有人会分享这个的工作实现,将我指向一个我最终可以理解如何去做的资源。

由于

P.S。是的,我之前检查了https://stackoverflow.com/search?q=truncate+string+php:)

4 个答案:

答案 0 :(得分:5)

刚刚发现PHP已经有一个带有

的多字节截断

虽然它不遵守词边界。但是很方便!

答案 1 :(得分:3)

试试这个:

function truncate($string, $chars = 50, $terminator = ' …') {
    $cutPos = $chars - mb_strlen($terminator);
    $boundaryPos = mb_strrpos(mb_substr($string, 0, mb_strpos($string, ' ', $cutPos)), ' ');
    return mb_substr($string, 0, $boundaryPos === false ? $cutPos : $boundaryPos) . $terminator;
}

但您需要确保正确设置内部编码。

答案 2 :(得分:0)

我通常不喜欢只编写这样一个问题的完整答案。但是我也醒了,我想也许你的问题可以让我心情愉快去参加今天余下的活动。

我没有试图运行它,但它应该可以工作,或至少让你在那里90%。

function truncate( $string, $chars = 50, $terminate = ' ...' )
{
    $chars -= mb_strlen($terminate);
    if ( $chars <= 0 )
        return $terminate;

    $string = mb_substr($string, 0, $chars);
    $space = mb_strrpos($string, ' ');

    if ($space < mb_strlen($string) / 2)
        return $string . $terminate;
    else
        return mb_substr($string, 0, $space) . $terminate;
}

答案 3 :(得分:0)

tldr;

  • 足够短的字符串不应附加省略号。
  • 换行符也应该是合格的断点。
  • 正则表达式一旦分解并得到解释,就不会太吓人。

我认为关于这个问题和当前的答案有一些重要的事情要指出。我将根据戈登的样本数据和一些其他案例演示答案与我的正则表达式答案的比较,以揭示一些不同的结果。

首先,要澄清输入值的质量。戈登说,该功能必须是多字节安全的,并遵守字边界。样本数据在确定截断位置时并未暴露对非空格,非单词字符(例如标点符号)的期望处理,因此我们必须假设以空格字符为目标已经足够了,并且明智的做法是,因为大多数“阅读更多内容”字符串在截断时不必担心遵守标点符号。

第二,在相当普遍的情况下,必须对包含换行符的大量文本应用省略号。

第三,我们随便同意一些基本的数据标准化,例如:

  • 字符串中的所有前导/尾随空白字符已被修剪
  • $chars的值将始终大于mb_strlen()的{​​{1}}

Demo

功能:

$terminator

测试用例:

function truncateGumbo($string, $chars = 50, $terminator = ' …') {
    $cutPos = $chars - mb_strlen($terminator);
    $boundaryPos = mb_strrpos(mb_substr($string, 0, mb_strpos($string, ' ', $cutPos)), ' ');
    return mb_substr($string, 0, $boundaryPos === false ? $cutPos : $boundaryPos) . $terminator;
}

function truncateGordon($string, $chars = 50, $terminator = ' …') {
    return mb_strimwidth($string, 0, $chars, $terminator);
}

function truncateSoapBox($string, $chars = 50, $terminate = ' …')
{
    $chars -= mb_strlen($terminate);
    if ( $chars <= 0 )
        return $terminate;

    $string = mb_substr($string, 0, $chars);
    $space = mb_strrpos($string, ' ');

    if ($space < mb_strlen($string) / 2)
        return $string . $terminate;
    else
        return mb_substr($string, 0, $space) . $terminate;
}

function truncateMickmackusa($string, $max = 50, $terminator = ' …') {
    $trunc = $max - mb_strlen($terminator, 'UTF-8');
    return preg_replace("~(?=.{{$max}})(?:\S{{$trunc}}|.{0,$trunc}(?=\s))\K.+~us", $terminator, $string);
}

执行:

$tests = [
    [
        'testCase' => "Answer to the Ultimate Question of Life, the Universe, and Everything.",
        // 50th char ---------------------------------------------------^
        'expected' => "Answer to the Ultimate Question of Life, the …",
    ],
    [
        'testCase' => "A single line of text to be followed by another\nline of text",
        // 50th char ----------------------------------------------------^
        'expected' => "A single line of text to be followed by another …",
    ],
    [
        'testCase' => "âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝ",
        // 50th char ---------------------------------------------------^
        'expected' => "âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđ …",
    ],
    [
        'testCase' => "123456789 123456789 123456789 123456789 123456789",
        // 50th char doesn't exist -------------------------------------^
        'expected' => "1234567890123456789012345678901234567890123456789",
    ],
    [
        'testCase' => "Hello worldly world",
        // 50th char doesn't exist -------------------------------------^
        'expected' => "Hello worldly world",
    ],
    [
        'testCase' => "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWYXZ1234567890",
        // 50th char ---------------------------------------------------^
        'expected' => "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …",
    ],
];

输出:

foreach ($tests as ['testCase' => $testCase, 'expected' => $expected]) {
    echo "\tSample Input:\t\t$testCase\n";
    echo "\n\ttruncateGumbo:\t\t" , truncateGumbo($testCase);
    echo "\n\ttruncateGordon:\t\t" , truncateGordon($testCase);
    echo "\n\ttruncateSoapBox:\t" , truncateSoapBox($testCase);
    echo "\n\ttruncateMickmackusa:\t" , truncateMickmackusa($testCase);
    echo "\n\tExpected Result:\t{$expected}";
    echo "\n-----------------------------------------------------\n";
}

我的模式说明:

尽管看起来确实很难看,但是大多数乱码模式语法都是将数字值插入为动态量词的问题。

我也可以这样写:

    Sample Input:           Answer to the Ultimate Question of Life, the Universe, and Everything.

    truncateGumbo:          Answer to the Ultimate Question of Life, the …
    truncateGordon:         Answer to the Ultimate Question of Life, the Uni …
    truncateSoapBox:        Answer to the Ultimate Question of Life, the …
    truncateMickmackusa:    Answer to the Ultimate Question of Life, the …
    Expected Result:        Answer to the Ultimate Question of Life, the …
-----------------------------------------------------
    Sample Input:           A single line of text to be followed by another
line of text

    truncateGumbo:          A single line of text to be followed by …
    truncateGordon:         A single line of text to be followed by another
 …
    truncateSoapBox:        A single line of text to be followed by …
    truncateMickmackusa:    A single line of text to be followed by another …
    Expected Result:        A single line of text to be followed by another …
-----------------------------------------------------
    Sample Input:           âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝ

    truncateGumbo:          âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđ …
    truncateGordon:         âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđ …
    truncateSoapBox:        âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđ …
    truncateMickmackusa:    âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđ …
    Expected Result:        âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđ …
-----------------------------------------------------
    Sample Input:           123456789 123456789 123456789 123456789 123456789

    truncateGumbo:          123456789 123456789 123456789 123456789 12345678 …
    truncateGordon:         123456789 123456789 123456789 123456789 123456789
    truncateSoapBox:        123456789 123456789 123456789 123456789 …
    truncateMickmackusa:    123456789 123456789 123456789 123456789 123456789
    Expected Result:        1234567890123456789012345678901234567890123456789
-----------------------------------------------------
    Sample Input:           Hello worldly world

    truncateGumbo:          
Warning: mb_strpos(): Offset not contained in string in /in/ibFH5 on line 4
Hello worldly world …
    truncateGordon:         Hello worldly world
    truncateSoapBox:        Hello worldly …
    truncateMickmackusa:    Hello worldly world
    Expected Result:        Hello worldly world
-----------------------------------------------------
    Sample Input:           abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWYXZ1234567890

    truncateGumbo:          abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
    truncateGordon:         abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
    truncateSoapBox:        abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
    truncateMickmackusa:    abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
    Expected Result:        abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
-----------------------------------------------------

为简单起见,我将'~(?:\S{' . $trunc . '}|(?=.{' . $max . '}).{0,' . $trunc . '}(?=\s))\K.+~us' 替换为$trunc,将48替换为$max

50