缩短文本而不分割单词或破坏html标签

时间:2013-05-16 09:26:53

标签: php

我试图在236个字符之后切断文本,而不会将单词切成两半并保留html标签。这就是我现在正在使用的:

$shortdesc = $_helper->productAttribute($_product, $_product->getShortDescription(), 'short_description');
$lenght = 236;
echo substr($shortdesc, 0, strrpos(substr($shortdesc, 0, $lenght), " "));

虽然这在大多数情况下都有效,但它不会尊重html标签。例如,本文:

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. <strong>Stet clita kasd gubergren</strong>
标签仍处于打开状态时,

将被切断。有没有办法在236个字符之后切断文本但是尊重html标签?

9 个答案:

答案 0 :(得分:15)

我遇到的最佳解决方案来自CakePHP框架TextHelper类

这是方法

/**
* Truncates text.
*
* Cuts a string to the length of $length and replaces the last characters
* with the ending if the text is longer than length.
*
* ### Options:
*
* - `ending` Will be used as Ending and appended to the trimmed string
* - `exact` If false, $text will not be cut mid-word
* - `html` If true, HTML tags would be handled correctly
*
* @param string  $text String to truncate.
* @param integer $length Length of returned string, including ellipsis.
* @param array $options An array of html attributes and options.
* @return string Trimmed string.
* @access public
* @link http://book.cakephp.org/view/1469/Text#truncate-1625
*/
function truncate($text, $length = 100, $options = array()) {
    $default = array(
        'ending' => '...', 'exact' => true, 'html' => false
    );
    $options = array_merge($default, $options);
    extract($options);

    if ($html) {
        if (mb_strlen(preg_replace('/<.*?>/', '', $text)) <= $length) {
            return $text;
        }
        $totalLength = mb_strlen(strip_tags($ending));
        $openTags = array();
        $truncate = '';

        preg_match_all('/(<\/?([\w+]+)[^>]*>)?([^<>]*)/', $text, $tags, PREG_SET_ORDER);
        foreach ($tags as $tag) {
            if (!preg_match('/img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param/s', $tag[2])) {
                if (preg_match('/<[\w]+[^>]*>/s', $tag[0])) {
                    array_unshift($openTags, $tag[2]);
                } else if (preg_match('/<\/([\w]+)[^>]*>/s', $tag[0], $closeTag)) {
                    $pos = array_search($closeTag[1], $openTags);
                    if ($pos !== false) {
                        array_splice($openTags, $pos, 1);
                    }
                }
            }
            $truncate .= $tag[1];

            $contentLength = mb_strlen(preg_replace('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', ' ', $tag[3]));
            if ($contentLength + $totalLength > $length) {
                $left = $length - $totalLength;
                $entitiesLength = 0;
                if (preg_match_all('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', $tag[3], $entities, PREG_OFFSET_CAPTURE)) {
                    foreach ($entities[0] as $entity) {
                        if ($entity[1] + 1 - $entitiesLength <= $left) {
                            $left--;
                            $entitiesLength += mb_strlen($entity[0]);
                        } else {
                            break;
                        }
                    }
                }

                $truncate .= mb_substr($tag[3], 0 , $left + $entitiesLength);
                break;
            } else {
                $truncate .= $tag[3];
                $totalLength += $contentLength;
            }
            if ($totalLength >= $length) {
                break;
            }
        }
    } else {
        if (mb_strlen($text) <= $length) {
            return $text;
        } else {
            $truncate = mb_substr($text, 0, $length - mb_strlen($ending));
        }
    }
    if (!$exact) {
        $spacepos = mb_strrpos($truncate, ' ');
        if (isset($spacepos)) {
            if ($html) {
                $bits = mb_substr($truncate, $spacepos);
                preg_match_all('/<\/([a-z]+)>/', $bits, $droppedTags, PREG_SET_ORDER);
                if (!empty($droppedTags)) {
                    foreach ($droppedTags as $closingTag) {
                        if (!in_array($closingTag[1], $openTags)) {
                            array_unshift($openTags, $closingTag[1]);
                        }
                    }
                }
            }
            $truncate = mb_substr($truncate, 0, $spacepos);
        }
    }
    $truncate .= $ending;

    if ($html) {
        foreach ($openTags as $tag) {
            $truncate .= '</'.$tag.'>';
        }
    }

    return $truncate;
}

其他框架可能对此问题有类似(或不同)的解决方案,因此您也可以查看它们。我对Cake的熟悉促使我链接到他们的解决方案

编辑:

刚刚在我正在使用OP的文本

处理的app中测试了这个方法
<?php 
echo truncate(
'Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. <strong>Stet clita kasd gubergren</strong>', 
236, 
array('html' => true, 'ending' => '')); 
?>

输出:

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. <strong>Stet clita kasd gubegre</strong>

请注意,输出会在完成最后一个单词时停止,但包含完整的强标记

答案 1 :(得分:15)

这应该这样做:

class Html
{
    protected
        $reachedLimit = false,
        $totalLen = 0,
        $maxLen = 25,
        $toRemove = array();

    public static function trim($html, $maxLen = 25)
    {

        $dom = new DomDocument();

        if (version_compare(PHP_VERSION, '5.4.0') < 0) {
            $dom->loadHTML($html);
        } else {
            $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
        }

        $instance = new static();
        $toRemove = $instance->walk($dom, $maxLen);

        // remove any nodes that exceed limit
        foreach ($toRemove as $child) {
            $child->parentNode->removeChild($child);
        }

        // remove wrapper tags added by DD (doctype, html...)
        if (version_compare(PHP_VERSION, '5.4.0') < 0) {
            // http://stackoverflow.com/a/6953808/1058140
            $dom->removeChild($dom->firstChild);
            $dom->replaceChild($dom->firstChild->firstChild->firstChild, $dom->firstChild);

            return $dom->saveHTML();
        }

        return $dom->saveHTML();
    }

    protected function walk(DomNode $node, $maxLen)
    {

        if ($this->reachedLimit) {
            $this->toRemove[] = $node;
        } else {
            // only text nodes should have text,
            // so do the splitting here
            if ($node instanceof DomText) {
                $this->totalLen += $nodeLen = strlen($node->nodeValue);

                // use mb_strlen / mb_substr for UTF-8 support
                if ($this->totalLen > $maxLen) {
                    $node->nodeValue = substr($node->nodeValue, 0, $nodeLen - ($this->totalLen - $maxLen)) . '...';
                    $this->reachedLimit = true;
                }
            }

            // if node has children, walk its child elements
            if (isset($node->childNodes)) {
                foreach ($node->childNodes as $child) {
                    $this->walk($child, $maxLen);
                }
            }
        }

        return $this->toRemove;
    }
}

使用方式:$str = Html::trim($str, 236);

(demo here)


这与cakePHP的正则表达式解决方案之间的性能比较

enter image description here

差异非常小,而且在非常大的字符串大小下,DomDocument实际上更快。在我看来,可靠性比节省几微秒更重要。

答案 2 :(得分:1)

我能想一想吗?

示例文字:

Lorem ipsum dolor sit amet, <i class="red">magna aliquyam erat</i>, duo dolores et ea rebum. <strong>Stet clita kasd gubergren</strong> hello

首先,将其解析为:

array(
    '0' => array(
        'tag' => '',
        'text' => 'Lorem ipsum dolor sit amet, '
    ),
    '1' => array(
        'tag' => '<i class="red">',
        'text' => 'magna aliquyam erat',
    )
    '2' => ......
    '3' => ......
)

然后逐个剪切文本,并在剪切后用标签包裹每个文本

然后加入他们。

答案 3 :(得分:0)

这将适用于Unicode(来自@nice ass answer):

class Html
{
    protected
        $reachedLimit = false,
        $totalLen = 0,
        $maxLen = 25,
        $toRemove = [];

    public static function trim($html, $maxLen = 25)
    {

        $dom = new \DOMDocument();
        $dom->loadHTML('<?xml encoding="UTF-8">' . $html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

        $instance = new static();
        $toRemove = $instance->walk($dom, $maxLen);

        // remove any nodes that exceed limit
        foreach ($toRemove as $child) {
            $child->parentNode->removeChild($child);
        }

        return $dom->saveHTML();
    }

    protected function walk(\DOMNode $node, $maxLen)
    {

        if ($this->reachedLimit) {
            $this->toRemove[] = $node;
        } else {
            // only text nodes should have text,
            // so do the splitting here
            if ($node instanceof \DOMText) {
                $this->totalLen += $nodeLen = mb_strlen($node->nodeValue);

                // use mb_strlen / mb_substr for UTF-8 support
                if ($this->totalLen > $maxLen) {
                    dump($node->nodeValue);
                    $node->nodeValue = mb_substr($node->nodeValue, 0, $nodeLen - ($this->totalLen - $maxLen)) . '...';
                    $this->reachedLimit = true;
                }
            }

            // if node has children, walk its child elements
            if (isset($node->childNodes)) {
                foreach ($node->childNodes as $child) {
                    $this->walk($child, $maxLen);
                }
            }
        }

        return $this->toRemove;
    }
}

答案 4 :(得分:-1)

function limitStrlen($input, $length, $ellipses = true, $strip_html = true, $skip_html) 
{
    // strip tags, if desired
    if ($strip_html || !$skip_html) 
    {
        $input = strip_tags($input);

        // no need to trim, already shorter than trim length
        if (strlen($input) <= $length) 
        {
            return $input;
        }

        //find last space within length
        $last_space = strrpos(substr($input, 0, $length), ' ');
        if($last_space !== false) 
        {
            $trimmed_text = substr($input, 0, $last_space);
        } 
        else 
        {
            $trimmed_text = substr($input, 0, $length);
        }
    } 
    else 
    {
        if (strlen(strip_tags($input)) <= $length) 
        {
            return $input;
        }

        $trimmed_text = $input;

        $last_space = $length + 1;

        while(true)
        {
            $last_space = strrpos($trimmed_text, ' ');

            if($last_space !== false) 
            {
                $trimmed_text = substr($trimmed_text, 0, $last_space);

                if (strlen(strip_tags($trimmed_text)) <= $length) 
                {
                    break;
                }
            } 
            else 
            {
                $trimmed_text = substr($trimmed_text, 0, $length);
                break;
            }
        }

        // close unclosed tags.
        $doc = new DOMDocument();
        $doc->loadHTML($trimmed_text);
        $trimmed_text = $doc->saveHTML();
    }

    // add ellipses (...)
    if ($ellipses) 
    {
        $trimmed_text .= '...';
    }

    return $trimmed_text;
}

$str = "<h1><strong><span>Lorem</span></strong> <i>ipsum</i> <p class='some-class'>dolor</p> sit amet, consetetur.</h1>";

// view the HTML
echo htmlentities(limitStrlen($str, 22, false, false, true), ENT_COMPAT, 'UTF-8');

// view the result
echo limitStrlen($str, 22, false, false, true);

注意:可能有更好的方法来关闭代码而不是使用DOMDocument。例如,我们可以在p tag内使用h1 tag,但它仍然可以使用。但在这种情况下,标题标记将在p tag之前关闭,因为理论上它不可能在其中使用p tag。所以,要小心HTML的严格标准。

答案 5 :(得分:-2)

您可以采用XML方法并将元素推送到字符串var,直到字符串的长度超过236

示例代码?

for each node // text or tag
  push to the string var

  if string length > 236
    break

endfor

用于解析PHP http://simplehtmldom.sourceforge.net/

中的HTML

答案 6 :(得分:-2)

我在JS中做过,希望这个逻辑在PHP中也会有所帮助..

splitText : function(content, count){
        var originalContent = content;
         content = content.substring(0, count);
          //If there is no occurance of matches before breaking point and the hit breakes in between html tags.
         if (content.lastIndexOf("<") > content.lastIndexOf(">")){
            content = content.substring(0, content.lastIndexOf('<'));
            count = content.length;
            if(originalContent.indexOf("</", count)!=-1){
                content += originalContent.substring(count, originalContent.indexOf('>', originalContent.indexOf("</", count))+1);
            }else{
                 content += originalContent.substring(count, originalContent.indexOf('>', count)+1);
            }
          //If the breaking point is in between tags.
         }else if(content.lastIndexOf("<") != content.lastIndexOf("</")){
            content = originalContent.substring(0, originalContent.indexOf('>', count)+1);
         }
        return content;
    },

希望这种逻辑有助于某些人......

答案 7 :(得分:-2)

这是JS解决方案:trim-html

我们的想法是以这种方式拆分HTML字符串,使数组的元素为html标签(打开或关闭)或只是字符串。

var arr = html.replace(/</g, "\n<")
              .replace(/>/g, ">\n")
              .replace(/\n\n/g, "\n")
              .replace(/^\n/g, "")
              .replace(/\n$/g, "")
              .split("\n");

我们可以遍历数组并计算字符数。

答案 8 :(得分:-2)

仅在正则表达式中使用(在nodeJS中,但我们不在乎语言):

// I want 600 caracters max in my text
const maxLength = 600;
// Get number of caracters to delete
const lgt = txt.length - maxLength;
// we leave X characters to allow the regex to adapt (Important to not only set (.{${lgt}}))
const reg = new RegExp(`[-,;!?.():]?\\s([^\\s]|<[^>]*>)*(.{${lgt},${lgt + 10}})$`, 'i'); // /[-,;!?.():]?\s([^\s]|<[^>]*>)*(.{600,610})$/
// replace all endind caracters by (...)
return txt.replace(reg, ' (...)');

它将替换所有没有剪裁标签的最后的角色,并删除最后的空格。

例如,使用标签为<br />的作品,它将不会削减...

注意:它不会关闭所有打开的标签,不在我最初的询问中