Question

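I have a string that contains HTML tags. I'm looking for a piece of code that truncates this string to meet the following requirements:

The length is 100 characters.
It contains no image code (<img />).
It keeps other HTML markup (everything except image tags).
Whitespace and HTML markup characters do not count toward the 100-character length.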

For example, given the string:

<img>Something</img><b>Just an Example</b> Plain Text <br><a href="#">stackoverflow</a>

The result should be:

Just an Example Plain Text stackoverflow (where "stackoverflow" is a link).

That result has about 35 characters (excluding whitespace).
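
The counting rule can be checked with a short sketch. The helper name countVisibleChars is hypothetical and not part of the question, and treating <img>...</img> as a single unit to discard is an assumption based on the sample string (a real <img> element has no closing tag).

```php
<?php
// Hypothetical helper: counts visible, non-whitespace characters
// the way the question describes the 100-character limit.
function countVisibleChars(string $html): int
{
    // Assumption: drop <img> and, if present, everything up to </img>.
    $html = preg_replace('~<img\b[^>]*>(?:.*?</img>)?~is', '', $html);
    // Remove the remaining markup, then decode entities such as &amp;.
    $text = strip_tags($html);
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Whitespace does not count toward the limit.
    $text = preg_replace('/\s+/u', '', $text);
    return mb_strlen($text, 'UTF-8');
}

$sample = '<img>Something</img><b>Just an Example</b> Plain Text <br><a href="#">stackoverflow</a>';
echo countVisibleChars($sample), "\n"; // prints 35
```

For the sample string this gives 35, which matches the count stated above.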

I tried the solution from this question, but did not get the required result. Any help would be appreciated.

Answer 1

How about a function? Here is mine, AbstractHTMLContents. It takes two parameters:

the input HTML content,
the length limit.

Here is the code:

// Note: whitespace inside text still counts toward $maxLength; only tags do not.
function AbstractHTMLContents($html, $maxLength = 100)
{
    mb_internal_encoding("UTF-8");
    $printedLength = 0;  // characters of text emitted so far
    $position = 0;       // byte offset into $html (preg_match reports bytes)
    $tags = array();     // stack of currently open tag names
    $newContent = '';
    // Elements that never take a closing tag.
    $voidTags = array('area', 'base', 'br', 'col', 'embed', 'hr', 'input', 'link', 'meta', 'source', 'track', 'wbr');

    // Strip image tags, with or without attributes.
    $html = preg_replace('~</?img\b[^>]*>~i', '', $html);

    while ($printedLength < $maxLength && preg_match('{</?([a-z][a-z0-9]*)[^>]*>|&#?[a-zA-Z0-9]+;}i', $html, $match, PREG_OFFSET_CAPTURE, $position)) {
        list($tag, $tagPosition) = $match[0];

        // Print text leading up to the tag.
        $str = substr($html, $position, $tagPosition - $position);
        if ($printedLength + mb_strlen($str) > $maxLength) {
            // Cut by characters, then drop the trailing partial word.
            $newstr = mb_substr($str, 0, $maxLength - $printedLength);
            $newstr = preg_replace('~\s+\S+$~', '', $newstr);
            $newContent .= $newstr;
            $printedLength = $maxLength;
            break;
        }
        $newContent .= $str;
        $printedLength += mb_strlen($str);

        if ($tag[0] == '&') {
            // Handle the entity: it counts as one character.
            $newContent .= $tag;
            $printedLength++;
        } else {
            // Handle the tag.
            $tagName = strtolower($match[1][0]);
            if ($tag[1] == '/') {
                // Closing tag: emit it only if it matches the innermost open tag.
                if (end($tags) === $tagName) {
                    array_pop($tags);
                    $newContent .= $tag;
                }
            } else if ($tag[strlen($tag) - 2] == '/' || in_array($tagName, $voidTags)) {
                // Self-closing or void tag: nothing to close later.
                $newContent .= $tag;
            } else {
                // Opening tag.
                $newContent .= $tag;
                $tags[] = $tagName;
            }
        }

        // Continue after the tag (byte offset, so strlen rather than mb_strlen).
        $position = $tagPosition + strlen($tag);
    }

    // Print any remaining text.
    if ($printedLength < $maxLength && $position < strlen($html)) {
        $rest = substr($html, $position);
        $newstr = mb_substr($rest, 0, $maxLength - $printedLength);
        if ($newstr !== $rest) {
            // Only trim a partial word when the text was actually cut.
            $newstr = preg_replace('~\s+\S+$~', '', $newstr);
        }
        $newContent .= $newstr;
    }

    // Close any open tags.
    while (!empty($tags)) {
        $newContent .= sprintf('</%s>', array_pop($tags));
    }

    return $newContent;
}
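
The function trims a cut-off piece of text with the pattern ~\s+\S+$~ so that the output does not end in the middle of a word. A minimal sketch of that step in isolation, using a made-up cut position of 12 characters:

```php
<?php
$text = 'Just an Example Plain Text';

// Cutting at 12 characters lands in the middle of "Example".
$cut = mb_substr($text, 0, 12, 'UTF-8');

// Remove the last run of whitespace plus whatever follows it.
$trimmed = preg_replace('~\s+\S+$~', '', $cut);

echo $cut, "\n";     // Just an Exam
echo $trimmed, "\n"; // Just an
```

The pattern cannot tell a partial word from a complete one, so it also removes the last word when the cut happens to fall exactly on a word boundary. A string without any whitespace is left unchanged.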

It appears to give the result you expect.
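
One detail matters for multibyte input: preg_match with PREG_OFFSET_CAPTURE reports offsets in bytes, not characters, so those offsets pair with substr and strlen rather than with mb_substr and mb_strlen. A small check with a made-up sample string:

```php
<?php
// Made-up sample: one two-byte UTF-8 character followed by a tag.
$html = "\u{00E9}<b>x</b>";

preg_match('/<b>/', $html, $match, PREG_OFFSET_CAPTURE);
$byteOffset = $match[0][1];                        // offset in bytes
$charOffset = mb_strpos($html, '<b>', 0, 'UTF-8'); // offset in characters

echo $byteOffset, ' ', $charOffset, "\n"; // prints "2 1"

// The byte offset slices correctly with substr().
var_dump(substr($html, 0, $byteOffset) === "\u{00E9}"); // bool(true)
```

Character counts for the length limit can still come from mb_strlen, as long as slicing by match position stays byte-based.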
