How can I split HTML into N parts in PHP while preserving the markup?

Time: 2019-07-19 08:27:16

Tags: php domdocument

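The Google Text-to-Speech API has a quota of 5000 characters per request. We have an HTML page that therefore has to be split into several parts of at most 5000 characters each, without breaking words or HTML tags.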
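Before splitting anything, it helps to know whether a page exceeds the limit at all. The snippet below is a minimal sketch that assumes only the visible text counts toward the quota (as in the example that follows) and that the mbstring extension is available; `visible_text_length` is a hypothetical helper name, not part of any library.

```php
<?php
/**
 * Hypothetical helper: counts the visible characters of an HTML fragment.
 * Tags are dropped, entities count as one character each, and runs of
 * whitespace are collapsed to a single space.
 */
function visible_text_length(string $html): int
{
    // Remove the markup, then turn entities such as &amp; into real characters.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse indentation and line breaks so they do not inflate the count.
    $text = trim(preg_replace('/\s+/u', ' ', $text));
    return mb_strlen($text, 'UTF-8');
}
```

If `visible_text_length($html)` is at most 5000, the page can be sent as a single request and no splitting is needed.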

Here is an example of the input HTML (simplified):

<div id="myID">
  <span class="test">
    Links in PHP are a means of accessing the contents of one variable under different names.
  </span>
  <span>
    They are not like pointers in C and are not aliases for the symbol table.
  </span>
</div>
<p>
  In PHP, the name of a variable and its contents are different things, so one content can have different names.
</p>

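Suppose we split the text (the text only, not counting the markup) into fragments of about 70 characters, preserving the markup and not breaking words. The result would be: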

Part 1

<div id="myID">
  <span class="test">
    Links in PHP are a means of accessing the contents of one variable under
  </span>
</div>

Part 2

<div id="myID">
  <span class="test">
    different names.
  </span>
  <span>
    They are not like pointers in C and are not aliases for the symbol table.
  </span>
</div>

Part 3

<p>
  In PHP, the name of a variable and its contents are different things, so one
</p>

Part 4

<p>
  content can have different names.
</p>

A good solution for truncating HTML has been around for a long time:

/**
 * Truncates text.
 *
 * Cuts a string to the length of $length and replaces the last characters
 * with the ending if the text is longer than length.
 *
 * @param string  $text String to truncate.
 * @param integer $length Length of returned string, including ellipsis.
 * @param string  $ending Ending to be appended to the trimmed string.
 * @param boolean $exact If true, $text will not be cut mid-word
 * @param boolean $considerHtml If true, HTML tags would be handled correctly
 * @return string Trimmed string.
 */
function str_truncate($text, $length = 100, $ending = '...', $exact = true, $considerHtml = false) {
    if ($considerHtml) {
        // if the plain text is shorter than the maximum length, return the whole text
        if (strlen(preg_replace('/<.*?>/', '', $text)) <= $length) {
            return $text;
        }
        // split the text into (tag, following text) pairs that can be scanned one by one
        preg_match_all('/(<.+?>)?([^<>]*)/s', $text, $lines, PREG_SET_ORDER);
        $total_length = strlen($ending);
        $open_tags = array();
        $truncate = '';
        foreach ($lines as $line_matchings) {
            // if there is any html-tag in this line, handle it and add it (uncounted) to the output
            if (!empty($line_matchings[1])) {
                // if it's an "empty element" with or without xhtml-conform closing slash (e.g. <br/>)
                if (preg_match('/^<(\s*.+?\/\s*|\s*(img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param)(\s.+?)?)>$/is', $line_matchings[1])) {
                    // do nothing
                // if tag is a closing tag (e.g. </b>)
                } else if (preg_match('/^<\s*\/([^\s]+?)\s*>$/s', $line_matchings[1], $tag_matchings)) {
                    // delete tag from $open_tags list (names are stored in lower case)
                    $pos = array_search(strtolower($tag_matchings[1]), $open_tags);
                    if ($pos !== false) {
                        unset($open_tags[$pos]);
                    }
                // if tag is an opening tag (e.g. <b>)
                } else if (preg_match('/^<\s*([^\s>!]+).*?>$/s', $line_matchings[1], $tag_matchings)) {
                    // add tag to the beginning of $open_tags list
                    array_unshift($open_tags, strtolower($tag_matchings[1]));
                }
                // add html-tag to $truncate'd text
                $truncate .= $line_matchings[1];
            }
            // calculate the length of the plain text part of the line; handle entities as one character
            $content_length = strlen(preg_replace('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', ' ', $line_matchings[2]));
            if ($total_length + $content_length > $length) {
                // the number of characters which are left
                $left = $length - $total_length;
                $entities_length = 0;
                // search for html entities
                if (preg_match_all('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', $line_matchings[2], $entities, PREG_OFFSET_CAPTURE)) {
                    // calculate the real length of all entities in the legal range
                    foreach ($entities[0] as $entity) {
                        if ($entity[1] + 1 - $entities_length <= $left) {
                            $left--;
                            $entities_length += strlen($entity[0]);
                        } else {
                            // no more characters left
                            break;
                        }
                    }
                }
                $truncate .= substr($line_matchings[2], 0, $left + $entities_length);
                // maximum length is reached, so get off the loop
                break;
            } else {
                $truncate .= $line_matchings[2];
                $total_length += $content_length;
            }
            // if the maximum length is reached, get off the loop
            if ($total_length >= $length) {
                break;
            }
        }
    } else {
        if (strlen($text) <= $length) {
            return $text;
        } else {
            $truncate = substr($text, 0, $length - strlen($ending));
        }
    }
    // if the words shouldn't be cut in the middle...
    if ($exact) {
        // ...search for the last occurrence of a space...
        $spacepos = strrpos($truncate, ' ');
        // strrpos() returns false when there is no space, so isset() would always be true here
        if ($spacepos !== false) {
            // ...and cut the text in this position
            $truncate = substr($truncate, 0, $spacepos);
        }
    }
    // add the defined ending to the text
    $truncate .= $ending;
    if ($considerHtml) {
        // close all unclosed html-tags
        foreach ($open_tags as $tag) {
            $truncate .= '</' . $tag . '>';
        }
    }
    return $truncate;
}

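Its only drawback is that it returns just the first part of the HTML. Ideally we would get not only the first part but every following part as well, all the way to the end.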

I would appreciate any pointers on where to dig.
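One possible direction for getting every part rather than only the first is sketched below. It is a rough outline under several assumptions, not a tested solution: the HTML is well formed, attribute values contain no ">", there are no comments or script blocks, whitespace may be normalized, and the mbstring extension is available. It uses a simple regex tokenizer instead of DOMDocument, and `split_html_by_text_length` is a hypothetical name. The idea is to keep a stack of open elements, close them when the text budget of a part is used up, and reopen them (with their original attributes) at the start of the next part.

```php
<?php
/**
 * Hypothetical helper: splits $html into parts whose visible text is at most
 * $limit characters, without breaking words and with balanced tags per part.
 * A single word longer than $limit still gets a part of its own.
 */
function split_html_by_text_length(string $html, int $limit): array
{
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'param', 'source', 'track', 'wbr'];
    // Tokenize into tags and text runs.
    $tokens = preg_split('/(<[^>]+>)/', $html, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $parts   = [];
    $buffer  = '';  // markup of the current part, up to the last word or closing tag
    $pending = '';  // tags seen after $buffer that are not yet followed by text
    $base    = [];  // elements open at the end of $buffer: [name, opening tag]
    $open    = [];  // elements open at the end of $pending
    $used    = 0;   // visible characters in the current part

    // Builds the closing tags for a stack of open elements, innermost first.
    $closers = function (array $stack): string {
        $out = '';
        foreach (array_reverse($stack) as $element) {
            $out .= '</' . $element[0] . '>';
        }
        return $out;
    };

    foreach ($tokens as $token) {
        if (preg_match('~^</~', $token)) {
            // Closing tag: it stays with the current part if no tag is pending.
            array_pop($open);
            if ($pending === '') {
                $buffer .= $token;
                array_pop($base);
            } else {
                $pending .= $token;
            }
            continue;
        }
        if (preg_match('~^<[^>]+>$~', $token)) {
            // Opening (or other) tag: defer it until text follows, so that a
            // part never ends with an element that has no content yet.
            if (preg_match('~^<([a-z][a-z0-9]*)~i', $token, $m)) {
                $name = strtolower($m[1]);
                if (!in_array($name, $void, true) && substr($token, -2) !== '/>') {
                    $open[] = [$name, $token];
                }
            }
            $pending .= $token;
            continue;
        }
        // Text run: add it word by word.
        foreach (preg_split('/\s+/u', $token, -1, PREG_SPLIT_NO_EMPTY) as $word) {
            // Entities such as &amp; count as one character.
            $len = mb_strlen(html_entity_decode($word, ENT_QUOTES | ENT_HTML5, 'UTF-8'));
            if ($used > 0 && $used + 1 + $len > $limit) {
                // Budget exhausted: close the part and reopen the same elements.
                $parts[] = $buffer . $closers($base);
                $buffer  = implode('', array_column($base, 1));
                $used    = 0;
            }
            $buffer .= $pending . ($used > 0 ? ' ' : '') . $word;
            $used   += $len + ($used > 0 ? 1 : 0);
            $pending = '';
            $base    = $open;
        }
    }

    $rest = $buffer . $pending;
    if ($rest !== '') {
        $parts[] = $rest . $closers($open);
    }
    return $parts;
}
```

With this sketch, `split_html_by_text_length($html, 5000)` would return an array of HTML fragments, each of which could be sent as a separate request. Deferring opening tags in `$pending` is a design choice: it keeps a tag such as `<p>` together with its first word instead of leaving an empty `<p></p>` at the end of the previous part. For real-world markup, a DOMDocument-based traversal would likely be more robust than the regex tokenizer assumed here.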

0 answers:

No answers yet