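Wordwrap / cut text in an HTML string

Asked: 2011-12-12 23:17:47

Tags: php html html-parsing word-wrap

Here is what I want to do: I have a string containing HTML tags, and I want to cut it with the wordwrap function without counting the HTML tags.

I'm stuck: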

public function textWrap($string, $width)
{
    $dom = new DOMDocument();
    $dom->loadHTML($string);
    foreach ($dom->getElementsByTagName('*') as $elem)
    {
        foreach ($elem->childNodes as $node)
        {
            if ($node->nodeType === XML_TEXT_NODE)
            {
                $text = trim($node->nodeValue);
                $length = mb_strlen($text);
                $width -= $length;
                if($width <= 0)
                { 
                    // Here, I would like to delete all next nodes
                    // and cut the current nodeValue and finally return the string 
                }
            }
        }
    }
}

I'm not sure I'm going about this the right way. I hope it's clear...

Edit:

Here is an example. I have this text:

    <p>
        <span class="Underline"><span class="Bold">Test to be cut</span></span>
   </p><p>Some text</p>

Suppose I want to cut it at the 6th character; I would like it to return:

<p>
    <span class="Underline"><span class="Bold">Test to</span></span>
</p>

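2 answers:

Answer 0 (score: 3)

As I wrote in a comment, you first need to find the textual offset at which to cut.

First I set up a DOMDocument containing the HTML fragment, then select the body that represents it in the DOM: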

$htmlFragment = <<<HTML
<p>
        <span class="Underline"><span class="Bold">Test to be cut</span></span>
   </p><p>Some text </p>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($htmlFragment);
$parent = $dom->getElementsByTagName('body')->item(0);
if (!$parent)
{
    throw new Exception('Parent element not found.');
}

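Then I use my TextRange class to find the place where the cut needs to be made, and use the TextRange to actually make the cut and to find the last node that should remain in the fragment: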

$range = new TextRange($parent);

// find the position where to cut the HTML textual representation
// by looking for a word boundary, or at least for matching
// whitespace, with a regular expression.
$width = 17;
$pattern = sprintf('~^.{0,%d}(?<=\S)(?=\s)|^.{0,%1$d}(?=\s)~su', $width);
$r = preg_match($pattern, $range, $matches);
if (FALSE === $r)
{
    throw new Exception('Wordcut regex failed.');
}
if (!$r)
{
    throw new Exception(sprintf('Text "%s" is not cut-able (should not happen).', $range));
}

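This regular expression finds the offset of the cut position in the textual representation provided by $range. The pattern is inspired by another answer, which discusses it in more detail, and has been slightly modified to fit this answer's needs.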
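To see what the pattern does in isolation, here is a minimal sketch that applies it to a plain string instead of a TextRange. The function name findCut is hypothetical and only used for illustration; the pattern itself is the one from the answer, written with the positional %1$d in both places.

```php
<?php
// Hypothetical helper for illustration; the answer applies the same
// pattern to a TextRange object instead of a plain string.
function findCut($text, $width)
{
    // First alternative: up to $width characters, ending right after a
    // non-whitespace character and right before a whitespace character.
    // Second alternative: fallback that only requires whitespace to follow.
    $pattern = sprintf('~^.{0,%1$d}(?<=\S)(?=\s)|^.{0,%1$d}(?=\s)~su', $width);

    if (!preg_match($pattern, $text, $matches)) {
        return null; // no whitespace within reach, nothing to cut at
    }
    return $matches[0];
}

var_dump(findCut('Test to be cut', 6));  // string(4) "Test"
var_dump(findCut('Test to be cut', 7));  // string(7) "Test to"
var_dump(findCut('Test', 10));           // NULL
```

Note that the pattern always needs a whitespace character after the cut, so a text without any whitespace does not match at all, which is the case the answer's code reports as "not cut-able".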

// chop-off the textnodes to make a cut in DOM possible
$range->split($matches[0]);
$nodes = $range->getNodes();
$cutPosition = end($nodes);

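Since it is possible that nothing is left after the cut (e.g. the body would become empty), I need to handle that special case. Otherwise, as noted in the comments, all following nodes need to be removed: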

// obtain list of elements to remove with xpath
if (FALSE === $cutPosition)
{
    // if there is no node, delete all parent children
    $cutPosition = $parent;
    $xpath = 'child::node()';
}
else
{
    $xpath = 'following::node()';
}
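As a small illustration of why the following axis fits here, this sketch lists what following::node() selects relative to a text node. The HTML string and variable names are made up for the example; it assumes a reasonably current libxml that returns results in document order.

```php
<?php
// Illustrative fragment (not from the question).
$dom = new DOMDocument();
$dom->loadHTML('<p><b>keep</b><i>drop</i></p><p>drop too</p>');
$xp = new DOMXPath($dom);

// Context node: the text node inside <b>.
$context = $xp->query('//b/text()')->item(0);

// following::node() selects every node that starts after the context
// node in document order, excluding its ancestors and descendants.
$names = array();
foreach ($xp->query('following::node()', $context) as $node) {
    $names[] = $node->nodeName;
}

echo implode(', ', $names), "\n";
```

Here the axis picks up the i element, its text, the second p element and its text, but neither the b element nor the first p, since both are ancestors of the context node. That is why the enclosing tags survive the removal.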

The rest is straightforward: query the xpath, remove the nodes and output the result:

// execute xpath
$xp = new DOMXPath($dom);
$remove = $xp->query($xpath, $cutPosition);
if (!$remove)
{
    throw new Exception('XPath query failed to obtain elements to remove');
}

// remove nodes
foreach($remove as $node)
{
    $node->parentNode->removeChild($node);
}

// inner HTML (PHP >= 5.3.6)
foreach($parent->childNodes as $node)
{
    echo $dom->saveHTML($node);
}
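Since the TextRange class is not reproduced in this answer, the steps above can be condensed into a rough sketch that uses only DOMDocument, DOMXPath and mbstring. This is not the TextRange implementation: truncateHtml is a hypothetical name, the sketch assumes plain ASCII or otherwise correctly decoded input, it leaves the markup untouched when the text already fits or contains no whitespace to cut at, and it needs PHP 5.3.6 or later for saveHTML with a node argument.

```php
<?php
// Hypothetical helper, a sketch of the same steps without TextRange.
function truncateHtml($html, $width)
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $body = $dom->getElementsByTagName('body')->item(0);
    if (!$body) {
        throw new Exception('Parent element not found.');
    }

    // Serialises the children of <body> (inner HTML).
    $inner = function () use ($dom, $body) {
        $out = '';
        foreach ($body->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
        return $out;
    };

    $text = $body->textContent;
    if (mb_strlen($text, 'UTF-8') <= $width) {
        return $inner(); // assumption of this sketch: short text is kept whole
    }

    // Same word-boundary pattern as in the answer.
    $pattern = sprintf('~^.{0,%1$d}(?<=\S)(?=\s)|^.{0,%1$d}(?=\s)~su', $width);
    if (!preg_match($pattern, $text, $m)) {
        return $inner(); // no whitespace to cut at
    }
    $remaining = mb_strlen($m[0], 'UTF-8');

    // Walk the text nodes in document order and shorten the one that
    // contains the cut offset.
    $xp = new DOMXPath($dom);
    $last = null;
    foreach ($xp->query('.//text()', $body) as $node) {
        $length = mb_strlen($node->data, 'UTF-8');
        if ($remaining >= $length) {
            $remaining -= $length;
            $last = $node;
            continue;
        }
        if ($remaining > 0) {
            $node->data = mb_substr($node->data, 0, $remaining, 'UTF-8');
            $last = $node;
        }
        break;
    }

    // Remove everything after the last kept node, or all children of
    // <body> if nothing is kept.
    $query = $last ? 'following::node()' : 'child::node()';
    foreach ($xp->query($query, $last ? $last : $body) as $node) {
        if ($node->parentNode) {
            $node->parentNode->removeChild($node);
        }
    }
    return $inner();
}

echo truncateHtml(
    '<p><span class="Underline"><span class="Bold">Test to be cut</span></span></p><p>Some text</p>',
    7
), "\n";
```

The fragment passed in at the end is the question's example without the indentation, so a width of 7 is enough to keep "Test to"; with the original whitespace in place the width has to account for it, as with TextRange.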

The full code example is available on viper codepad incl. the TextRange class. The codepad has a bug, so its result is not correct (related: XPath query result order). The actual output is the following:

<p>
        <span class="Underline"><span class="Bold">Test to</span></span></p>

So make sure you have a current libxml version (normally the case). Also note that the output foreach at the end uses the PHP function saveHTML with a node parameter, which is available since PHP 5.3.6. If you don't have that PHP version, use an alternative such as the one outlined in How to get the xml content of a node as a string? or a similar question.

When you look closely at my example code, you might notice that the cut length is quite large ($width = 17;). That is because there are many whitespace characters in front of the text. This can be tweaked by making the regular expression drop any number of whitespace characters in front of it and/or by trimming the TextRange first. The second option needs more functionality; I wrote something quick that can be used after creating the initial range:

...
$range = new TextRange($parent);
$trimmer = new TextRangeTrimmer($range);
$trimmer->trim();
...

This removes the unnecessary whitespace on the left and right inside the HTML fragment. The TextRangeTrimmer code is as follows:

class TextRangeTrimmer
{
    /**
     * @var TextRange
     */
    private $range;

    /**
     * @var array
     */
    private $charlist;

    public function __construct(TextRange $range, Array $charlist = NULL)
    {
        $this->range = $range;
        $this->setCharlist($charlist);
    }

    /**
     * @param array $charlist list of UTF-8 encoded characters
     * @throws InvalidArgumentException
     */
    public function setCharlist(Array $charlist = NULL)
    {
        if (NULL === $charlist)
            $charlist = str_split(" \t\n\r\0\x0B");

        $list = array();

        foreach($charlist as $char)
        {
            if (!is_string($char))
            {
                throw new InvalidArgumentException('Not an Array of strings.');
            }
            if (strlen($char))
            {
                $list[] = $char;
            }
        }

        $this->charlist = array_flip($list);
    }

    /**
     * @return array characters
     */
    public function getCharlist()
    {
        return array_keys($this->charlist);
    }

    public function trim()
    {
        if (!$this->charlist) return;
        $this->ltrim();
        $this->rtrim();
    }

    /**
     * number of consecutive characters of $charlist from $start to $direction
     *
     * @param array $charlist
     * @param int $start offset
     * @param int $direction 1: forward, -1: backward
     * @throws InvalidArgumentException
     */
    private function lengthOfCharacterSequence(Array $charlist, $start, $direction = 1)
    {
        $start = (int) $start;
        $direction = max(-1, min(1, $direction));
        if (!$direction) throw new InvalidArgumentException('Direction must be 1 or -1.');

        $count = 0;
        for(;$char = $this->range->getCharacter($start), $char !== ''; $start += $direction, $count++)
            if (!isset($charlist[$char])) break;

        return $count;
    }

    public function ltrim()
    {
        $count = $this->lengthOfCharacterSequence($this->charlist, 0);

        if ($count)
        {
            $remainder = $this->range->split($count);
            foreach($this->range->getNodes() as $textNode)
            {
                $textNode->parentNode->removeChild($textNode);
            }
            $this->range->setNodes($remainder->getNodes());
        }
    }

    public function rtrim()
    {
        $count = $this->lengthOfCharacterSequence($this->charlist, -1, -1);

        if ($count)
        {
            $chop = $this->range->split(-$count);
            foreach($chop->getNodes() as $textNode)
            {
                $textNode->parentNode->removeChild($textNode);
            }
        }
    }
}
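The other option, adjusting the regular expression so that whitespace in front of the text does not count toward the width, is not spelled out in this answer. One possible reading is sketched below on a plain string; the function name and the modified pattern are assumptions made for illustration, not the author's code.

```php
<?php
// Hypothetical variant of the word-cut pattern: \s*+ possessively
// consumes leading whitespace first, so only the characters after it
// count toward $width. The match still includes that whitespace, so
// its length remains a valid offset into the original text.
function findCutIgnoringLeadingSpace($text, $width)
{
    $pattern = sprintf('~^\s*+.{0,%1$d}(?<=\S)(?=\s)~su', $width);
    return preg_match($pattern, $text, $m) ? $m[0] : null;
}

// Text as DOM would report it for the example fragment: a newline and
// eight spaces before the first word.
$text = "\n" . str_repeat(' ', 8) . "Test to be cut\n   Some text ";
$cut  = findCutIgnoringLeadingSpace($text, 7);

var_dump(trim($cut));               // string(7) "Test to"
var_dump(mb_strlen($cut, 'UTF-8')); // int(16)
```

With this variant a width of 7 selects the same 16 characters that the original pattern needed $width = 17 for.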

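Hope this is helpful.

Answer 1 (score: 0)

If using DOM parsing is not the goal and you just need to truncate the HTML, take a look at the cot_string_truncate function in this Gist. It is taken from Cotonti CMF.

It handles both plain text and HTML. You can set the length and choose how the text is truncated: exactly at the character limit, or at the nearest word boundary.

It correctly counts an HTML entity, or a run of consecutive whitespace characters, as a single character (the way it appears in a browser), so your example should work fine: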

$test_str = "<p>
    <span class=\"Underline\"><span class=\"Bold\">Test to be cut</span></span>
</p><p>Some text</p>";

echo cot_string_truncate($test_str, 8);

Result:

<p>
     <span class="Underline"><span class="Bold">Test to</span></span></p>