Question

For example, we have this XML:

<x>
    <y>some text</y>
    <y>[ID] hello</y>
    <y>world [/ID]</y>
    <y>some text</y>
    <y>some text</y>
</x>

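We need to remove the markers "[ID]" and "[/ID]" together with the text between them (which we do not know at parse time), without breaking the XML, of course.

The only solution I can think of is:

1. Use a regular expression to find the text in the XML, for example "/\[ID\].*?\[\/ID\]/". In our case the result would be "[ID]hello</y><y>world[/ID]".
2. In the result of the previous step, find the text that is not part of an XML tag with this regular expression: "/(?<=^|>)[^><]+?(?=<|$)/", and delete it. The result would be "</y><y>".
3. Change the original XML by doing something like this: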

str_replace($step1string,$step2string,$xml);

Is this the right approach? I just feel that this "str_replace" thing is not the best way to edit XML, so maybe you know a better solution?

Answer 1

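For your amusement and enlightenment, you might want to read: RegEx match open tags except XHTML self-contained tags

The "right" solution is to use an XML library and search the nodes to perform the operation. However, using str_replace is probably much easier, even though it can damage the XML. You have to gauge how likely you are to receive something like <a href="[ID]"> and how important it is to guard against such cases, and weigh those factors against development time.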
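To make that trade-off concrete, here is a minimal sketch of the string-based route from the question. It assumes the markers only ever appear in text content, never inside a tag or an attribute value. The name stripMarkedSpan is hypothetical, and the inner pattern is a simplification of the asker's second regex: it keeps every tag found inside the matched span and drops everything else.

```php
<?php
// Hypothetical helper: removes [ID] ... [/ID] and the text in between,
// keeping any tags inside the span so the element structure survives.
// Assumption: the markers never occur inside a tag or an attribute value.
function stripMarkedSpan(string $xml): string
{
    return preg_replace_callback(
        '/\[ID\].*?\[\/ID\]/s',   // s modifier: let . match line breaks too
        function (array $match): string {
            // Collect the tags inside the span and discard all other text.
            preg_match_all('/<[^<>]+>/', $match[0], $tags);
            return implode('', $tags[0]);
        },
        $xml
    );
}

$xml = '<x>
    <y>some text</y>
    <y>[ID] hello</y>
    <y>world [/ID]</y>
    <y>some text</y>
</x>';

$result = stripMarkedSpan($xml);
echo $result;
```

If the input contains something like <a href="[ID]">, the match starts inside a tag and the output is no longer well-formed, which is exactly the risk you would be accepting.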

Answer 2

Removing the specific strings is easy:

<?php
$xml = '<x>
    <y>some text</y>
    <y>[ID] hello</y>
    <y>world [/ID]</y>
    <y>some text</y>
    <y>some text</y>
</x>';

$d = new DOMDocument();
$d->loadXML($xml);
$x = new DOMXPath($d);
foreach($x->query('//text()[(contains(.,\'[ID]\') or contains(.,\'[/ID]\'))]') as $elm){
    $elm->nodeValue = preg_replace('/\[\/?ID\]/','',$elm->nodeValue);
}
var_dump($d->saveXML());
?>

If the text to remove only lives in the text nodes that hold the markers, you could change the preg_replace to these two:

 $elm->nodeValue = preg_replace('/\[ID\].*$/','',$elm->nodeValue);
 $elm->nodeValue = preg_replace('/^.*\[\/ID\]/','',$elm->nodeValue);

For your example, this results in:

<x>
<y>some text</y>
<y></y>
<y></y>
<y>some text</y>
<y>some text</y>
</x>
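Put together, that variant might look like the sketch below. It assumes [ID] and [/ID] each sit in a text node of their own and that nothing between those two nodes needs removing. The s modifier is an addition so that . also covers line breaks inside a text node, and removeWithinTextNodes is a hypothetical name.

```php
<?php
// Hypothetical helper: strips the marked text inside the text nodes
// that contain [ID] or [/ID], leaving all elements in place.
// Assumption: [ID] and [/ID] are in two different text nodes.
function removeWithinTextNodes(string $xml): DOMDocument
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $xpath = new DOMXPath($doc);

    $query = "//text()[contains(., '[ID]') or contains(., '[/ID]')]";
    foreach ($xpath->query($query) as $node) {
        // Drop everything from [ID] to the end of this text node ...
        $value = preg_replace('/\[ID\].*$/s', '', $node->nodeValue);
        // ... and everything from the start of the node up to [/ID].
        $value = preg_replace('/^.*?\[\/ID\]/s', '', $value);
        $node->nodeValue = $value;
    }
    return $doc;
}

$xml = '<x>
    <y>some text</y>
    <y>[ID] hello</y>
    <y>world [/ID]</y>
    <y>some text</y>
    <y>some text</y>
</x>';

$doc = removeWithinTextNodes($xml);
echo $doc->saveXML();
```

Any element sitting between the two marker nodes is left untouched by this version, which is where the harder cases begin.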

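However, removing the nodes in between without damaging well-formed XML is quite tricky. Before venturing into a lot of DOM manipulation, how would you want to handle:

An [/ID] nested deeper in the DOM tree than its [ID]: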

<foo>[ID] foo
    <bar> lorem [/ID] ipsum </bar>
</foo>

An [/ID] further up the DOM tree than its [ID]:

<foo> foo
    <bar> lorem [ID] ipsum </bar>
    [/ID]
</foo>

Opening and closing markers spanning siblings, as in your example:

<foo> foo
    <bar> lorem [ID] ipsum </bar>
    <bar> lorem [/ID] ipsum </bar>
</foo>

And a real breaker of the question: is nesting possible, is that nesting well-formed, and what should happen then?

<foo> foo
    <bar> lo  [ID] rem [ID] ipsum </bar>
    <bar> lorem [/ID] ipsum </bar>
    [/ID]
</foo>

Without knowing more about how these cases should be handled, there is no real answer.
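To find out which of these situations a given document is actually in, something like the following sketch could be run first. classifyMarkers and the labels it returns are hypothetical names. It assumes a marker is never split across two text nodes, and it does not check that [ID] comes before [/ID] in document order.

```php
<?php
// Hypothetical helper: reports where [/ID] sits relative to [ID].
// Assumption: a marker is never split across two text nodes.
function classifyMarkers(string $xml): string
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    // More than one pair, or a missing marker, means nesting or imbalance.
    $text = $doc->documentElement->textContent;
    if (substr_count($text, '[ID]') !== 1 || substr_count($text, '[/ID]') !== 1) {
        return 'nested-or-unbalanced';
    }

    $xpath = new DOMXPath($doc);
    $open  = $xpath->query("//text()[contains(., '[ID]')]")->item(0);
    $close = $xpath->query("//text()[contains(., '[/ID]')]")->item(0);
    if ($open->isSameNode($close)) {
        return 'same-text-node';
    }

    $openParent  = $open->parentNode;
    $closeParent = $close->parentNode;
    if ($openParent->isSameNode($closeParent)) {
        return 'same-element';
    }

    // Walk up from $node and report whether $ancestor is on the way.
    $isInside = function (DOMNode $node, DOMNode $ancestor): bool {
        for ($n = $node->parentNode; $n !== null; $n = $n->parentNode) {
            if ($n->isSameNode($ancestor)) {
                return true;
            }
        }
        return false;
    };

    if ($isInside($closeParent, $openParent)) {
        return 'close-in-descendant';
    }
    if ($isInside($openParent, $closeParent)) {
        return 'close-in-ancestor';
    }
    if ($openParent->parentNode->isSameNode($closeParent->parentNode)) {
        return 'sibling-elements';
    }
    return 'unrelated-branches';
}

echo classifyMarkers('<foo>[ID] foo <bar> lorem [/ID] ipsum </bar></foo>'), "\n";
echo classifyMarkers('<foo> foo <bar> lorem [ID] ipsum </bar> [/ID]</foo>'), "\n";
echo classifyMarkers('<foo><bar> lorem [ID] ipsum </bar><bar> lorem [/ID] ipsum </bar></foo>'), "\n";
```

Each label corresponds to one of the cases listed above, so the result tells you which removal strategy you would need to settle on.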

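Edit: now that better information has been given, the actual, fail-safe solution (i.e. parse the XML, don't use regexes) turns out to be rather long, but it will work in 99.99% of cases (personal typos and brain farts excluded, of course :) ):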

<?php
$xml = '<x>
    <y>some text</y>
    <y>
      <a> something </a>
      well [ID] hello
      <a> and then some</a>
    </y>
    <y>some text</y>
    <x>
      world
      <a> also </a>
        foobar [/ID] something
      <a> these nodes </a>
    </x>
    <y>some text</y>
    <y>some text</y>
</x>';
echo $xml;
$d = new DOMDocument();
$d->loadXML($xml);
$x = new DOMXPath($d);
foreach($x->query('//text()[contains(.,\'[ID]\')]') as $elm){
        //if this node also contains [/ID], replace and be done:
        if(($startpos = strpos($elm->nodeValue,'[ID]'))!==false && $endpos = strpos($elm->nodeValue,'[/ID]',$startpos)){
                $elm->replaceData($startpos, $endpos-$startpos + 5,'');
                var_dump($d->saveXML($elm));
                continue;
        }
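        //no [/ID] in this textnode: cut its text off right before [ID]
        $elm->nodeValue = substr($elm->nodeValue, 0, $startpos);
        //delete all following siblings until a textnode holding [/ID] turns up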
        while($elm->nextSibling){
                if(!($elm->nextSibling instanceof DOMText) || ($pos = strpos($elm->nextSibling->nodeValue,'[/ID]'))===false){
                        $elm->parentNode->removeChild($elm->nextSibling);
                } else {
                        //[/ID] found in a sibling textnode: keep the text after it, then go to next [ID]
                        $elm->appendData(substr($elm->nextSibling->nodeValue,$pos+5));
                        $elm->parentNode->removeChild($elm->nextSibling);
                        continue 2;
                }
        }
        //siblings of textnode deleted, string truncated to before [ID], now let's delete intermediate nodes
        while($sibling = $elm->parentNode->nextSibling){ // in case of example: other <y> elements:
                //loop through childnodes and search a textnode with [/ID]
                while($child = $sibling->firstChild){
                        //delete if not a textnode
                        if(!($child instanceof DOMText)){
                                $sibling->removeChild($child);
                                continue;
                        }
                        //we have text, check for [/ID]
                        if(($pos = strpos($child->nodeValue,'[/ID]'))!==false){
                                //add remaining text in textnode:
                                $elm->appendData(substr($child->nodeValue,$pos+5));
                                //remove current textnode with match:
                                $sibling->removeChild($child);
                                //sanity check: [ID] was in <y>, is [/ID]?
                                if($sibling->tagName != $elm->parentNode->tagName){
                                        trigger_error('[/ID] found in other tag than [ID]: '.$sibling->tagName.'<>'.$elm->parentNode->tagName, E_USER_NOTICE);
                                }
                                //add remaining childs of sibling to parent of [ID]:
                                while($sibling->firstChild){
                                        $elm->parentNode->appendChild($sibling->firstChild);
                                }
                                //delete the sibling that was found to hold [/ID]
                                $sibling->parentNode->removeChild($sibling);
                                //done: end both whiles
                                break 2;
                        }
                        //textnode, but no [/ID], so remove:
                        $sibling->removeChild($child);
                }
                //no child, no text, so no [/ID], remove:
                $elm->parentNode->parentNode->removeChild($sibling);
        }
}
var_dump($d->saveXML());
?>

Answer 3

The only other option I can think of is to format the XML differently.

<x>
  <y>
    <z>[ID]</z>

Complex editing of an XML file
