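Comparing two XML Document trees with repeating nodes

Time: 2017-08-04 10:39:40

Tags: java algorithm xml-parsing

I am trying to compare two XML files and record all the differences. The problem arises when nodes start repeating. For these two files: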

<root>
    <a/>
    <a/>
    <b/>
</root>

<root>
    <a/>
    <b/>
</root>

my program currently records no differences. The (big and ugly) method looks like this:

private void searchDocumentTrees (Node nodeA, Node nodeB, ArrayList<String> differences) {
    if (nodeA.hasChildNodes() && !nodeB.hasChildNodes()) {
        // record A deeper at this node
        return;
    }
    else if (!nodeA.hasChildNodes() && nodeB.hasChildNodes()) {
        // record B deeper at this node
        return;
    }

    else if (!nodeA.hasChildNodes() && !nodeB.hasChildNodes()) {
        return;
    }
    NodeList childrenA = nodeA.getChildNodes();
    NodeList childrenB = nodeB.getChildNodes();

    // indexes of nodes present in both lists of children as 
    // NodeList doesn't allow searching by value
    ArrayList<Integer> presentInBothIndexA = new ArrayList<>();
    ArrayList<Integer> presentInBothIndexB = new ArrayList<>();

    // check for nodes present in both trees, record those present only in A
    for (int indexA = 0; indexA < childrenA.getLength(); indexA++) {
        boolean isPresentInBoth = false;
        Node currentA = childrenA.item(indexA);
        if (currentA.getNodeType() == Node.ELEMENT_NODE) {

            for (int indexB = 0; indexB < childrenB.getLength(); indexB++) {
                Node currentB = childrenB.item(indexB);
                if (currentB.getNodeType() == Node.ELEMENT_NODE) {
                    // if the nodes match, record their indexes and break from inner loop
                    if (currentA.getNodeName().equals(currentB.getNodeName())) {
                        isPresentInBoth = true;
                        presentInBothIndexA.add(indexA);
                        presentInBothIndexB.add(indexB);
                        break;
                    }
                }
            }

            // if the flag has not been changed currentA is not present in childrenB
            if (!isPresentInBoth) {
                // record as present only in A
            }
        }
    }

    // record nodes present only in B
    for (...){
          /* same nested loop - this time the outer is iterating over B  
          and matching nodes indexes are not recorded - record only B - A */
    }

    for (int indexBoth = 0, len = presentInBothIndexA.size(); indexBoth < len; indexBoth++) {
        Node currentA = childrenA.item(presentInBothIndexA.get(indexBoth));
        Node currentB = childrenB.item(presentInBothIndexB.get(indexBoth));
        searchDocumentTrees(currentA,currentB,differences);
    }
}

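My first idea was to replace the isPresentInBoth flag with a counter of occurrences in both files, but that would probably introduce a third loop and increase the complexity even further. Do you have a better idea?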

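1 answer:

Answer 0 (score: 0)

I found two solutions:

Solution 1.

After trying various (inefficient) approaches, such as counting the occurrences of nodes and storing them in a hash table, I realized that I already had structures storing the indexes of the matching nodes. These are, of course: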

ArrayList<Integer> presentInBothIndexA = new ArrayList<>();
ArrayList<Integer> presentInBothIndexB = new ArrayList<>();

So instead of just letting them hang around, I put them to work:

// pseudo-code for simplification
for (nodeA in fileA) {
    for (nodeB in fileB) {
        // check all the aforementioned conditions
        if (presentInBothIndexB.contains(indexB))
            continue; // skip if it was already matched to an earlier nodeA
        // else, do all the other stuff - isPresentInBoth = true, and so on
    }
}

Now the second loop no longer needs an inner loop:

for (nodeB in B) {
    if (!presentInBothIndexB.contains(indexB)) {
        // record difference - we only need to look for the nodes that were
        // never matched by the first loop, i.e. not present in file A
    }
}
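Put together, the two loops might look like the following minimal sketch. It is only an illustration under a few assumptions: the class name SiblingMatcher, the diff helper, and the path strings it records are hypothetical, both roots are assumed to have the same name, and a boolean array stands in for the presentInBothIndexB.contains(indexB) check (same idea, constant-time lookup). Only element names are compared, as in the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class SiblingMatcher {

    /** Parses both XML strings and returns the recorded differences (hypothetical helper). */
    static List<String> diff(String xmlA, String xmlB) {
        List<String> differences = new ArrayList<>();
        compare(parse(xmlA), parse(xmlB), "", differences);
        return differences;
    }

    private static Node parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Assumes nodeA and nodeB were already matched by name (true for equal roots).
    private static void compare(Node nodeA, Node nodeB, String path, List<String> differences) {
        String here = path + "/" + nodeA.getNodeName();
        NodeList childrenA = nodeA.getChildNodes();
        NodeList childrenB = nodeB.getChildNodes();
        // matchedB[i] is true once child i of B has been paired with a child of A
        boolean[] matchedB = new boolean[childrenB.getLength()];

        for (int indexA = 0; indexA < childrenA.getLength(); indexA++) {
            Node currentA = childrenA.item(indexA);
            if (currentA.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace text nodes, comments, etc.
            }
            int match = -1;
            for (int indexB = 0; indexB < childrenB.getLength(); indexB++) {
                Node currentB = childrenB.item(indexB);
                // skip B children that are already paired, so repeats are counted
                if (!matchedB[indexB]
                        && currentB.getNodeType() == Node.ELEMENT_NODE
                        && currentA.getNodeName().equals(currentB.getNodeName())) {
                    match = indexB;
                    break;
                }
            }
            if (match < 0) {
                differences.add("only in A: " + here + "/" + currentA.getNodeName());
            } else {
                matchedB[match] = true;
                compare(currentA, childrenB.item(match), here, differences);
            }
        }

        // second pass: no inner loop, just report B children that were never paired
        for (int indexB = 0; indexB < childrenB.getLength(); indexB++) {
            Node currentB = childrenB.item(indexB);
            if (currentB.getNodeType() == Node.ELEMENT_NODE && !matchedB[indexB]) {
                differences.add("only in B: " + here + "/" + currentB.getNodeName());
            }
        }
    }
}
```

With the two files from the question, SiblingMatcher.diff(fileA, fileB) should return a single entry for the extra <a/> under root in the first file, where the original method returned nothing.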

This approach has its drawback: it compares nodes in the order in which they appear in the files, so in this case:

<r>
  <a/>
  <a/>
  <a><b/></a>
</r>

and:

<r>
  <a/>
  <a/>
</r>

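it will record that the number of nodes differs, but it will not search any deeper in the first file. This is because, once two nodes have been recorded as the same, it simply does not look any further. That is annoying, but I guess we can live with such an assumption. However, there are also attributes and values to compare, and the whole thing gets messier and messier, which brings me to:

Solution 2.

Just use XMLUnit. Seriously.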