How to find identical elements under a common parent in XML?

Posted: 2013-08-04 19:36:45

Tags: c# xml xslt

I need to find the places in my XML where elements are duplicated. For example, given XML like this:

<menu>
  <juice sugar="yes" fresh="no">
    <apple/>
    <carrot/>
  </juice>
  <juice sugar="no" fresh="no">
    <apple/>
    <carrot/>
  </juice>
  <juice sugar="no" fresh="no">
    <carrot/>
    <apple/>
  </juice>
  <juice>
    <carrot kind="village" />
    <orange/>
  </juice>
  <juice>
    <carrot kind="village" />
    <orange/>
    <carrot kind="village" />
  </juice>
</menu>

I need to throw an exception if my XML contains identical elements under a common parent. The attributes matter as well and must also be equal.

The order of the children should not matter. That means that, in my example, an exception should be thrown for:

  <juice sugar="no" fresh="no">
    <apple/>
    <carrot/>
  </juice>
  <juice sugar="no" fresh="no">
    <carrot/>
    <apple/>
  </juice>

because both juice elements have exactly the same attributes, and the children of the second juice are the same as the children of the first.

Another place where an exception should be thrown is:

  <juice>
    <carrot kind="village" />
    <orange/>
    <carrot kind="village" />
  </juice>

because this juice contains the same carrot twice.

I would appreciate any hints on this problem. Should I use XSLT? Or would it perhaps be better to deserialize my XML in C#?

4 Answers:

Answer 0 (score: 2):

I would start by transforming each element into a canonical form in which the child elements are sorted into order; that is easy to do with XSLT. The result should be that two elements are equal under your rules if and only if their canonical forms are deep-equal in the sense of XPath 2.0's deep-equal().

I would then write a function that computes some kind of hash code for each element (so that "equal" elements get the same hash code) and group on that hash code. Again this is easily done with XSLT 2.0: the only difficult part is designing the hash function. I suspect your example does not show the real data, and I would want to see the real data before suggesting a hash function.

Then, within each hash-code group, you can use XSLT 2.0's deep-equal() function to compare the members of the group, to eliminate the cases where the hash-code match is spurious.
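
As an illustration of the first step only, here is a minimal, hypothetical sketch of the canonicalisation in XSLT 2.0. It is not part of the original answer; the function names and the urn:X-Viktoria namespace are borrowed from the stylesheet in the next answer. The sort key covers only the element name and its attributes, which is enough for one level of nesting as in the sample data; a fully general key would also have to fold the canonical form of the descendants into the fingerprint, and the sketch ignores text nodes.

<?xml version="1.0" encoding="US-ASCII"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                xmlns:v="urn:X-Viktoria" exclude-result-prefixes="v xsd"
                version="2.0">

<xsl:output indent="yes"/>

<!--a sortable fingerprint built from the element name and its attributes,
    with the attributes sorted by name so that attribute order cannot matter-->
<xsl:function name="v:fingerprint" as="xsd:string">
  <xsl:param name="e" as="element()"/>
  <xsl:variable name="atts" as="xsd:string*">
    <xsl:perform-sort select="for $a in $e/@* return concat(name($a), '=', $a)">
      <xsl:sort select="."/>
    </xsl:perform-sort>
  </xsl:variable>
  <xsl:sequence select="concat(name($e), '[', string-join($atts, '|'), ']')"/>
</xsl:function>

<!--rebuild an element with its attributes copied and its children canonicalised
    recursively and sorted by fingerprint, so that sibling order no longer matters-->
<xsl:function name="v:canonical" as="element()">
  <xsl:param name="e" as="element()"/>
  <xsl:element name="{name($e)}" namespace="{namespace-uri($e)}">
    <xsl:copy-of select="$e/@*"/>
    <xsl:perform-sort select="for $c in $e/* return v:canonical($c)">
      <xsl:sort select="v:fingerprint(.)"/>
    </xsl:perform-sort>
  </xsl:element>
</xsl:function>

<!--emit the canonical form of the whole document, ready for deep-equal() comparison-->
<xsl:template match="/">
  <xsl:sequence select="v:canonical(*)"/>
</xsl:template>

</xsl:stylesheet>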

Answer 1 (score: 1):

The XSLT 2.0 solution below happens to work for your data set. If you have more data to run it against, that would help test its robustness.

t:\ftemp>type viktoria.xml 
<?xml version="1.0" encoding="UTF-8"?>
<menu>
  <juice sugar="yes" fresh="no">
    <apple/>
    <carrot/>
  </juice>
  <juice sugar="no" fresh="no">
    <apple/>
    <carrot/>
  </juice>
  <juice sugar="no" fresh="no">
    <carrot/>
    <apple/>
  </juice>
  <juice>
    <carrot kind="village" />
    <orange/>
  </juice>
  <juice>
    <carrot kind="village" />
    <orange/>
    <carrot kind="village" />
  </juice>
</menu>

t:\ftemp>call xslt2 viktoria.xml viktoria.xsl 
<?xml version="1.0" encoding="UTF-8"?>
<exceptions>
   <duplicates>
      <juice sugar="no" fresh="no">
         <apple/>
         <carrot/>
      </juice>
      <juice sugar="no" fresh="no">
         <carrot/>
         <apple/>
      </juice>
   </duplicates>
   <children>
      <juice>
         <carrot kind="village"/>
         <orange/>
         <carrot kind="village"/>
      </juice>
   </children>
</exceptions>

t:\ftemp>type viktoria.xsl 
<?xml version="1.0" encoding="US-ASCII"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                xmlns:v="urn:X-Viktoria" exclude-result-prefixes="v xsd"
                version="2.0">

<xsl:output indent="yes"/>

<!--return true if the two elements and their attributes are the same while
    ignoring children-->
<xsl:function name="v:shallow-equal" as="xsd:boolean">
  <xsl:param name="elem1" as="element()"/>
  <xsl:param name="elem2" as="element()"/>
  <xsl:sequence select="node-name($elem1)=node-name($elem2) and
    ( every $a1 in $elem1/@* satisfies ( some $a2 in $elem2/@* satisfies 
        ( node-name($a1)=node-name($a2)  and $a1 = $a2 ) ) ) and
    ( every $a2 in $elem2/@* satisfies ( some $a1 in $elem1/@* satisfies 
        ( node-name($a1)=node-name($a2)  and $a1 = $a2 ) ) )"/>
</xsl:function>

<!--return true if two elements have the same children with the same attributes
    while ignoring the children's children-->
<xsl:function name="v:element-and-children-equal" as="xsd:boolean">
  <xsl:param name="elem1" as="element()"/>
  <xsl:param name="elem2" as="element()"/>
  <xsl:sequence
    select="v:shallow-equal($elem1,$elem2) and
            ( every $child1 in $elem1/* satisfies 
                count( $elem2/*[deep-equal(.,$child1)] )=1 ) and
            ( every $child2 in $elem2/* satisfies 
                count( $elem1/*[deep-equal(.,$child2)] )=1 )"/>
</xsl:function>

<!--produce result-->
<xsl:template match="menu">
  <exceptions>
    <duplicates>
      <!--find each element that has a sibling with same children, that is,
          there is more than one such element amongst all siblings-->
      <xsl:for-each 
        select="*[ for $this in . return
              count ( ../*[v:element-and-children-equal(.,$this)] ) > 1 ]">
       <xsl:copy-of select="."/>
      </xsl:for-each>
    </duplicates>
    <children>
      <!--find each element that has duplicate children, that is,
          there is more than one of each child amongst all children-->
      <xsl:for-each
        select="*[ some $child in * satisfies
                   count ( *[deep-equal(.,$child)] ) >1 ]">
       <xsl:copy-of select="."/>
      </xsl:for-each>
    </children>
  </exceptions>
</xsl:template>

</xsl:stylesheet>
t:\ftemp>rem Done! 

Answer 2 (score: 0):

I don't know how XSLT processors are implemented, so I don't know the complexity of a solution written in XSLT.

In C# you have better control over the complexity, and if the API of the parsed tree is convenient you can also express a better algorithm cleanly.

If we are free to sort the nodes, then we need neither a non-deterministic hash function nor deep comparisons. We can use the order of the nodes on each level of the tree as the hash function. That is not how hash functions are usually built, but it is deterministic and works in this case (or so I hope).

Well, the code below is about neither XML nor C#; it is only meant to present the algorithm, so please treat it as a kind of pseudocode. (It is written in C++ and compiles with g++, but I have not tested it carefully, only with a few very simple tests.)

#include<algorithm>
#include<cstdio>
#include<map>
#include<string>
#include<utility>
#include<vector>

using ::std::make_pair;
using ::std::map;
using ::std::pair;
using ::std::sort;
using ::std::stable_sort;
using ::std::string;
using ::std::vector;

// this is not XML, only some tree with strings to present the algorithm
// moreover:
// we assume that all those strings are short and we can compare them in
// in constant time
// if not, then we could also sort them using some faster algorithm
struct Node {
  string text;
  vector<Node*> children;
  Node() {}  // text default-constructs to an empty string
  explicit Node(string t): text(t) {}
};

struct TextComparer {
  bool operator() (Node* n1, Node* n2) {
    return n1->text < n2->text;
  }
};

struct ChildComparer {
  int child;
  map<Node*, int> &hash;
  ChildComparer(int child_number, map<Node*, int> &children_hash):
      child(child_number), hash(children_hash) {}
  bool operator() (Node* n1, Node* n2) {
    return hash[n1->children[child]] < hash[n2->children[child]];
  }
};

struct AllEqComparer {
  map<Node*, int> &hash;
  AllEqComparer(map<Node*, int> &children_hash): hash(children_hash) {}
  bool eq(Node* n1, Node* n2) {
    if (n1->text != n2->text || n1->children.size() != n2->children.size())
      return false;
    size_t i;
    for (i = 0; i < n1->children.size(); ++i)
      if (hash[n1->children[i]] != hash[n2->children[i]])
        return false;
    return true;
  }
};

struct HashComparer {
  map<Node*, int> &hash;
  HashComparer(map<Node*, int> &node_hash): hash(node_hash) {}
  bool operator() (Node* n1, Node* n2) {
    return hash[n1] < hash[n2];
  }
  bool eq(Node* n1, Node* n2) {
    return hash[n1] == hash[n2];
  }
};


// split the tree on levels
void dfs_levels(Node *root, vector<vector<Node*> > &levels) {
  vector<pair<Node*, int> > stack;
  stack.push_back(make_pair(root, 0));
  Node *node;
  size_t depth, i;
  while (!stack.empty()) {
    node = stack.back().first;
    depth = stack.back().second;
    stack.pop_back();
    // 'if' would be sufficient
    while (levels.size() <= depth) levels.push_back(vector<Node*>());
    levels[depth].push_back(node);
    for (i = 0; i < node->children.size(); ++i)
      stack.push_back(make_pair(node->children[i], depth + 1));
  }
}

// sort in alphabetical order
void sort_level(vector<Node*> &level, map<Node*, int> &hash) {
  // it is convenient to sort each length (i.e. number of children) separately
  // then we don't need to care about complexity of comparing non existing
  // elements
  // e.g. one long sequence and a lot of short sequences gives us O(n^2) non
  // existing elements (where n is sum of the lengths)
  vector<vector<Node*> > buckets;
  for (size_t i = 0; i < level.size(); ++i) {
    while (buckets.size() <= level[i]->children.size())
      buckets.push_back(vector<Node*>());
    buckets[level[i]->children.size()].push_back(level[i]);
  }
  size_t j;
  for (size_t i = 0; i < buckets.size(); ++i) {
    sort(buckets[i].begin(), buckets[i].end(), TextComparer());
    for (j = 0; j < i; ++j)
      stable_sort(buckets[i].begin(), buckets[i].end(), ChildComparer(j, hash));
  }
  size_t k = 0;
  for (size_t i = 0; i < buckets.size(); ++i)
    for (j = 0; j < buckets[i].size(); ++j)
      level[k++] = buckets[i][j];
}


bool is_duplicate_inside(Node *root) {
  vector<vector<Node*> > levels;
  dfs_levels(root, levels);
  map<Node*, int> hash;
  map<Node*, int> new_hash;
  size_t h , j, k;
  for (int i = static_cast<int>(levels.size()) - 1; i >= 0; --i) {
    // For every level of nodes (we count from root) we generate a new
    // hash without collisions

    // we use hash (computed in the previous iteration) for the lower
    // level of nodes
    // in the first iteration (the lowest level) the hash is empty

    // here we sort children of every node on the current level
    HashComparer hash_cmp(hash);
    for (j = 0; j < levels[i].size(); ++j) {
      sort(levels[i][j]->children.begin(), levels[i][j]->children.end(),
          hash_cmp);
      for (k = 1; k < levels[i][j]->children.size(); ++k)
        if (hash_cmp.eq(levels[i][j]->children[k - 1],
            levels[i][j]->children[k]))
          return true;
    }

    // here we sort all nodes on the current level in order to calculate the
    // new hash
    sort_level(levels[i], hash);
    AllEqComparer all_cmp(hash);
    h = 0;
    new_hash[levels[i][0]] = h;
    // if two nodes are the same then they get the same hash
    // if not then we increment the current hash
    for (j = 1; j < levels[i].size(); ++j) {
      if (!all_cmp.eq(levels[i][j - 1], levels[i][j])) ++h;
      new_hash[levels[i][j]] = h;
    }
    hash = new_hash;
    new_hash.clear();
  }
  return false;
}

int main() {
  Node r("food");
  printf("%s\n", is_duplicate_inside(&r) ? "true" : "false");  // false
  Node n1("fruit");
  r.children.push_back(&n1);
  printf("%s\n", is_duplicate_inside(&r) ? "true" : "false");  // false
  Node n2("fruit");
  r.children.push_back(&n2);
  printf("%s\n", is_duplicate_inside(&r) ? "true" : "false");  // true
  Node n3("orange");
  n2.children.push_back(&n3);
  printf("%s\n", is_duplicate_inside(&r) ? "true" : "false");  // false
  Node n4("fruit");
  r.children.push_back(&n4);
  printf("%s\n", is_duplicate_inside(&r) ? "true" : "false");  // true
  Node n5("orange");
  n4.children.push_back(&n5);
  printf("%s\n", is_duplicate_inside(&r) ? "true" : "false");  // true
  Node n6("banana");
  n2.children.push_back(&n6);
  printf("%s\n", is_duplicate_inside(&r) ? "true" : "false");  // false
  Node n7("banana");
  n3.children.push_back(&n7);
  printf("%s\n", is_duplicate_inside(&r) ? "true" : "false");  // false
  Node n8("banana");
  n4.children.push_back(&n8);
  printf("%s\n", is_duplicate_inside(&r) ? "true" : "false");  // false
  Node n9("orange");
  n4.children.push_back(&n9);
  printf("%s\n", is_duplicate_inside(&r) ? "true" : "false");  // true
  Node n10("banana");
  n9.children.push_back(&n10);
  printf("%s\n", is_duplicate_inside(&r) ? "true" : "false");  // false
  Node n11("orange");
  n2.children.push_back(&n11);
  printf("%s\n", is_duplicate_inside(&r) ? "true" : "false");  // true
  return 0;
}

Answer 3 (score: 0):

Thank you all for your replies. I solved my problem with C# and the XmlDocument class. Fortunately, it turned out that checking only the nodes that have certain attributes was enough, so no deserialization was needed. I simply checked the descendants of each node recursively.
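
For illustration only, below is a hypothetical sketch of the kind of recursive XmlDocument check described above. The actual code was not posted; the class name, the signature-based comparison and the file name are assumptions, and this sketch checks every element rather than only the nodes with certain attributes.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml;

static class DuplicateChecker
{
    // Build an order-insensitive signature from the element name, its attributes
    // sorted by name, and the signatures of its child elements sorted as strings.
    static string Signature(XmlElement e)
    {
        var attrs = e.Attributes.Cast<XmlAttribute>()
                     .OrderBy(a => a.Name)
                     .Select(a => a.Name + "=" + a.Value);
        var children = e.ChildNodes.OfType<XmlElement>()
                        .Select(Signature)
                        .OrderBy(s => s);
        return e.Name + "[" + string.Join(",", attrs) + "]{" + string.Join(",", children) + "}";
    }

    // Recursively throw if any element has two children with the same signature,
    // i.e. the same name, attributes and (order-insensitive) descendants.
    static void CheckForDuplicates(XmlElement parent)
    {
        var seen = new HashSet<string>();
        foreach (var child in parent.ChildNodes.OfType<XmlElement>())
        {
            if (!seen.Add(Signature(child)))
                throw new InvalidOperationException(
                    "Duplicate element under <" + parent.Name + ">: " + child.OuterXml);
            CheckForDuplicates(child);
        }
    }

    static void Main()
    {
        var doc = new XmlDocument();
        doc.Load("menu.xml");  // illustrative file name
        CheckForDuplicates(doc.DocumentElement);
        Console.WriteLine("No duplicates found.");
    }
}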