I need to find the places in an XML document where elements are duplicated. For example, given XML like the following:
<menu>
  <juice sugar="yes" fresh="no">
    <apple/>
    <carrot/>
  </juice>
  <juice sugar="no" fresh="no">
    <apple/>
    <carrot/>
  </juice>
  <juice sugar="no" fresh="no">
    <carrot/>
    <apple/>
  </juice>
  <juice>
    <carrot kind="village" />
    <orange/>
  </juice>
  <juice>
    <carrot kind="village" />
    <orange/>
    <carrot kind="village" />
  </juice>
</menu>
If my XML contains identical elements under a common parent, I need to throw an exception. Attributes matter as well and must also be equal; the order of the descendants, however, should not matter. This means that in my example an exception should be thrown for:
<juice sugar="no" fresh="no">
  <apple/>
  <carrot/>
</juice>
<juice sugar="no" fresh="no">
  <carrot/>
  <apple/>
</juice>
because both juice elements have exactly the same attributes, and the children of the second juice are the same as the children of the first.
Another place where an exception should be thrown is:
<juice>
  <carrot kind="village" />
  <orange/>
  <carrot kind="village" />
</juice>
because the juice contains the same carrot twice.
I would be grateful for any hints on this problem. Should I use XSLT, or would it perhaps be better to deserialize my XML in C#?
Answer 0 (score: 2)
I would start by converting each element to a canonical form, with the child elements sorted into a fixed order; this is easily done in XSLT. The result should be that two elements are equal under your rules if and only if their canonical forms are deep-equal() in the XPath 2.0 sense.
Then I would write a function to compute some kind of hash code for each element (so that "equal" elements get the same hash code) and group on that hash code. Again this is easily done with XSLT 2.0; the only difficulty is designing the hash function. I suspect your example does not show the real data, and I would want to see the real data before suggesting a hash function.
Then, within each hash-code group, you can use XSLT 2.0's deep-equal() function to compare the members of the group, to eliminate cases where the hash-code match is spurious.
Answer 1 (score: 1)
The XSLT 2.0 solution below happens to work for your data set as given. If you have more data to run it against, that would help test its robustness.
t:\ftemp>type viktoria.xml
<?xml version="1.0" encoding="UTF-8"?>
<menu>
  <juice sugar="yes" fresh="no">
    <apple/>
    <carrot/>
  </juice>
  <juice sugar="no" fresh="no">
    <apple/>
    <carrot/>
  </juice>
  <juice sugar="no" fresh="no">
    <carrot/>
    <apple/>
  </juice>
  <juice>
    <carrot kind="village" />
    <orange/>
  </juice>
  <juice>
    <carrot kind="village" />
    <orange/>
    <carrot kind="village" />
  </juice>
</menu>
t:\ftemp>call xslt2 viktoria.xml viktoria.xsl
<?xml version="1.0" encoding="UTF-8"?>
<exceptions>
  <duplicates>
    <juice sugar="no" fresh="no">
      <apple/>
      <carrot/>
    </juice>
    <juice sugar="no" fresh="no">
      <carrot/>
      <apple/>
    </juice>
  </duplicates>
  <children>
    <juice>
      <carrot kind="village"/>
      <orange/>
      <carrot kind="village"/>
    </juice>
  </children>
</exceptions>
t:\ftemp>type viktoria.xsl
<?xml version="1.0" encoding="US-ASCII"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                xmlns:v="urn:X-Viktoria" exclude-result-prefixes="v xsd"
                version="2.0">

<xsl:output indent="yes"/>

<!--return true if the two elements and their attributes are the same while
    ignoring children-->
<xsl:function name="v:shallow-equal" as="xsd:boolean">
  <xsl:param name="elem1" as="element()"/>
  <xsl:param name="elem2" as="element()"/>
  <xsl:sequence select="node-name($elem1)=node-name($elem2) and
      ( every $a1 in $elem1/@* satisfies ( some $a2 in $elem2/@* satisfies
        ( node-name($a1)=node-name($a2) and $a1 = $a2 ) ) ) and
      ( every $a2 in $elem2/@* satisfies ( some $a1 in $elem1/@* satisfies
        ( node-name($a1)=node-name($a2) and $a1 = $a2 ) ) )"/>
</xsl:function>

<!--return true if two elements have the same children with the same attributes
    while ignoring the children's children-->
<xsl:function name="v:element-and-children-equal" as="xsd:boolean">
  <xsl:param name="elem1" as="element()"/>
  <xsl:param name="elem2" as="element()"/>
  <xsl:sequence select="v:shallow-equal($elem1,$elem2) and
      ( every $child1 in $elem1/* satisfies
        count( $elem2/*[deep-equal(.,$child1)] )=1 ) and
      ( every $child2 in $elem2/* satisfies
        count( $elem1/*[deep-equal(.,$child2)] )=1 )"/>
</xsl:function>

<!--produce result-->
<xsl:template match="menu">
  <exceptions>
    <duplicates>
      <!--find each element that has a sibling with the same children, that is,
          there is more than one such element amongst all siblings-->
      <xsl:for-each select="*[ for $this in . return
          count( ../*[v:element-and-children-equal(.,$this)] ) > 1 ]">
        <xsl:copy-of select="."/>
      </xsl:for-each>
    </duplicates>
    <children>
      <!--find each element that has duplicate children, that is,
          there is more than one of each child amongst all children-->
      <xsl:for-each select="*[ some $child in * satisfies
          count( *[deep-equal(.,$child)] ) > 1 ]">
        <xsl:copy-of select="."/>
      </xsl:for-each>
    </children>
  </exceptions>
</xsl:template>

</xsl:stylesheet>
t:\ftemp>rem Done!
Answer 2 (score: 0)
I do not know how XSLT processors are implemented, so I cannot judge the complexity of a solution written in XSLT.
In C# you have better control over the complexity, and if the parse-tree API is convenient you also get a better notation for the algorithm.
If we sort the nodes anyway, then we need neither a non-deterministic hash function nor deep comparison: we can use the order of the nodes on each level of the tree as the hash function. That is not how hash functions are usually built, but it is deterministic and works in this case (or so I hope).
Well, the code below is not about XML, nor is it C#; I just want to present the algorithm, so please treat it as a kind of pseudocode. (It is written in C++ and compiles with g++, but I have not tested it carefully, only with very simple tests.)
#include <algorithm>
#include <cstdio>
#include <map>
#include <string>
#include <utility>
#include <vector>

using ::std::make_pair;
using ::std::map;
using ::std::pair;
using ::std::sort;
using ::std::stable_sort;
using ::std::string;
using ::std::vector;

// this is not XML, only some tree with strings to present the algorithm
// moreover:
// we assume that all those strings are short and we can compare them in
// constant time
// if not, then we could also sort them using some faster algorithm
struct Node {
  string text;
  vector<Node*> children;
  Node() {}
  explicit Node(string t) : text(t) {}
};
struct TextComparer {
  bool operator() (Node* n1, Node* n2) {
    return n1->text < n2->text;
  }
};

struct ChildComparer {
  int child;
  map<Node*, int> &hash;
  ChildComparer(int child_number, map<Node*, int> &children_hash)
      : child(child_number), hash(children_hash) {}
  bool operator() (Node* n1, Node* n2) {
    return hash[n1->children[child]] < hash[n2->children[child]];
  }
};

struct AllEqComparer {
  map<Node*, int> &hash;
  AllEqComparer(map<Node*, int> &children_hash) : hash(children_hash) {}
  bool eq(Node* n1, Node* n2) {
    if (n1->text != n2->text || n1->children.size() != n2->children.size())
      return false;
    size_t i;
    for (i = 0; i < n1->children.size(); ++i)
      if (hash[n1->children[i]] != hash[n2->children[i]])
        return false;
    return true;
  }
};

struct HashComparer {
  map<Node*, int> &hash;
  HashComparer(map<Node*, int> &node_hash) : hash(node_hash) {}
  bool operator() (Node* n1, Node* n2) {
    return hash[n1] < hash[n2];
  }
  bool eq(Node* n1, Node* n2) {
    return hash[n1] == hash[n2];
  }
};
// split the tree into levels
void dfs_levels(Node *root, vector<vector<Node*> > &levels) {
  vector<pair<Node*, int> > stack;
  stack.push_back(make_pair(root, 0));
  Node *node;
  size_t depth, i;
  while (!stack.empty()) {
    node = stack.back().first;
    depth = stack.back().second;
    stack.pop_back();
    // 'if' would be sufficient
    while (levels.size() <= depth) levels.push_back(vector<Node*>());
    levels[depth].push_back(node);
    for (i = 0; i < node->children.size(); ++i)
      stack.push_back(make_pair(node->children[i], depth + 1));
  }
}

// sort one level in lexicographical order
void sort_level(vector<Node*> &level, map<Node*, int> &hash) {
  // it is convenient to sort each length (i.e. number of children) separately;
  // then we don't need to care about the cost of comparing non-existing
  // elements
  // e.g. one long sequence and a lot of short sequences gives us O(n^2)
  // non-existing elements (where n is the sum of the lengths)
  vector<vector<Node*> > buckets;
  for (size_t i = 0; i < level.size(); ++i) {
    while (buckets.size() <= level[i]->children.size())
      buckets.push_back(vector<Node*>());
    buckets[level[i]->children.size()].push_back(level[i]);
  }
  size_t j;
  for (size_t i = 0; i < buckets.size(); ++i) {
    sort(buckets[i].begin(), buckets[i].end(), TextComparer());
    for (j = 0; j < i; ++j)
      stable_sort(buckets[i].begin(), buckets[i].end(), ChildComparer(j, hash));
  }
  size_t k = 0;
  for (size_t i = 0; i < buckets.size(); ++i)
    for (j = 0; j < buckets[i].size(); ++j)
      level[k++] = buckets[i][j];
}
bool is_duplicate_inside(Node *root) {
  vector<vector<Node*> > levels;
  dfs_levels(root, levels);
  map<Node*, int> hash;
  map<Node*, int> new_hash;
  size_t h, j, k;
  for (int i = static_cast<int>(levels.size()) - 1; i >= 0; --i) {
    // For every level of nodes (we count from the root) we generate a new
    // hash without collisions
    // we use the hash (computed in the previous iteration) for the lower
    // level of nodes
    // in the first iteration (the lowest level) the hash is empty
    // here we sort the children of every node on the current level
    HashComparer hash_cmp(hash);
    for (j = 0; j < levels[i].size(); ++j) {
      sort(levels[i][j]->children.begin(), levels[i][j]->children.end(),
           hash_cmp);
      for (k = 1; k < levels[i][j]->children.size(); ++k)
        if (hash_cmp.eq(levels[i][j]->children[k - 1],
                        levels[i][j]->children[k]))
          return true;
    }
    // here we sort all nodes on the current level in order to calculate the
    // new hash
    sort_level(levels[i], hash);
    AllEqComparer all_cmp(hash);
    h = 0;
    new_hash[levels[i][0]] = h;
    // if two nodes are the same then they get the same hash
    // if not then we increment the current hash
    for (j = 1; j < levels[i].size(); ++j) {
      if (!all_cmp.eq(levels[i][j - 1], levels[i][j])) ++h;
      new_hash[levels[i][j]] = h;
    }
    hash = new_hash;
    new_hash.clear();
  }
  return false;
}
int main() {
  Node r("food");
  printf("%s\n", is_duplicate_inside(&r) ? "true" : "false");  // false
  Node n1("fruit");
  r.children.push_back(&n1);
  printf("%s\n", is_duplicate_inside(&r) ? "true" : "false");  // false
  Node n2("fruit");
  r.children.push_back(&n2);
  printf("%s\n", is_duplicate_inside(&r) ? "true" : "false");  // true
  Node n3("orange");
  n2.children.push_back(&n3);
  printf("%s\n", is_duplicate_inside(&r) ? "true" : "false");  // false
  Node n4("fruit");
  r.children.push_back(&n4);
  printf("%s\n", is_duplicate_inside(&r) ? "true" : "false");  // true
  Node n5("orange");
  n4.children.push_back(&n5);
  printf("%s\n", is_duplicate_inside(&r) ? "true" : "false");  // true
  Node n6("banana");
  n2.children.push_back(&n6);
  printf("%s\n", is_duplicate_inside(&r) ? "true" : "false");  // false
  Node n7("banana");
  n3.children.push_back(&n7);
  printf("%s\n", is_duplicate_inside(&r) ? "true" : "false");  // false
  Node n8("banana");
  n4.children.push_back(&n8);
  printf("%s\n", is_duplicate_inside(&r) ? "true" : "false");  // false
  Node n9("orange");
  n4.children.push_back(&n9);
  printf("%s\n", is_duplicate_inside(&r) ? "true" : "false");  // true
  Node n10("banana");
  n9.children.push_back(&n10);
  printf("%s\n", is_duplicate_inside(&r) ? "true" : "false");  // false
  Node n11("orange");
  n2.children.push_back(&n11);
  printf("%s\n", is_duplicate_inside(&r) ? "true" : "false");  // true
  return 0;
}
Answer 3 (score: 0)
Thanks everyone for the replies. I solved my problem in C# with the XmlDocument class. Fortunately, it turned out to be enough to check only the nodes with certain attributes, so no deserialization was needed. I simply checked the descendants of each node recursively.