Question

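I have a large file in which some elements appear twice, and I would like to remove the duplicates. Any ideas on how I could do this? Any help is much appreciated!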

The XML looks like this:

<Toptag>
<text coordinates="" country="" date="yyyy-mm-dd" lang="" place="xyc" time="" id=" 123"  name="xyz" >
<div>
This is text
</div>
</text>
<text coordinates="" country="" date="yyyy-mm-dd" lang="" place="xyc" 
time="" id=" 124"  name="xyz" >
<div>
This is text
</div>
</text>
<text coordinates="" country="" date="yyyy-mm-dd" lang="" place="xyc"         time="" id=" 123"  name="xyz" >
<div>
This is text
</div>
</text>
....
</Toptag>

In the duplicates, everything inside <text...............> <div> </div> </text> is exactly identical!

Thanks!

Answer 1

Assuming you are using at least XSLT 2, you have access to the deep-equal function (https://www.w3.org/TR/xpath-functions/#func-deep-equal), so you can write an empty template

  <xsl:template match="Toptag/text[some $sib in preceding-sibling::text satisfies deep-equal(., $sib)]"/>

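and use it together with the identity transformation (for example, in XSLT 3 with an appropriate xsl:mode declaration, or in XSLT 2 by spelling it out):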

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="3.0">

  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:template match="Toptag/text[some $sib in preceding-sibling::text satisfies deep-equal(., $sib)]"/>

</xsl:stylesheet>

That way, any text element that has a deep-equal preceding sibling text element is not copied: https://xsltfiddle.liberty-development.net/94hvTzF
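
For XSLT 2, where xsl:mode is not available, the identity transformation has to be spelled out as an explicit template. The following is a minimal sketch of what that could look like; it reuses the empty template from above unchanged, and only the identity template is added here:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">

  <!-- Identity template: copies every node and attribute unchanged. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Empty template: outputs nothing for a text element that is
       deep-equal to one of its preceding sibling text elements. -->
  <xsl:template match="Toptag/text[some $sib in preceding-sibling::text satisfies deep-equal(., $sib)]"/>

</xsl:stylesheet>

The empty template wins over the identity template because a pattern with a predicate has a higher default priority than node(). Note that each text element is compared against all of its preceding siblings, so on a very large file this approach may become slow.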

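Obviously, the condition in the predicate can be adjusted to check all preceding nodes rather than only the preceding siblings.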
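
As a sketch of one such adjustment, assuming the duplicate text elements might sit under different parents (which is not the case in the sample from the question), the preceding axis could be used instead of preceding-sibling:

  <!-- Variant: compares against every earlier text element in the document,
       not just earlier siblings under the same parent. -->
  <xsl:template match="text[some $t in preceding::text satisfies deep-equal(., $t)]"/>

This template would replace the empty template in the stylesheet above; the identity transformation stays the same.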

Answer 2

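If you can define a function f:signature(element(text)) that returns the same value for two elements if and only if they are equal, then you can use XSLT 2.0 grouping to eliminate the duplicates: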

<xsl:for-each-group select="text" group-by="f:signature(.)">
  <xsl:copy-of select="current-group()[1]"/>
</xsl:for-each-group>

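If the elements have widely varying structure, the signature function may be difficult to write. But if they are all very similar (as in your example), you could use, for example,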

<xsl:function name="f:signature" as="xs:string">
  <xsl:param name="e" as="element(text)"/>
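  <!-- @time, @id and @name are included so that elements differing only in one of them (such as id 123 vs. 124) get different signatures. -->
  <xsl:sequence select="string-join($e!(@coordinates, @country, @date, @lang, @place, @time, @id, @name, string(.)), '|')"/>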
</xsl:function>

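Note: I use the XSLT 3.0 "!" operator because you don't want the attributes to be sorted into document order (the document order of attributes is unpredictable). In 2.0, where "!" is not available, you can spell it out as ($e/@coordinates, $e/@country, $e/@date, ...).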
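
To show how the grouping and the signature function might fit together, here is a sketch of a complete XSLT 2.0 stylesheet. The namespace URI bound to the f prefix is a placeholder chosen for illustration, and the template matching Toptag is an assumption about where the grouping would be applied. The signature here includes @time, @id and @name as well, since the text elements in the question's sample differ only in their id:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="http://example.com/local-functions"
    exclude-result-prefixes="xs f"
    version="2.0">

  <!-- Builds one string per text element; equal elements yield equal strings.
       The attributes are listed individually (XSLT 2.0 has no "!" operator)
       so that they are joined in this fixed order. -->
  <xsl:function name="f:signature" as="xs:string">
    <xsl:param name="e" as="element(text)"/>
    <xsl:sequence select="string-join(($e/@coordinates, $e/@country, $e/@date, $e/@lang, $e/@place, $e/@time, $e/@id, $e/@name, string($e)), '|')"/>
  </xsl:function>

  <!-- Copies Toptag and, for each distinct signature, only the first text child. -->
  <xsl:template match="Toptag">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:for-each-group select="text" group-by="f:signature(.)">
        <xsl:copy-of select="current-group()[1]"/>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

This sketch copies only the text children of Toptag, so any other child nodes would be dropped. It also assumes that every listed attribute is present on each text element and that no value contains the '|' separator; otherwise two different elements could end up with the same signature.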

Removing duplicate elements in XML