Question

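I am trying to produce a CSV from XML. The names of the leaf elements across the whole XML form the header row, and the corresponding text values form the data rows. If a given leaf element is absent from a node, a blank value should be printed. The sample XML and output below show what I am trying to do.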

Input XML:

<?xml version="1.0" encoding="utf-8"?>
<itemList>
    <item>
        <userID>123</userID>
        <userName>ABC</userName>
        <orders SINGLE="Y">
            <order>
                <orderID>0000377T</orderID>
                <orderType>online</orderType>
            </order>
        </orders>
        <details SINGLE="Y">
            <detail>
                <color>black</color>
                <make>pluto</make>
            </detail>
        </details>
        <addresses SINGLE="N">
            <address>
                <addrID>000111NR</addrID>
                <addrName>HOME</addrName>
            </address>
            <address>
                <addrID>000111ST</addrID>
                <addrName>OFFICE</addrName>
                <comment>HQ</comment>
            </address>
        </addresses>
    </item>
    <item>
        <userID>456</userID>
        <userName>DEF</userName>
        <orders SINGLE="Y">
            <order>
                <orderID>0000377T</orderID>
                <orderType>phone</orderType>
            </order>
        </orders>
        <details SINGLE="Y">
            <detail>
                <color>red</color>
            </detail>
        </details>
        <addresses SINGLE="N">
            <address>
                <addrID>000222NR</addrID>
                <addrName>HOME</addrName>
            </address>
            <address>
                <delivery>am</delivery>
                <addrID>000222ST</addrID>
                <addrName>OFFICE</addrName>
            </address>
        </addresses>
    </item>
</itemList>

Expected output:

userID,userName,orderID,orderType,color,make,addrID,addrName,addrID,addrName,comment,delivery
123,ABC,0000377T,online,black,pluto,000111NR,HOME,000111ST,OFFICE,HQ,
456,DEF,0000377T,phone,red,,000222NR,HOME,000222ST,OFFICE,,am

The XSLT I have been able to build so far:

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
      <xsl:output method="text" />
      <xsl:strip-space elements="*" />
      <xsl:variable name="newLine" select="'&#xA;'" />
      <xsl:variable name="delimiter" select="','" />
      <xsl:key name="field" match="//*[not(*)]" use="local-name()" />
      <xsl:variable name="allFields" select="//*[generate-id()=generate-id(key('field', local-name())[1])]" />
      <xsl:template match="/">
<!-- print the header line -->
        <xsl:for-each select="$allFields">
          <xsl:value-of select="local-name()" />
          <xsl:if test="position() &lt; last()">
            <xsl:value-of select="$delimiter" />
          </xsl:if>
        </xsl:for-each>
        <xsl:value-of select="$newLine" />
        <xsl:apply-templates />
      </xsl:template>
      <xsl:template match="item">
        <xsl:if test="position()!=1">
          <xsl:value-of select="$newLine" />
        </xsl:if>
        <xsl:apply-templates select="descendant::*[not(*)]" mode="pass" />
      </xsl:template>
      <xsl:template match="*" mode="pass">
        <xsl:if test="position()!=1">
          <xsl:value-of select="$delimiter" />
        </xsl:if>
        <xsl:variable name="this" select="." />
        <xsl:for-each select="$allFields">
          <xsl:value-of select="$this[local-name() = local-name(current())]" />
        </xsl:for-each>
      </xsl:template>
    </xsl:stylesheet>

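Output of the XSLT above when run against the XML shown above:

userID,userName,orderID,orderType,color,make,addrID,addrName,comment,delivery

123,ABC,0000377T,online,black,pluto,000111NR,HOME,000111ST,OFFICE,HQ

456,DEF,0000377T,phone,red,000222NR,HOME,am,000222ST,OFFICE

The problems with this result are:

1. No blank value is printed for leaf elements that are absent.
2. The header row contains only one set of addrID,addrName, whereas my input XML contains two sets.
3. Even though I use strip-space at the top of the XSLT, an empty line is printed after every row of output.

Could you help me get the expected output shown above? Many thanks.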

Answer 1

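Let's start with some flaws in your solution.

You write the header row as the list of all field names without repetitions.

But your sample data shows that one item can contain several leaf nodes (data fields) with the same name, not just one. So a data row can contain more entries than the header row, and you end up with a mess in which you cannot tell which header a particular field belongs to.

So we have to start by assembling the header row correctly.

To find out how many times each field name should be repeated, for each field name you should:

count its occurrences in each item,
take the maximum of those numbers.

As a result we get the reptNums array: the repetition numbers for the respective fields in allFields.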
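
To cross-check the counting rule outside XSLT, here is a minimal Python sketch of the same idea. It is not part of the stylesheet; the trimmed sample document and the function name repetition_numbers are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Trimmed, hypothetical sample in the shape of the question's XML.
SAMPLE = """<itemList>
  <item>
    <userID>123</userID>
    <addresses>
      <address><addrID>A1</addrID></address>
      <address><addrID>A2</addrID><comment>HQ</comment></address>
    </addresses>
  </item>
  <item>
    <userID>456</userID>
    <addresses>
      <address><addrID>B1</addrID></address>
    </addresses>
  </item>
</itemList>"""


def repetition_numbers(root):
    """Map each leaf-element name to its maximum count within a single item."""
    rept = {}  # dicts keep insertion order, so names stay in first-seen order
    for item in root.iter("item"):
        # A leaf element is one without child elements (len(el) == 0).
        counts = Counter(el.tag for el in item.iter() if len(el) == 0)
        for name, n in counts.items():
            rept[name] = max(rept.get(name, 0), n)
    return rept


root = ET.fromstring(SAMPLE)
rept_nums = repetition_numbers(root)
print(rept_nums)
```

Here addrID gets the value 2 because the first item holds two of them, even though the second item holds only one.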

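Let's move on to how each data row is assembled.

The following steps should be performed for each item.

For each field name:

get the leaf nodes with this name, but only those that have the current item as an ancestor,
for each such node print its value and a comma,
if the number of actual values is smaller than the corresponding reptNum, print extra empty values (in practice, just commas).

Text assembled this way ends with a comma after the last field, so it is collected in a variable (row) and the actual output cuts off the last character.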
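
The row-assembly steps above can be sketched in Python as follows. This is only an illustration of the logic, not the stylesheet itself; build_row and the two sample dictionaries are hypothetical names and data.

```python
def build_row(values_by_field, rept_nums):
    """Assemble one CSV row.

    values_by_field: field name -> list of values found in the current item
    rept_nums:       field name -> number of columns reserved for that name
    """
    row = ""
    for name, width in rept_nums.items():
        values = values_by_field.get(name, [])
        # Write each value found, followed by a comma.
        for value in values:
            row += value + ","
        # Pad with bare commas up to the reserved number of columns.
        row += "," * (width - len(values))
    # The text ends with a comma after the last field; cut it off.
    return row[:-1]


rept_nums = {"userID": 1, "addrID": 2, "comment": 1}
full_item = {"userID": ["123"], "addrID": ["A1", "A2"], "comment": ["HQ"]}
sparse_item = {"userID": ["456"], "addrID": ["B1"]}

print(build_row(full_item, rept_nums))
print(build_row(sparse_item, rept_nums))
```

Both rows come out with four columns, so they line up with a header of userID,addrID,addrID,comment.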

The whole solution is given below (in XSLT version 2.0).

I tested it on http://xsltransform.net using the Saxon HE engine.

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <!-- Global variables -->
  <xsl:variable name="newLine" select="'&#xA;'"/>
  <xsl:variable name="comma" select="','"/>

  <!-- Field names (without repetitions) -->
  <xsl:key name="field" match="//*[not(*)]" use="local-name()"/>
  <xsl:variable name="allFields" select="//*[generate-id()=generate-id(key('field', local-name())[1])]" />
  <!-- # of field names -->
  <xsl:variable name="fieldCnt" select="count($allFields)"/>
  <!-- Generated IDs for items -->
  <xsl:variable name="itemIds" select="//item/generate-id()"/>
  <!-- Repetition numbers for field names -->
  <xsl:variable name="reptNums" as="xs:integer*">
    <xsl:for-each select="$allFields">
      <!-- Get fields for current name -->
      <xsl:variable name="fields" select="key('field', local-name())"/>
      <!-- How many times does this field occur in each item? -->
      <xsl:variable name="nums" as="xs:integer*">
        <xsl:for-each select="$itemIds">
          <xsl:variable name="itemId" select="."/>
          <xsl:value-of select="count($fields[generate-id(ancestor::item)=$itemId])"/>
        </xsl:for-each>
      </xsl:variable>
      <!-- Return max value -->
      <xsl:value-of select="xs:integer(max($nums))"/>
    </xsl:for-each>
  </xsl:variable>

  <xsl:template match="/">
    <!-- Create array of header items -->
    <xsl:variable name="headers" as="xs:string*">
      <xsl:for-each select="1 to $fieldCnt">
        <xsl:variable name="index" select="."/>
        <!-- Name of the current field -->
        <xsl:variable name="fieldName" select="$allFields[$index][1]/local-name()"/>
        <!-- Repeat the field name respective number of times -->
        <xsl:for-each select="1 to $reptNums[$index]">
          <xsl:value-of select="$fieldName"/>
        </xsl:for-each>
      </xsl:for-each>
    </xsl:variable>
    <!-- Print the header line -->
    <xsl:value-of select="string-join($headers,',')"/>
    <!-- Now process the items -->
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:template match="item">
    <!-- Terminate the previous row -->
    <xsl:value-of select="$newLine"/>
    <!-- Generate data row for current item -->
    <!-- Generated ID -->
    <xsl:variable name="itemId" select="generate-id()"/>
    <!-- Assemble the output row -->
    <xsl:variable name="row">
      <xsl:for-each select="$allFields">
        <!-- Name of the current field -->
        <xsl:variable name="fieldName" select=".[1]/local-name()"/>
        <!-- Fields for the current name, but only from this item -->
        <xsl:variable name="fields"
          select="key('field', $fieldName)[generate-id(ancestor::item)=$itemId]"/>
        <!-- Write values found -->
        <xsl:for-each select="$fields">
          <xsl:value-of select="."/>
          <xsl:value-of select="$comma"/>
        </xsl:for-each>
        <!-- Which reptNum take for the current field? -->
        <xsl:variable name="index" select="position()"/>
        <!-- Write extra commas -->
        <xsl:for-each select="1 to xs:integer($reptNums[$index] - count($fields))">
          <xsl:value-of select="$comma"/>
        </xsl:for-each>
      </xsl:for-each>
    </xsl:variable>
    <!-- Print the row, but without the last comma -->
    <xsl:value-of select="substring($row, 1, string-length($row) - 1)"/>
  </xsl:template>

</xsl:stylesheet>

Answer 2

In a comment on my first answer you asked about this instruction:

<xsl:variable name="fields"
  select="key('field', $fieldName)[generate-id(ancestor::item)=$itemId]"/>

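Let's start with the variables used:

fieldName - the name of the current field (e.g. userName),
itemId - the ID generated for the current item.

Now let's examine the individual parts, taking the userName field as an example:

key('field', $fieldName) - from the key named field, reads the sequence of all fields stored under userName.

But this sequence contains userName nodes from all items, so we have to narrow the selection with a predicate:

[generate-id(ancestor::item)=$itemId]

Let's examine each part:

ancestor::item - returns the item node containing this userName,
generate-id(...) - gets the node ID generated for that item,
=$itemId - we require that this ID equals the ID of the current item.

As a result we get (as the preceding comment states) the fields with the current name (e.g. userName), but only from this item.

Then, in the following for-each loop, these fields are written out (each followed by a comma).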
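
If it helps to see the same lookup outside XSLT, here is a rough Python analogue. It is only a sketch: field_key, owner and fields_for are hypothetical names, and the two-item sample is a trimmed stand-in for your data.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SAMPLE = """<itemList>
  <item><userID>123</userID><userName>ABC</userName></item>
  <item><userID>456</userID><userName>DEF</userName></item>
</itemList>"""

root = ET.fromstring(SAMPLE)
items = list(root.iter("item"))

# Analogue of the "field" key: leaf-element name -> all such leaves in the document.
field_key = defaultdict(list)
# ElementTree has no ancestor axis, so remember the enclosing item of each leaf.
owner = {}
for item in items:
    for el in item.iter():
        if len(el) == 0:  # leaf element
            field_key[el.tag].append(el)
            owner[el] = item


def fields_for(name, item):
    """Analogue of key('field', $fieldName)[generate-id(ancestor::item)=$itemId]."""
    return [el for el in field_key[name] if owner[el] is item]


all_names = [el.text for el in field_key["userName"]]
second_only = [el.text for el in fields_for("userName", items[1])]
print(all_names, second_only)
```

The unfiltered lookup returns the userName values of both items, while the filtered one keeps only the value belonging to the second item. The identity test (is) plays the role of comparing generate-id() values.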

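Why the script prints extra commas:

Consider the addrID field as an example.

Suppose each item contains at most two addrID fields, so the header row contains 2 addrID headers.

To stay consistent with the header row, we must output two values for every item.

But if, say, an item contains only 1 addrID, then:

we print the single addrID from this item,
there is no second value, so we must print an "empty value", i.e. just a comma.

Otherwise, under the second addrID header you would get the value of the next field, in this case addrName.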
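
The misalignment is easy to see in a small Python sketch that pairs each header with the cell that ends up under it. The item with a single address and the value 000333NR are made up for illustration; cells is a hypothetical helper.

```python
header = ["addrID", "addrID", "addrName"]
widths = {"addrID": 2, "addrName": 1}  # columns reserved per field name
values = {"addrID": ["000333NR"], "addrName": ["HOME"]}  # hypothetical item


def cells(pad):
    """Return the row cells, with or without padding for missing values."""
    out = []
    for name, width in widths.items():
        found = values.get(name, [])
        out.extend(found)
        if pad:
            out.extend([""] * (width - len(found)))
    return out


# zip() pairs each header with the cell that lands in its column.
unpadded = list(zip(header, cells(pad=False)))
padded = list(zip(header, cells(pad=True)))
print(unpadded)
print(padded)
```

Without padding, HOME lands under the second addrID header; with padding, that column stays empty and HOME appears under addrName.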

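Regarding your questions #3 and #4, my advice is:

XSLT is quite likely to run slowly on large volumes of data. My advice is that you move from XSLT 1 to XSLT 2.

Remember that XSLT 1 has a smaller set of features.

E.g. the string-join function was introduced only in XSLT 2. Of course, instead of string-join you can use a for-each loop that prints the comma conditionally (not after the last value).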
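
The conditional-comma loop has the following shape, shown here in Python purely to illustrate the logic (it is not XSLT, and join_conditionally is a hypothetical name). The position test mirrors position() &lt; last() in an XSLT 1 for-each.

```python
def join_conditionally(values, sep=","):
    """Concatenate values, writing the separator after every value but the last."""
    out = ""
    last = len(values) - 1
    for position, value in enumerate(values):
        out += value
        if position < last:  # no separator after the last value
            out += sep
    return out


fields = ["123", "ABC", "", "HQ"]
print(join_conditionally(fields))
```

The result is the same text that a built-in join would give; the loop just spells out the separator handling by hand.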

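But such code runs more slowly, possibly beyond an acceptable threshold.

So there is a risk that you spend a lot of effort rewriting this script in XSLT 1, only to find that it runs too slowly and you have to go back to XSLT 2 anyway.

Another piece of advice concerning large data:

Try this script on a small sample, then on larger and larger samples.

This way you can estimate how long it will take on bigger input data.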

Edit regarding the additional requirement for the allReviews field

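The changes are not very complicated.

The first thing to change is how the header row is created.

Two corrections are needed:

Create the key field without review nodes (the predicate now contains and ./local-name()!='review').

Append a comma and allReviews to the header row.

Now, in the item template, the loop over all fields still collects field values, but without review nodes, because allFields does not contain this name.

The reviews field is added after all the "regular" fields. I relied on your remark that the reviews tag is a sibling of addresses, in other words a direct child of item. This let me use an explicit XPath, which is likely to run faster (a general performance tip is to avoid "//" in XPath).

The last change: since no trailing comma has to be cut off, the output can be written directly (no intermediate variable is needed).

See the full solution below.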

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:variable name="newLine" select="'&#xA;'"/>
  <xsl:variable name="comma" select="','"/>
  <xsl:variable name="pipe" select="'|'"/>

  <xsl:key name="field" match="//*[not(*) and ./local-name()!='review']" use="local-name()"/>
  <xsl:variable name="allFields" select="//*[generate-id()=generate-id(key('field', local-name())[1])]" />
  <xsl:variable name="fieldCnt" select="count($allFields)"/>
  <xsl:variable name="itemIds" select="//item/generate-id()"/>
  <xsl:variable name="reptNums" as="xs:integer*">
    <xsl:for-each select="$allFields">
      <xsl:variable name="fields" select="key('field', local-name())"/>
      <xsl:variable name="nums" as="xs:integer*">
        <xsl:for-each select="$itemIds">
          <xsl:variable name="itemId" select="."/>
          <xsl:value-of select="count($fields[generate-id(ancestor::item)=$itemId])"/>
        </xsl:for-each>
      </xsl:variable>
      <xsl:value-of select="max($nums)"/>
    </xsl:for-each>
  </xsl:variable>

  <xsl:template match="/">
    <xsl:variable name="headers" as="xs:string*">
      <xsl:for-each select="1 to $fieldCnt">
        <xsl:variable name="index" select="."/>
        <xsl:variable name="fieldName" select="$allFields[$index][1]/local-name()"/>
        <xsl:for-each select="1 to $reptNums[$index]">
          <xsl:value-of select="$fieldName"/>
        </xsl:for-each>
      </xsl:for-each>
    </xsl:variable>
    <xsl:value-of select="string-join($headers,',')"/>
    <xsl:text>,allReviews</xsl:text>
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:template match="item">
    <xsl:value-of select="$newLine"/>
    <xsl:variable name="itemId" select="generate-id()"/>
    <xsl:for-each select="$allFields">
      <xsl:variable name="fieldName" select="local-name()"/>
      <xsl:variable name="fields" select="key('field', $fieldName)[generate-id(ancestor::item)=$itemId]"/>
      <xsl:for-each select="$fields">
        <xsl:value-of select="."/>
        <xsl:value-of select="$comma"/>
      </xsl:for-each>
      <xsl:variable name="index" select="position()"/>
      <xsl:for-each select="1 to $reptNums[$index] - count($fields)">
        <xsl:value-of select="$comma"/>
      </xsl:for-each>
    </xsl:for-each>
    <xsl:value-of select="string-join(reviews/review, $pipe)"/>
  </xsl:template>

</xsl:stylesheet>
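
As a quick sanity check of the allReviews idea, here is a small Python sketch of the same two steps (regular fields first, then the pipe-joined reviews). It is not the stylesheet: your XML with reviews is not shown in the question, so the reviews/review structure and the sample values are assumptions, and the padding of repeated fields is left out to keep it short.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: <reviews> is assumed to be a direct child of <item>.
SAMPLE = """<itemList>
  <item>
    <userID>123</userID>
    <reviews><review>good</review><review>fast delivery</review></reviews>
  </item>
  <item>
    <userID>456</userID>
  </item>
</itemList>"""


def rows_with_reviews(root, pipe="|"):
    """One CSV line per item: regular leaf values, then all reviews joined by pipe."""
    lines = []
    for item in root.iter("item"):
        # Regular fields: every leaf element except <review>.
        regular = [el.text or "" for el in item.iter()
                   if len(el) == 0 and el.tag != "review"]
        # Explicit path relative to the item, like reviews/review in the stylesheet.
        reviews = [r.text or "" for r in item.findall("reviews/review")]
        lines.append(",".join(regular + [pipe.join(reviews)]))
    return lines


lines = rows_with_reviews(ET.fromstring(SAMPLE))
print("\n".join(lines))
```

An item without reviews still gets the allReviews column; it is simply empty, which is why the second line ends with a comma.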

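Edit regarding reviews defined as reviews/review/*

Sorry, I missed that the review leaf tags are one level deeper.

The required corrections are also simple:

Change the instruction creating the field key to:

<xsl:key name="field" match="//*[not(*) and not(ancestor::reviews)]" use="local-name()"/>

Actually, only the match attribute changes. The key now includes:

leaf nodes (//*[not(*)], as before),

but excluding descendants of the reviews tag (not(ancestor::reviews)).

Change the instruction creating the second part of the content row to:

<xsl:value-of select="string-join(reviews//*[not(*)], $pipe)"/>

Actually, the change concerns only the first argument of the string-join function.

Now allReviews is created from all leaf nodes that are descendants of reviews (within the current item).

Note that you can now also have review leaf nodes without children, e.g. <review>xxx</review>.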
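
For comparison, the corrected selection can be sketched in Python like this. Again this is only an illustration: the element names rating and text inside review are hypothetical, since the actual children of review are not shown in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical item: one review with child elements, one plain-text review.
SAMPLE = """<item>
  <userID>123</userID>
  <reviews>
    <review><rating>5</rating><text>good</text></review>
    <review>plain</review>
  </reviews>
</item>"""

item = ET.fromstring(SAMPLE)
reviews = item.find("reviews")

# Analogue of reviews//*[not(*)]: leaf descendants of <reviews>, not <reviews> itself.
review_leaves = []
if reviews is not None:
    review_leaves = [el for el in reviews.iter()
                     if el is not reviews and len(el) == 0]

# Analogue of not(ancestor::reviews): regular fields skip everything under <reviews>.
under_reviews = set(review_leaves)
regular = [el.text or "" for el in item.iter()
           if len(el) == 0 and el not in under_reviews]

all_reviews = "|".join(el.text or "" for el in review_leaves)
print(",".join(regular + [all_reviews]))
```

Both the nested values and the plain-text review end up in the single allReviews column, in document order.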
