XSLT: counting co-occurrences of specific child elements

Time: 2014-02-15 20:59:42

Tags: xml xslt

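I am trying to count the co-occurrences of specific people in the event records of an XML document. My source document consists of event elements, which contain prose in p elements and bibliographic records in bibl elements; both contain references to people. I want to be able to count how often two people appear together in the same event, across the whole document. I have been using XSLT 2.0, but I can switch to 3.0.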

(For example, how can I arrive at the answer 3 for Nancy Drew and Dick Tracy in the events below? Or 1 for Dick Tracy and Sam Spade?)

<listEvent>
        <event xml:id="e1">
           <p>pretium eget erat eu cursus. Duis pulvinar lectus sed quam vehicula tincidunt in
              vel nunc. Cras convallis elementum diam. Sed nec viverra magna. Then <name
                 SameAs="detectives.xml#ND">Nancy Drew</name> solved the case. A consequat
              tortor molestie ut. Praesent lobortis ipsum sit amet bibendum consequat. </p>

           <bibl><name SameAs="detectives.xml#DT">Tracy, Dick</name>. The Mysterious Case of the
              Orange Fish. Penguin Publishing. </bibl>
           <bibl><name SameAs="detectives.xml#SH">Holmes, Sherlock</name>. The Case of the Blue
              Carbuncle Penguin Publishing. </bibl>

        </event>
        <event xml:id="e2">
           <p> facilisis turpis eu, gravida enim. Mauris adipiscing magna consequat dolor
              auctor, sit amet tincidunt felis auctor. <name SameAs="detectives.xml#ND">Nancy
                 Drew</name> and <name SameAs="detectives.xml#DT">Dick Tracy</name> went into
              business together. Aliquam pharetra semper erat, at viverra tellus vestibulum
              quis. Sed facilisis convallis justo, suscipit fermentum lorem egestas nec.
              Phasellus in aliquam eros, vitae fringilla augue </p>

           <bibl><name SameAs="detectives.xml#TH">Hardy, Tom</name>. Growing Up Is Hard to Do:
              The Story of a Boy Detective. Knopf Press. </bibl>
           <bibl><name SameAs="detectives.xml#SH">Holmes, Sherlock</name>. The Case of the Blue
              Carbuncle. Penguin Publishing. </bibl>
           <bibl><name SameAs="detectives.xml#SH">Holmes, Sherlock</name>. The Hound of the
              Baskervilles. Arsenal Press. </bibl>

        </event>
        <event xml:id="e3">
           <p> Curabitur dapibus eu ligula sed elementum. Curabitur sit amet nisi dictum. <name
                 SameAs="detectives.xml#SS">Sam Spade</name> was the only detective in town.
              Donec cursus diam sem, astor. </p>

           <bibl><name SameAs="detectives.xml#TH">Hardy, Tom</name>. Growing Up Is Hard to Do:
              The Story of a Boy Detective. Knopf Press. </bibl>
           <bibl><name SameAs="detectives.xml#SS">Spade, Sam</name>. My Friends' Business
              Ventures. Knopf Press. </bibl>
           <bibl><name SameAs="detectives.xml#DN">Drew, Nancy</name>. Blonde and Curious.
              Arsenal Press.</bibl>

        </event>
        <event xml:id="e4">
           <p> Duis pulvinar lectus sed quam vehicula tincidunt in vel nunc. <name
                 SameAs="detectives.xml#ND">Nancy Drew</name> and <name
                 SameAs="detectives.xml#DT">Dick Tracy</name> made 110% profit that year. Cras
              convallis elementum diam. Sed nec viverra magna. A consequat tortor molestie ut.
              Praesent lobortis ipsum sit amet bibendum consequat. </p>

           <bibl><name SameAs="detectives.xml#SS">Spade, Sam</name>. My Friends' Business
              Ventures. Knopf Press. </bibl>
           <bibl><name SameAs="detectives.xml#MH">Holmes, Mycroft</name>. Sons and Brothers.
              Knopf Press. </bibl>
        </event>
     </listEvent>

@michael.hor257k I like your idea. I am hoping for output that looks like this:

<gexf> <graph><nodes count="77">
<node id="1.0" label="Sam Spade"/>
<node id="2.0" label="Dick Tracy"/>
<node id="3.0" label="Nancy Drew"/>
…
</nodes>

<edges count="254">
<edge id="1" source="1.0" target="2.0" weight="1.0"/>
<edge id="2" source="1.0" target="3.0" weight="2.0"/>
<edge id="3" source="2.0" target="3.0" weight="3.0"/>
…
</edges>
</graph>
</gexf>

... The @weight value is the part I am having trouble computing.

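I have managed to assign a node @id to each person. The node @ids then make up the @source and @target values (the first edge is Sam Spade and Dick Tracy, the second Sam Spade and Nancy Drew), and @weight should be the number of times the two appear together in the document. (I have simplified my example, perhaps too much. In my actual source document, each element carries a number of other attributes and values, including an @n for each person's name, so populating @id, @source and @target with xsl:value-of is easy.)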

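@tim, not to worry: @SameAs points to an authority list, so that no matter how a person's name is spelled in the prose (for example, Lucy, Miss Graham and Mrs. L. Foster can all be the same woman, as a girl and before and after her marriage), or inverted in a bibliographic entry, it can be resolved to a single person.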

1 answer:

Answer 0: (score: 0)

  

"not to worry, @SameAs points to an authority list"

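Well, XSLT depends on what is actually in the XML source document, so the counts here are taken on the @SameAs values as written, before any differing values are resolved to the same person.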

  

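"in my actual source document, there are a bunch of other attributes and values in each of the elements, including an @n for each person's name"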

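OK, since we don't have that, I have used the @SameAs attribute as if it were a unique id. The following is actually an XSLT 1.0 stylesheet, augmented by the EXSLT set:distinct() function. It is only a sketch, with some scaffolding left in, so we can see whether this is heading in the right direction.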

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:set="http://exslt.org/sets"
extension-element-prefixes="set">
<xsl:output method="xml" version="1.0" encoding="utf-8" indent="yes"/>

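<!-- Index each event by every @SameAs value that occurs inside it -->
<xsl:key name="eventByID" match="event" use=".//name/@SameAs" />

<!-- One attribute node per distinct @SameAs value (its first occurrence) -->
<xsl:variable name="distinct_nodes" select="set:distinct(/listEvent/event//name/@SameAs)" />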

<xsl:template match="/">
<graph>
    <nodes>
        <xsl:for-each select="$distinct_nodes">
            <node id="{.}"/>
        </xsl:for-each>
    </nodes>
    <edges>
        <xsl:for-each select="$distinct_nodes[not(position()=last())]">
            <xsl:variable name="source" select="." />
            <xsl:variable name="pos" select="position()" />
                <xsl:for-each select="$distinct_nodes[position()>$pos]">
                    <xsl:variable name="target" select="." />
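                    <!-- Events indexed under $source whose @xml:id also appears among the events indexed under $target -->
                    <xsl:variable name="common_events" select="key('eventByID', $source)[@xml:id=key('eventByID', $target)/@xml:id]" />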
                    <xsl:if test="$common_events">
                        <edge source="{$source}" target="{$target}" weight="{count($common_events)}">
                        <!-- use this for test purposes -->
                            <!-- 
                            <xsl:for-each select="$common_events">
                                <event id="{@xml:id}"/>
                            </xsl:for-each>
                             -->
                        </edge>
                    </xsl:if>
                </xsl:for-each>
        </xsl:for-each>
    </edges>
</graph>
</xsl:template>
</xsl:stylesheet>
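
For a single pair, the weight is simply the number of event elements that contain both ids. A minimal sketch for spot-checking one edge, assuming the sample's un-namespaced listEvent/event structure; the parameter names a and b are illustrative and not part of the stylesheet above:

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical parameters: the two @SameAs values to compare -->
  <xsl:param name="a" select="'detectives.xml#ND'"/>
  <xsl:param name="b" select="'detectives.xml#DT'"/>

  <xsl:template match="/">
    <!-- Count the events that contain at least one name for each id -->
    <xsl:value-of select="count(//event[.//name/@SameAs = $a][.//name/@SameAs = $b])"/>
  </xsl:template>
</xsl:stylesheet>
```

With the default parameters this prints 3 for the sample; with detectives.xml#DT and detectives.xml#SS it prints 1. Note that event e3 tags "Drew, Nancy" as detectives.xml#DN rather than detectives.xml#ND, so that mention counts as a different person.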

Applied to your sample XML, the EXSLT-based stylesheet produces:

<?xml version="1.0" encoding="utf-8"?>
<graph>
   <nodes>
      <node id="detectives.xml#ND"/>
      <node id="detectives.xml#DT"/>
      <node id="detectives.xml#SH"/>
      <node id="detectives.xml#TH"/>
      <node id="detectives.xml#SS"/>
      <node id="detectives.xml#DN"/>
      <node id="detectives.xml#MH"/>
   </nodes>
   <edges>
      <edge source="detectives.xml#ND" target="detectives.xml#DT" weight="3"/>
      <edge source="detectives.xml#ND" target="detectives.xml#SH" weight="2"/>
      <edge source="detectives.xml#ND" target="detectives.xml#TH" weight="1"/>
      <edge source="detectives.xml#ND" target="detectives.xml#SS" weight="1"/>
      <edge source="detectives.xml#ND" target="detectives.xml#MH" weight="1"/>
      <edge source="detectives.xml#DT" target="detectives.xml#SH" weight="2"/>
      <edge source="detectives.xml#DT" target="detectives.xml#TH" weight="1"/>
      <edge source="detectives.xml#DT" target="detectives.xml#SS" weight="1"/>
      <edge source="detectives.xml#DT" target="detectives.xml#MH" weight="1"/>
      <edge source="detectives.xml#SH" target="detectives.xml#TH" weight="1"/>
      <edge source="detectives.xml#TH" target="detectives.xml#SS" weight="1"/>
      <edge source="detectives.xml#TH" target="detectives.xml#DN" weight="1"/>
      <edge source="detectives.xml#SS" target="detectives.xml#DN" weight="1"/>
      <edge source="detectives.xml#SS" target="detectives.xml#MH" weight="1"/>
   </edges>
</graph>
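
Since the question mentions XSLT 2.0, the same idea can be sketched without EXSLT. This is a hedged variant rather than the stylesheet used above: it assumes an XSLT 2.0 processor and the sample's un-namespaced listEvent/event structure. It relies on distinct-values(), whose result order is implementation-dependent, so the order of the edges (and which id ends up as source) may differ from the output above.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Index each event by every @SameAs value that occurs inside it -->
  <xsl:key name="eventByID" match="event" use=".//name/@SameAs"/>

  <xsl:template match="/">
    <!-- Keep the document node: key() needs it once the context item is an atomic value -->
    <xsl:variable name="doc" select="/"/>
    <!-- Distinct person ids as atomic values -->
    <xsl:variable name="ids" select="distinct-values(/listEvent/event//name/@SameAs)"/>
    <graph>
      <nodes>
        <xsl:for-each select="$ids">
          <node id="{.}"/>
        </xsl:for-each>
      </nodes>
      <edges>
        <xsl:for-each select="$ids">
          <xsl:variable name="source" select="."/>
          <xsl:variable name="pos" select="position()"/>
          <!-- Pair each id only with the ids after it, so every pair is visited once -->
          <xsl:for-each select="subsequence($ids, $pos + 1)">
            <!-- Events that mention both ids -->
            <xsl:variable name="common"
                select="key('eventByID', $source, $doc) intersect key('eventByID', ., $doc)"/>
            <xsl:if test="exists($common)">
              <edge source="{$source}" target="{.}" weight="{count($common)}"/>
            </xsl:if>
          </xsl:for-each>
        </xsl:for-each>
      </edges>
    </graph>
  </xsl:template>
</xsl:stylesheet>
```

The intersect operator replaces the @xml:id comparison used in the 1.0 version, because both key() calls return event nodes from the same document. Mapping the ids to the numeric node ids and labels of the desired GEXF output would still need the @n values from the real source document.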