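Stuck with an HTML-to-XML conversion

Time: 2017-10-01 13:27:10

Tags: html xml xslt xslt-2.0

I'm fairly new to XSLT and am now stuck with an HTML-to-XML conversion script. I did manage to get the basics done, but some of the problems are a bit too deep for me, and I hope someone can point me in the right direction. The situation is as follows:

I have an HTML file that I have already converted a little, so that it looks like this: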

<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/xsl" href="http://www.beloningenbelasting.nl/dtd/bb.xslt"?>
<!-- Copyright 2017 Fiscaal up to Date BV, Eindhoven NL -->
<!DOCTYPE BeloningEnBelasting SYSTEM "http://www.beloningenbelasting.nl/dtd/bb.dtd">
<BeloningEnBelasting>
    <body>
        <div id="_idContainer000">
            <p>remove</p>
        </div>
        <div id="_idContainer001">
            <p>remove</p>
        </div>
        <div id="_idContainer002">
            <p>remove</p>
        </div>
        <div id="_idContainer003">
            <p>info of id 4 could be here</p>
        </div>
        <div id="_idContainer004">
            <p class="Body-Text-Extra" lang="en-US">Hoofdrubriek</p>
            <p class="DUkopartikel">art. nr. en titel</p>
            <p class="platte-tekst">platte tekst met erin: 
                <span class="cursief">tekst cursief</span>
                <span class="vet">tekst vet</span></p>
            <p class="Bodytextvet">bold tekst</p>
            <p class="DUinspring">List item</p>
            <p class="DUbron">bron</p>
            <p class="Body-Text-Extra" lang="en-US">Hoofdrubriek</p>
            <p class="DUkopartikel">art. nr. en titel</p>
            <p class="platte-tekst">platte tekst met erin: 
                <span class="cursief">tekst cursief</span>
                <span class="vet">tekst vet</span></p>
            <p class="Bodytextvet">bold tekst</p>
            <p class="DUinspring">List item</p>
            <p class="DUbron">bron</p>
        </div>
        <div>
            <div id="_idContainer005">
                <p>kan weg</p>
            </div>
        </div>
    </body>
</BeloningEnBelasting>

The result should look like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/xsl" href="http://www.beloningenbelasting.nl/dtd/bb.xslt"?>
<!-- Copyright 2017 Fiscaal up to Date BV, Eindhoven NL -->
<!DOCTYPE BeloningEnBelasting SYSTEM "http://www.beloningenbelasting.nl/dtd/bb.dtd">
<BeloningEnBelasting>
    <artikel id="manual input and copy paste">
        <uitgave>manual input and copy paste</uitgave>
        <datum>manual input and copy paste</datum>
        <rubriek>
            <hoofdrubriek></hoofdrubriek>
        </rubriek>
        <titel></titel>
        <tekst>
            <p> with <i>cursive</i> and <b>bold</b> text</p>
            <ol>
                <li>ordered list</li>
            </ol>
            <ul>
                <li>unordered list</li>
            </ul>
        </tekst>
        <noten>
            <noot>...</noot>
        </noten>
        <bronnen>
            <bron>...</bron>
        </bronnen>
    </artikel>
    <artikel id="manual input and copy paste">
        <uitgave>manual input and copy paste</uitgave>
        <datum>manual input and copy paste</datum>
        <rubriek>
            <hoofdrubriek></hoofdrubriek>
        </rubriek>
        <titel></titel>
        <tekst>
            <p> with <i>cursive</i> and <b>bold</b> text</p>
            <ol>
                <li>ordered list</li>
            </ol>
            <ul>
                <li>unordered list</li>
            </ul>
        </tekst>
        <noten>
            <noot>...</noot>
        </noten>
        <bronnen>
            <bron>...</bron>
        </bronnen>
    </artikel>
</BeloningEnBelasting>

As you can see, pretty much all the data in <div id="_idContainer004"> and <div id="_idContainer003"> should be divided into multiple <artikel> elements.
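
A minimal sketch of one way to do this split, assuming XSLT 2.0 (as tagged) and that every article starts at a <p class="DUkopartikel">: xsl:for-each-group with group-starting-with opens a new group at each such heading. Only _idContainer004 is handled here, the Body-Text-Extra paragraphs are left out of the grouping, and putting everything except DUbron into <tekst> is an assumption rather than a confirmed mapping.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

    <xsl:output method="xml" indent="yes"/>

    <!-- Identity template: copy everything that has no more specific rule. -->
    <xsl:template match="node() | @*">
        <xsl:copy>
            <xsl:apply-templates select="node() | @*"/>
        </xsl:copy>
    </xsl:template>

    <!-- Replace the content container by one <artikel> per article heading. -->
    <xsl:template match="div[@id = '_idContainer004']">
        <!-- Assumption: the section headings (Body-Text-Extra) are handled elsewhere. -->
        <xsl:for-each-group select="p[not(@class = 'Body-Text-Extra')]"
                            group-starting-with="p[@class = 'DUkopartikel']">
            <artikel>
                <!-- The context item is the first item of the group: the heading. -->
                <titel>
                    <xsl:value-of select="normalize-space(.)"/>
                </titel>
                <tekst>
                    <!-- Everything after the heading, except the source lines. -->
                    <xsl:apply-templates
                        select="current-group()[position() gt 1][not(@class = 'DUbron')]"/>
                </tekst>
                <bronnen>
                    <xsl:for-each select="current-group()[@class = 'DUbron']">
                        <bron>
                            <xsl:value-of select="normalize-space(.)"/>
                        </bron>
                    </xsl:for-each>
                </bronnen>
            </artikel>
        </xsl:for-each-group>
    </xsl:template>
</xsl:stylesheet>
```

The paragraphs inside <tekst> go through xsl:apply-templates, so more specific templates for the p and span classes can sit alongside this one; with only the identity template they are copied unchanged. If a paragraph other than Body-Text-Extra preceded the first DUkopartikel, it would form a leading group without a real heading and would need a guard.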

I wrote the following XSL to accomplish the first steps:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0">

    <xsl:template match="node() | @*">
        <xsl:copy>
            <xsl:apply-templates select="node() | @*"/>
        </xsl:copy>
    </xsl:template>
    <xsl:template match="p[@class= 'DUbron']">
        <bronnen>
            <bron>
                <xsl:apply-templates />
            </bron>
        </bronnen>
    </xsl:template>  

    <xsl:template match="p[@class= 'platte-tekst']">
        <p>
            <xsl:apply-templates />
        </p>
    </xsl:template>  

    <xsl:template match="p[@class= 'Bodytextvet']">
        <p>
            <b>
                <xsl:apply-templates />
            </b>
        </p>
    </xsl:template>

    <xsl:template match="p[@class= 'bodycursief']">
        <p>
            <i>
                <xsl:apply-templates />
            </i>
        </p>
    </xsl:template>

    <xsl:template match="p[@class= 'DUinspring']">
        <li>
            <xsl:apply-templates />
        </li>
    </xsl:template>

    <xsl:template match="span[@class= 'vet']">
        <b>
            <xsl:apply-templates />
        </b>
    </xsl:template>

    <xsl:template match="span[@class= 'cursief']">
        <i>
            <xsl:apply-templates />
        </i>
    </xsl:template>

    <xsl:template match="div[@id= '_idContainer000']|div[@id= '_idContainer001']|div[@id= '_idContainer002']"/>
</xsl:stylesheet>

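This XSL only cleans up the document. I'm stuck on creating the <artikel> structure and applying the right <rubriek><hoofdrubriek>main title</hoofdrubriek></rubriek>, because this heading is given only once every few articles and applies to the articles that follow it until a new one appears.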
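
The heading lookup on its own could be sketched like this, assuming each Body-Text-Extra paragraph is a preceding sibling of the article headings it applies to. From a DUkopartikel, preceding-sibling::p[@class = 'Body-Text-Extra'][1] selects the nearest earlier heading, because a positional predicate on a reverse axis counts backwards from the context node. The sketch visits only the article headings and ignores the rest of the input, so it illustrates the lookup rather than a full conversion.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

    <xsl:output method="xml" indent="yes"/>

    <!-- Visit only the article headings; all other input is skipped in this sketch. -->
    <xsl:template match="/">
        <BeloningEnBelasting>
            <xsl:apply-templates select="//p[@class = 'DUkopartikel']"/>
        </BeloningEnBelasting>
    </xsl:template>

    <xsl:template match="p[@class = 'DUkopartikel']">
        <artikel>
            <rubriek>
                <hoofdrubriek>
                    <!-- [1] on the reverse axis is the closest preceding section heading;
                         if there is none, the element simply stays empty. -->
                    <xsl:value-of
                        select="normalize-space(preceding-sibling::p[@class = 'Body-Text-Extra'][1])"/>
                </hoofdrubriek>
            </rubriek>
            <titel>
                <xsl:value-of select="normalize-space(.)"/>
            </titel>
        </artikel>
    </xsl:template>
</xsl:stylesheet>
```

Because the lookup starts from each article heading, several articles in a row under one Body-Text-Extra paragraph would all receive the same <hoofdrubriek> text.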

Can anyone point me in the right direction?

Kind regards,

AJ

PS: the original HTML file:

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta charset="utf-8" />
        <title>Bel&amp;Bel_2017_15</title>
    </head>
    <body id="Bel-Bel_2017_15" lang="nl-NL">
        <div id="_idContainer000">
            <p class="colofon" lang="en-US">unused data</p>
            <p class="colofon" lang="en-US">unused data</p>
            <p class="colofon" lang="en-US">unused data</p>
            <p class="colofon" lang="en-US">V</p>
        </div>
        <div id="_idContainer001">
            <p class="datum" lang="en-US">unused data</p>
            <p class="datum" lang="en-US">unused data</p>
        </div>
        <div id="_idContainer002" class="Basistekstkader">
            <p class="Inhoud" lang="en-GB"><a href="">unused data</a></p>
            <p class="Inhoud" lang="en-GB"><a href="">unused data</a></p>
            <p class="Inhoud" lang="en-GB"><a href="">unused data</a></p>
        </div>
        <div id="_idContainer004">
            <p class="Body-Text-Extra" lang="en-US">Hoofdrubriek</p>
            <p class="DUkopartikel">art. nr. en titel</p>
            <p class="platte-tekst">platte tekst met erin: 
                <span class="cursief">tekst cursief</span>
                <span class="vet">tekst vet</span></p>
            <p class="Bodytextvet">bold tekst</p>
            <p class="DUinspring">List item</p>
            <p class="DUbron">bron</p>
            <p class="Body-Text-Extra" lang="en-US">Hoofdrubriek</p>
            <p class="DUkopartikel">art. nr. en titel</p>
            <p class="platte-tekst">platte tekst met erin: 
                <span class="cursief">tekst cursief</span>
                <span class="vet">tekst vet</span></p>
            <p class="Bodytextvet">bold tekst</p>
            <p class="DUinspring">List item</p>
            <p class="DUbron">bron</p>
        </div>
        <div>
            <div id="_idContainer005">
                <img src="Bel&amp;Bel_2017_15-web-resources/image/Advertenties_APP_FINAL_ZW.png"
                    alt="" />
            </div>
        </div>
    </body>
</html>

0 answers:

No answers