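I'm fairly new to XSLT and am stuck on an HTML-to-XML conversion stylesheet. I've managed the basics, but some of the problems are beyond me, and I hope someone can point me in the right direction. The situation is as follows:
I have an HTML file that I have already partly converted; it now looks like this: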
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/xsl" href="http://www.beloningenbelasting.nl/dtd/bb.xslt"?>
<!-- Copyright 2017 Fiscaal up to Date BV, Eindhoven NL -->
<!DOCTYPE BeloningEnBelasting SYSTEM "http://www.beloningenbelasting.nl/dtd/bb.dtd">
<BeloningEnBelasting>
<body>
<div id="_idContainer000">
<p>remove</p>
</div>
<div id="_idContainer001">
<p>remove</p>
</div>
<div id="_idContainer002">
<p>remove</p>
</div>
<div id="_idContainer003">
<p>info of id 4 could be here</p>
</div>
<div id="_idContainer004">
<p class="Body-Text-Extra" lang="en-US">Hoofdrubriek</p>
<p class="DUkopartikel">art. nr. en titel</p>
<p class="platte-tekst">platte tekst met erin:
<span class="cursief">tekst cursief</span>
<span class="vet">tekst vet</span></p>
<p class="Bodytextvet">bold tekst</p>
<p class="DUinspring">List item</p>
<p class="DUbron">bron</p>
<p class="Body-Text-Extra" lang="en-US">Hoofdrubriek</p>
<p class="DUkopartikel">art. nr. en titel</p>
<p class="platte-tekst">platte tekst met erin:
<span class="cursief">tekst cursief</span>
<span class="vet">tekst vet</span></p>
<p class="Bodytextvet">bold tekst</p>
<p class="DUinspring">List item</p>
<p class="DUbron">bron</p>
</div>
<div>
<div id="_idContainer005">
<p>kan weg</p>
</div>
</div>
</body>
</BeloningEnBelasting>
The result should look like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/xsl" href="http://www.beloningenbelasting.nl/dtd/bb.xslt"?>
<!-- Copyright 2017 Fiscaal up to Date BV, Eindhoven NL -->
<!DOCTYPE BeloningEnBelasting SYSTEM "http://www.beloningenbelasting.nl/dtd/bb.dtd">
<BeloningEnBelasting>
<artikel id="manual input and copy paste">
<uitgave>manual input and copy paste</uitgave>
<datum>manual input and copy paste</datum>
<rubriek>
<hoofdrubriek></hoofdrubriek>
</rubriek>
<titel></titel>
<tekst>
<p> with <i>cursive</i> and <b>bold</b> text</p>
<ol>
<li>ordered list</li>
</ol>
<ul>
<li>unordered list</li>
</ul>
</tekst>
<noten>
<noot>...</noot>
</noten>
<bronnen>
<bron>...</bron>
</bronnen>
</artikel>
<artikel id="manual input and copy paste">
<uitgave>manual input and copy paste</uitgave>
<datum>manual input and copy paste</datum>
<rubriek>
<hoofdrubriek></hoofdrubriek>
</rubriek>
<titel></titel>
<tekst>
<p> with <i>cursive</i> and <b>bold</b> text</p>
<ol>
<li>ordered list</li>
</ol>
<ul>
<li>unordered list</li>
</ul>
</tekst>
<noten>
<noot>...</noot>
</noten>
<bronnen>
<bron>...</bron>
</bronnen>
</artikel>
</BeloningEnBelasting>
As you can see, almost all of the data in <div id="_idContainer004"> or <div id="_idContainer003"> should be split up into multiple <artikel> elements.
I wrote the following XSL as a first step:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="xs"
version="2.0">
<!-- Identity template: copy every node and attribute that no more specific template matches. -->
<xsl:template match="node() | @*">
<xsl:copy>
<xsl:apply-templates select="node() | @*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="p[@class= 'DUbron']">
<bronnen>
<bron>
<xsl:apply-templates />
</bron>
</bronnen>
</xsl:template>
<xsl:template match="p[@class= 'platte-tekst']">
<p>
<xsl:apply-templates />
</p>
</xsl:template>
<xsl:template match="p[@class= 'Bodytextvet']">
<p>
<b>
<xsl:apply-templates />
</b>
</p>
</xsl:template>
<xsl:template match="p[@class= 'bodycursief']">
<p>
<i>
<xsl:apply-templates />
</i>
</p>
</xsl:template>
<xsl:template match="p[@class= 'DUinspring']">
<li>
<xsl:apply-templates />
</li>
</xsl:template>
<xsl:template match="span[@class= 'vet']">
<b>
<xsl:apply-templates />
</b>
</xsl:template>
<xsl:template match="span[@class= 'cursief']">
<i>
<xsl:apply-templates />
</i>
</xsl:template>
<!-- Empty template: drop the containers that hold only unused data. -->
<xsl:template match="div[@id= '_idContainer000']|div[@id= '_idContainer001']|div[@id= '_idContainer002']"/>
</xsl:stylesheet>
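This XSL only cleans up the document. I am stuck on creating the <artikel> structure and on applying the correct <rubriek><hoofdrubriek>main title</hoofdrubriek></rubriek>, because this heading is given only once every few articles and applies to all the articles that follow it until a new heading appears.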
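My current guess is that XSLT 2.0 grouping (xsl:for-each-group with group-starting-with) is the direction to take. Below is a minimal, untested sketch of that idea, not a working solution. It assumes that every article is a run of p elements directly inside div _idContainer004, that each article begins at a p with class DUkopartikel, and that DUinspring items belong in a <ul> (that choice is a guess). The manually entered id, <uitgave> and <datum> are left out, and so are the processing instruction, comment and DOCTYPE at the top of the file.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <!-- Assumption: all articles are p children of div _idContainer004. -->
  <xsl:template match="/BeloningEnBelasting">
    <xsl:copy>
      <!-- Heading paragraphs are left out of the grouped population,
           so each group starts at an article title (DUkopartikel). -->
      <xsl:for-each-group
          select="body/div[@id = '_idContainer004']/p[not(@class = 'Body-Text-Extra')]"
          group-starting-with="p[@class = 'DUkopartikel']">
        <artikel>
          <rubriek>
            <!-- Nearest heading before the title; it carries over
                 to later articles until the next heading. -->
            <hoofdrubriek>
              <xsl:value-of select="preceding-sibling::p[@class = 'Body-Text-Extra'][1]"/>
            </hoofdrubriek>
          </rubriek>
          <!-- The context item is the first p of the group, i.e. the title. -->
          <titel><xsl:value-of select="."/></titel>
          <tekst>
            <!-- Wrap each run of adjacent DUinspring items in one list. -->
            <xsl:for-each-group
                select="current-group()[not(@class = ('DUkopartikel', 'DUbron'))]"
                group-adjacent="@class = 'DUinspring'">
              <xsl:choose>
                <xsl:when test="current-grouping-key()">
                  <ul><xsl:apply-templates select="current-group()"/></ul>
                </xsl:when>
                <xsl:otherwise>
                  <xsl:apply-templates select="current-group()"/>
                </xsl:otherwise>
              </xsl:choose>
            </xsl:for-each-group>
          </tekst>
          <bronnen>
            <xsl:apply-templates select="current-group()[@class = 'DUbron']"/>
          </bronnen>
        </artikel>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

  <!-- Element mappings, reduced from the cleanup stylesheet. -->
  <xsl:template match="p[@class = 'platte-tekst']">
    <p><xsl:apply-templates/></p>
  </xsl:template>
  <xsl:template match="p[@class = 'Bodytextvet']">
    <p><b><xsl:apply-templates/></b></p>
  </xsl:template>
  <xsl:template match="p[@class = 'DUinspring']">
    <li><xsl:apply-templates/></li>
  </xsl:template>
  <xsl:template match="p[@class = 'DUbron']">
    <bron><xsl:apply-templates/></bron>
  </xsl:template>
  <xsl:template match="span[@class = 'vet']">
    <b><xsl:apply-templates/></b>
  </xsl:template>
  <xsl:template match="span[@class = 'cursief']">
    <i><xsl:apply-templates/></i>
  </xsl:template>

</xsl:stylesheet>
```

If any paragraphs come before the first DUkopartikel, they would form a leading group whose <titel> is wrong, so that case would need separate handling. An article without a DUbron paragraph would also get an empty <bronnen> element, which may or may not be acceptable to the DTD.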
Can anyone point me in the right direction?
Kind regards,
AJ
PS: the original HTML file:
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" />
<title>Bel&Bel_2017_15</title>
</head>
<body id="Bel-Bel_2017_15" lang="nl-NL">
<div id="_idContainer000">
<p class="colofon" lang="en-US">unused data</p>
<p class="colofon" lang="en-US">unused data</p>
<p class="colofon" lang="en-US">unused data</p>
<p class="colofon" lang="en-US">V</p>
</div>
<div id="_idContainer001">
<p class="datum" lang="en-US">unused data</p>
<p class="datum" lang="en-US">unused data</p>
</div>
<div id="_idContainer002" class="Basistekstkader">
<p class="Inhoud" lang="en-GB"><a href="">unused data</a></p>
<p class="Inhoud" lang="en-GB"><a href="">unused data</a></p>
<p class="Inhoud" lang="en-GB"><a href="">unused data</a></p>
</div>
<div id="_idContainer004">
<p class="Body-Text-Extra" lang="en-US">Hoofdrubriek</p>
<p class="DUkopartikel">art. nr. en titel</p>
<p class="platte-tekst">platte tekst met erin:
<span class="cursief">tekst cursief</span>
<span class="vet">tekst vet</span></p>
<p class="Bodytextvet">bold tekst</p>
<p class="DUinspring">List item</p>
<p class="DUbron">bron</p>
<p class="Body-Text-Extra" lang="en-US">Hoofdrubriek</p>
<p class="DUkopartikel">art. nr. en titel</p>
<p class="platte-tekst">platte tekst met erin:
<span class="cursief">tekst cursief</span>
<span class="vet">tekst vet</span></p>
<p class="Bodytextvet">bold tekst</p>
<p class="DUinspring">List item</p>
<p class="DUbron">bron</p>
</div>
<div>
<div id="_idContainer005">
<img src="Bel&Bel_2017_15-web-resources/image/Advertenties_APP_FINAL_ZW.png"
alt="" />
</div>
</div>
</body>
</html>