This is the structure of my XML document:
<root>
<txt>text here http <b>may</b> occur <i>many<sup>TM</sup></i> times.</txt>
</root>
After processing, it should look like this:
<root>
<txt>text here </txt>
<url>http</url>
<txt> <b>may</b> occur <i>many<sup>TM</sup></i> times.</txt>
</root>
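(Line breaks added manually for clarity.)
The following template gets it "almost" right, but the parts I commented out are of course not well-formed XSLT: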
<xsl:template match="txt/text()[contains(.,'http')]">
<xsl:variable name="here" select="." />
<xsl:analyze-string select="." regex="htt[^ ]+">
<xsl:matching-substring>
<!-- this would solve all problems:
let's just close the txt-element for a second ...
<xsl:text></txt></xsl:text>
-->
<xsl:element name="url">
<xsl:attribute name="href" select="." />
<xsl:value-of select="."/>
</xsl:element>
<!-- ... and open the txt-element again: nice!
<xsl:text><txt></xsl:text>
-->
</xsl:matching-substring>
<xsl:non-matching-substring>
<txt> <!-- not needed for the fake -->
<xsl:copy-of select="."/>
</txt> <!-- ditto -->
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>
Instead, I use other templates to wrap all the other parts of txt in txt elements, like this. The result is valid, but not usable:
<xsl:template match="txt">
<!-- only needed for the fake solution above:
<xsl:copy> -->
<xsl:apply-templates />
<!-- </xsl:copy> -->
</xsl:template>
<xsl:template match="txt/text()[not(contains(.,'http'))]">
<txt>
<xsl:copy-of select="." />
</txt>
</xsl:template>
<xsl:template match="txt/*" name="element_wrapper">
<txt>
<xsl:copy>
<xsl:apply-templates />
</xsl:copy>
</txt>
</xsl:template>
The result is ugly, but valid:
<root>
<txt>text here </txt>
<url>http</url>
<txt> </txt>
<txt><b>may</b></txt>
<txt> occur </txt>
<txt><i>many<sup>TM</sup></i></txt>
<txt> times.</txt>
</root>
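(Again, line breaks added by me.)
All the other "solutions" I have seen so far split at element boundaries or merely tokenize strings; none of them split in the middle of a text node. Maybe my working solution could be cleaned up by removing every adjacent </txt><txt> pair, but I don't know how to do that.
Answer 0 (score: 0)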
I suggest two steps using modes: the first uses xsl:analyze-string (or a variant) on text nodes to turn each http match into a url element, and the second uses for-each-group with group-starting-with="url" to split the parent element at each url:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="xs"
version="2.0">
<xsl:template match="@* | node()" mode="#all">
<xsl:copy>
<xsl:apply-templates select="@* | node()" mode="#current"/>
</xsl:copy>
</xsl:template>
<xsl:template match="txt">
<xsl:variable name="links">
<xsl:copy>
<xsl:apply-templates mode="insert-links"/>
</xsl:copy>
</xsl:variable>
<xsl:apply-templates select="$links/node()" mode="extract-urls"/>
</xsl:template>
<xsl:template match="text()" mode="insert-links" priority="5">
<xsl:analyze-string select="." regex="http[s]?">
<xsl:matching-substring>
<url href="{.}">
<xsl:value-of select="."/>
</url>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>
<xsl:template match="*[url]" mode="extract-urls">
<xsl:for-each-group select="node()" group-starting-with="url">
<xsl:choose>
<xsl:when test="self::url">
<xsl:copy-of select="."/>
<xsl:element name="{name(..)}">
<xsl:apply-templates select="current-group() except ." mode="extract-urls"/>
</xsl:element>
</xsl:when>
<xsl:otherwise>
<xsl:element name="{name(..)}">
<xsl:apply-templates select="current-group()"/>
</xsl:element>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each-group>
</xsl:template>
</xsl:stylesheet>
This transforms the input
<txt>text here http <b>may</b> occur <i>many<sup>TM</sup></i> times and https as well.</txt>
into the output
<txt>text here </txt><url href="http">http</url><txt> <b>may</b> occur <i>many<sup>TM</sup></i> times and </txt><url href="https">https</url><txt> as well.</txt>
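The alternative raised in the question, merging the adjacent txt elements of the "ugly but valid" result, could be sketched as a separate second pass with group-adjacent. This is only a minimal sketch under a few assumptions: it is run on the output of the question's templates, there are no whitespace text nodes between the txt siblings (the question says the line breaks were added by hand), and the txt elements carry no attributes that need to be kept.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <!-- Identity transformation for everything not handled below. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- For any element with txt children, merge each run of adjacent
       txt siblings into a single txt element. -->
  <xsl:template match="*[txt]">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <!-- The grouping key is true for txt children, false otherwise,
           so consecutive txt elements end up in the same group. -->
      <xsl:for-each-group select="node()"
                          group-adjacent="boolean(self::txt)">
        <xsl:choose>
          <xsl:when test="current-grouping-key()">
            <txt>
              <!-- Keep only the content of the merged txt elements. -->
              <xsl:apply-templates select="current-group()/node()"/>
            </txt>
          </xsl:when>
          <xsl:otherwise>
            <!-- url elements (and anything else) are copied unchanged. -->
            <xsl:apply-templates select="current-group()"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

If the intermediate result does contain whitespace between the txt elements, those text nodes would break the runs apart; stripping them first (for example with xsl:strip-space) would be one way around that.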