Question

This is the structure of my XML document:

<root>
    <txt>text here http <b>may</b> occur <i>many<sup>TM</sup></i> times.</txt>
</root>

After processing, it should look like this:

<root>
    <txt>text here </txt>
    <url>http</url>
    <txt> <b>may</b> occur <i>many<sup>TM</sup></i> times.</txt>
</root>

(Line breaks added manually for clarity.)

The following template gets it "almost" right, but the parts I commented out are of course not valid:

<xsl:template match="txt/text()[contains(.,'http')]">
    <xsl:variable name="here" select="." />

    <xsl:analyze-string select="." regex="htt[^ ]+">

        <xsl:matching-substring>
                            <!-- this would solve all problems: 
                             let's just close the txt-element for a second ...
                <xsl:text></txt></xsl:text>
                            -->
            <xsl:element name="uri">
                <xsl:attribute name="href" select="." />
                <xsl:value-of select="."/>
            </xsl:element>
                            <!-- ... and open the txt-element again: nice!
                <xsl:text><txt></xsl:text>
                             -->    
        </xsl:matching-substring>

        <xsl:non-matching-substring>
                         <txt> <!-- not needed for the fake -->
            <xsl:copy-of select="."/>
                         </txt> <!-- ditto -->
        </xsl:non-matching-substring>
    </xsl:analyze-string>
</xsl:template>

Instead, I use additional templates to wrap every other part of txt in its own txt element, like this. The result is also valid, but not usable:

<xsl:template match="txt">
    <!-- only needed for the fake solution above:
            <xsl:copy> -->
        <xsl:apply-templates />
    <!-- </xsl:copy> -->
</xsl:template>

<xsl:template match="txt/text()[not(contains(.,'http'))]">
    <txt>
        <xsl:copy-of select="." />
    </txt>
</xsl:template>

<xsl:template match="txt/*" name="element_wrapper">
    <txt>
        <xsl:copy>
            <xsl:apply-templates />
        </xsl:copy>
    </txt>
</xsl:template>

The result is ugly, but valid:

<root>
    <txt>text here </txt>
    <url>http</url>
    <txt> </txt>
    <txt><b>may</b></txt>
    <txt> occur </txt>
    <txt><i>many<sup>TM</sup></i></txt>
    <txt> times.</txt>
</root>

(Again, line breaks added by me.)

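All the other "solutions" I have seen so far split at element boundaries or only tokenize strings, but they do not split in the middle of a text node. Perhaps my working solution could be cleaned up by removing every adjacent </txt><txt> pair, but I don't know how to do that.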

Answer 1

I suggest two steps using modes: the first uses xsl:analyze-string on text nodes to turn http (or its variants) into url elements, and the second uses for-each-group group-starting-with="url" to split the content:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0">

    <xsl:template match="@* | node()" mode="#all">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()" mode="#current"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="txt">
        <xsl:variable name="links">
            <xsl:copy>
                <xsl:apply-templates mode="insert-links"/>
            </xsl:copy>
        </xsl:variable>
        <xsl:apply-templates select="$links/node()" mode="extract-urls"/>
    </xsl:template>

    <xsl:template match="text()" mode="insert-links" priority="5">
        <xsl:analyze-string select="." regex="http[s]?">
            <xsl:matching-substring>
                <url href="{.}">
                    <xsl:value-of select="."/>
                </url>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>

    <xsl:template match="*[url]" mode="extract-urls">
        <xsl:for-each-group select="node()" group-starting-with="url">
            <xsl:choose>
                <xsl:when test="self::url">
                    <xsl:copy-of select="."/>
                    <xsl:element name="{name(..)}">
                        <xsl:apply-templates select="current-group() except ." mode="extract-urls"/>
                    </xsl:element>
                </xsl:when>
                <xsl:otherwise>
                    <xsl:element name="{name(..)}">
                        <xsl:apply-templates select="current-group()"/>
                    </xsl:element>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:for-each-group>
    </xsl:template>

</xsl:stylesheet>

This transforms the input

<txt>text here http <b>may</b> occur <i>many<sup>TM</sup></i> times and https as well.</txt>

into the output

<txt>text here </txt><url href="http">http</url><txt> <b>may</b> occur <i>many<sup>TM</sup></i> times and </txt><url href="https">https</url><txt> as well.</txt>
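The question also asks whether the "ugly but valid" result could be cleaned up by removing adjacent </txt><txt> pairs. A minimal sketch of that alternative as a separate second pass, assuming XSLT 2.0, a parent element named root as in the question, and txt elements without attributes (any attributes would be dropped by this sketch):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">

    <!-- Ignore indentation between the children of root so that it does
         not interrupt a run of adjacent txt elements. -->
    <xsl:strip-space elements="root"/>

    <!-- Identity template: copy everything that is not handled below. -->
    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- Group the children of root into runs of txt and runs of non-txt. -->
    <xsl:template match="root">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:for-each-group select="node()"
                                group-adjacent="boolean(self::txt)">
                <xsl:choose>
                    <!-- A run of txt elements: emit one txt holding the
                         content of all of them, in document order. -->
                    <xsl:when test="current-grouping-key()">
                        <txt>
                            <xsl:copy-of select="current-group()/node()"/>
                        </txt>
                    </xsl:when>
                    <!-- Anything else (such as url) is copied unchanged. -->
                    <xsl:otherwise>
                        <xsl:apply-templates select="current-group()"/>
                    </xsl:otherwise>
                </xsl:choose>
            </xsl:for-each-group>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

Note that this merges every run of neighbouring txt elements, including ones that were already separate in the original input, so it only fits documents where that is acceptable. The two-mode stylesheet above avoids the problem by never creating the extra txt elements in the first place.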

How do I split an element with mixed content?
