SQL Server regular expression to remove tags

Time: 2017-09-15 12:33:08

Tags: sql sql-server regex sql-server-2012

I have the following HTML content in my data:

outer text <span class="cssname">inner text to be removed along with tags</span> further text

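In a query, I want to use a regular expression to remove every occurrence of one specific tag, <span> with class='cssname', together with its inner text.

The output I expect is: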

'outer text further text'

2 answers:

Answer 0: (score: 0)

SQL Server does not fully support regular expressions the way other languages do. The following works for a single tag.

$ cat tst.awk
BEGIN { split("10-1 15 17",tmp); for (i in tmp) goodVals[tmp[i]] }
$2 != prevPivot { prtCurrSet() }
{ seen[$9]; currSet = currSet $0 ORS; prevPivot = $2 }
END { prtCurrSet() }
function prtCurrSet(    val,allGoodPresent,someBadPresent) {
    allGoodPresent = 1
    for (val in goodVals) {
        if ( !(val in seen) ) {
            allGoodPresent = 0
        }
        delete seen[val]
    }
    someBadPresent = length(seen)
    if ( allGoodPresent && !someBadPresent ) {
        printf "%s", currSet
    }
    currSet = ""
    delete seen
}

$ awk -f tst.awk file
S   236 1365    *   0   *   *   *   15  1   c474    152
H   236 279 95  +   0   0   765I279M321I    10-1    1   s7689   1
H   236 301 99.7    -   0   0   908I301M156I    15  1   s8443   1
H   236 563 95.2    -   0   0   728I563M74I 17  1   c1725   12
H   236 97  97.9    -   0   0   732I97M536I 17  1   s11472  1
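
For the single-tag case this answer describes, one way to avoid regular expressions entirely is to locate the opening and closing tags with CHARINDEX and cut the span out with STUFF. The following is only a minimal sketch under stated assumptions: the variable names (@html, @open, @close, @start, @end) are illustrative and not from the original answer, the opening tag is assumed to appear exactly as written in the question, and nested or repeated spans are not handled.

```sql
-- Hypothetical sketch: remove the first <span class="cssname">...</span>
-- from a string without regular expressions.
DECLARE @html  NVARCHAR(MAX) =
    N'outer text <span class="cssname">inner text to be removed along with tags</span> further text';
DECLARE @open  NVARCHAR(50) = N'<span class="cssname">';  -- assumed exact form of the opening tag
DECLARE @close NVARCHAR(50) = N'</span>';

-- Position of the opening tag (0 if it is absent)
DECLARE @start INT = CHARINDEX(@open, @html);
-- Position of the first closing tag at or after the opening tag
DECLARE @end   INT = CASE WHEN @start > 0 THEN CHARINDEX(@close, @html, @start) ELSE 0 END;

SELECT stripped =
    CASE
        -- Delete everything from the opening tag through the end of the closing tag
        WHEN @start > 0 AND @end > 0
            THEN STUFF(@html, @start, @end + LEN(@close) - @start, N'')
        -- Leave the text unchanged when no complete tag pair is found
        ELSE @html
    END;
```

Because the spaces on both sides of the removed span are kept, the text before and after the span ends up separated by two spaces.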

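Answer 1: (score: 0)

This rewrites the HTML so that the regular text is wrapped in <content> elements, then casts the result to XML. That happens in the CROSS APPLY part.

The second step uses XQuery to select only the text inside the <content> elements, which strips out the <span> elements.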

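-- Sample data: one row holding the HTML fragment from the question
DECLARE @tt TABLE(t NVARCHAR(MAX));
INSERT INTO @tt(t) VALUES(N'outer text <span class="cssname">inner text to be removed along with tags</span> further text');

SELECT
    -- Step 2: return only the text nodes that sit directly under each <content> element
    stripped=CAST(x.query('for $i in (/content) return $i/text()') AS NVARCHAR(MAX))
FROM
    @tt
    CROSS APPLY (
        SELECT
            -- Step 1: close <content> before each <span, reopen it after each /span>, then cast to XML
            x=CAST('<content>'+REPLACE(REPLACE(t,'<span','</content><span'),'/span>','/span><content>')+'</content>' AS XML)
    ) AS f;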

Result:

outer text  further text
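
The result above keeps the spaces from both sides of the removed span, so it contains two spaces where the expected output in the question has one. If that matters, a possible follow-up step is to collapse the double space with REPLACE. This is a sketch only: @stripped is a hypothetical variable standing in for the query result, and a single REPLACE pass reduces each pair of spaces once, so runs of three or more spaces would need repeated passes.

```sql
-- Hypothetical sketch: collapse the double space left behind after stripping the span.
DECLARE @stripped NVARCHAR(MAX) = N'outer text  further text';  -- stand-in for the query result

-- Replace each pair of spaces with a single space (one pass only)
SELECT cleaned = REPLACE(@stripped, N'  ', N' ');
```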