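How to strip all HTML tags and special characters from a string using SQL Server

Time: 2013-12-11 10:08:33

Tags: sql sql-server

I work in C# as an ASP.NET developer, and I receive text like this from the client: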

> <p><a
> href="http://www.vogue.co.uk/person/kate-winslet">KATE
> WINSLET</a> has given birth to a 9lb baby boy. The
> Oscar-winning actress welcomed the baby with her husband Ned Rocknroll
> at a hospital in Sussex.</p>
> 
> <p>"Kate had 'Baby Boy Winslet' on
> Saturday at an NHS Hospital," Winslet's spokeswoman
> said, adding that the family were "thrilled to
> bits".</p>
> 
> <p>The announcement suggests that the child might bear his
> mother's surname, rather than his father's slightly
> more unusual moniker.</p>
> 
> <p>The baby is Winslet's third - she is already mother
> to Mia, 13, and Joe, eight,  from previous relationships -
> and her husband's first. They met on Necker Island, owned by
> Rocknroll's uncle, Richard Branson, and<a
> href="http://www.vogue.co.uk/news/2013/kate-winslet-married-to-ned-rocknroller---wedding-details">married almost a year ago</a> in New York.</p>

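Is there a way, using SQL Server 2008 or later, to extract the actual text without the tags and special characters?

3 answers:

Answer 0 (score: 1)

The best I can suggest is to use a .NET HTML parser (or something similar) wrapped in a SQL CLR function. Alternatively, you could wrap a regular expression in SQL CLR if you wanted to.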

Note the limitations of regular expressions: http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html

Raw SQL won't do this well: it is not a string (or HTML) processing language.
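As a rough sketch of the SQL CLR route described in this answer, the T-SQL side of the setup might look like the following. The assembly name `HtmlTools`, the DLL path, the class `HtmlTools.UserDefinedFunctions`, and the method `StripHtml` are all hypothetical placeholders for a .NET assembly you would build yourself; only the T-SQL statements are standard. On SQL Server 2017 and later, `clr strict security` may also require the assembly to be signed or explicitly trusted.

```sql
-- Enable CLR integration on the instance (needs server-level permission).
EXEC sp_configure 'clr enabled', 1;
RECONFIGURE;
GO

-- Register the compiled .NET assembly.
-- ASSUMPTION: the name HtmlTools and the file path are placeholders.
CREATE ASSEMBLY HtmlTools
FROM 'C:\assemblies\HtmlTools.dll'
WITH PERMISSION_SET = SAFE;
GO

-- Expose a static method from the assembly as a scalar function.
-- ASSUMPTION: the class and method names are hypothetical; the method
-- would do the HTML parsing (or regex replacement) in .NET code.
CREATE FUNCTION dbo.StripHtmlClr (@html NVARCHAR(MAX))
RETURNS NVARCHAR(MAX)
AS EXTERNAL NAME HtmlTools.[HtmlTools.UserDefinedFunctions].StripHtml;
GO
```

Once registered, such a function would be called like any other scalar function, for example `SELECT dbo.StripHtmlClr(N'<p>text</p>');`.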

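Answer 1 (score: 0)

HTML is very complex, and trying to do this without an HTML parser is a very bad idea.

You might be interested in This Question. The accepted answer there runs Lynx from the command line and dumps the output to a file. If you can do this outside of the user's page load, it is probably your best option.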

Answer 2 (score: 0)

I recently had the same requirement (removing HTML tags and entities), so I wrote this function in SQL Server.

    CREATE FUNCTION CTU_FN_StripHTML (@dirtyText NVARCHAR(MAX))
    RETURNS NVARCHAR(MAX)
    AS
    BEGIN
        -- Working copy of the input, trimmed of leading and trailing spaces
        DECLARE @cleanText NVARCHAR(MAX) = RTRIM(LTRIM(@dirtyText));

        -- Positions are BIGINT to match what PATINDEX and CHARINDEX return
        -- for NVARCHAR(MAX) input, so long strings do not overflow.
        -- HTML tags
        DECLARE @tagStart BIGINT = PATINDEX('%<%>%', @cleanText);
        DECLARE @tagEnd BIGINT;
        DECLARE @tagLength BIGINT;
        -- HTML entities
        DECLARE @entityStart BIGINT = PATINDEX('%&%;%', @cleanText);
        DECLARE @entityEnd BIGINT;
        DECLARE @entityLength BIGINT;

        -- Each pass removes at most one tag and one entity
        WHILE @tagStart > 0 OR @entityStart > 0
        BEGIN
            -- Remove the first remaining HTML tag: everything from '<' to the next '>'
            SET @tagStart = PATINDEX('%<%>%', @cleanText);
            IF @tagStart > 0
            BEGIN
                SET @tagEnd = CHARINDEX('>', @cleanText, @tagStart);
                SET @tagLength = (@tagEnd - @tagStart) + 1;
                SET @cleanText = STUFF(@cleanText, @tagStart, @tagLength, '');
            END;

            -- Remove the first remaining HTML entity: everything from '&' to the next ';'
            -- Note: this also deletes plain text between a literal '&' and a later ';'
            SET @entityStart = PATINDEX('%&%;%', @cleanText);
            IF @entityStart > 0
            BEGIN
                SET @entityEnd = CHARINDEX(';', @cleanText, @entityStart);
                SET @entityLength = (@entityEnd - @entityStart) + 1;
                SET @cleanText = STUFF(@cleanText, @entityStart, @entityLength, '');
            END;
        END;

        SET @cleanText = RTRIM(LTRIM(@cleanText));
        RETURN @cleanText;
    END;
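
A minimal usage sketch for the function above, assuming it was created in the `dbo` schema (scalar functions must be called with at least a two-part name). The sample string is shortened from the question's text, with an `&amp;` entity added purely for illustration.

```sql
-- ASSUMPTION: the function lives in the dbo schema.
-- The sample text is illustrative, adapted from the question.
DECLARE @html NVARCHAR(MAX) =
    N'<p><a href="http://www.vogue.co.uk/person/kate-winslet">KATE WINSLET</a> welcomed the baby &amp; is doing well.</p>';

-- Returns the text with tags and entities removed
SELECT dbo.CTU_FN_StripHTML(@html) AS PlainText;
```

Note that entities are deleted rather than decoded, so `&amp;` does not become `&`. Because the entity pattern is simply "an ampersand followed somewhere later by a semicolon", ordinary text containing a literal `&` and a later `;` would lose everything between them as well. If that matters for your data, a real HTML parser, as suggested in the other answers, is likely the safer choice.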