I'm an ASP.NET developer working in C#, and I receive text like this from the client:
> <p><a
> href="http://www.vogue.co.uk/person/kate-winslet">KATE
> WINSLET</a>&nbsp;has given birth to a 9lb baby boy. The
> Oscar-winning actress welcomed the baby with her husband Ned Rocknroll
> at a hospital in Sussex.</p>
>
> <p>&quot;Kate had &#39;Baby Boy Winslet&#39; on
> Saturday at an NHS Hospital,&quot; Winslet&#39;s spokeswoman
> said, adding that the family were &quot;thrilled to
> bits&quot;.</p>
>
> <p>The announcement suggests that the child might bear his
> mother&#39;s surname, rather than his father&#39;s slightly
> more unusual moniker.</p>
>
> <p>The baby is Winslet&#39;s third - she is already mother
> to Mia, 13, and Joe, eight, &nbsp;from previous relationships -
> and her husband&#39;s first. They met on Necker Island, owned by
> Rocknroll&#39;s uncle, Richard Branson, and<a
> href="http://www.vogue.co.uk/news/2013/kate-winslet-married-to-ned-rocknroller---wedding-details">married almost a year ago</a>&nbsp;in New York.</p>
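I need a way to extract the real text, without tags and special characters, using SQL Server 2008 or later. How can I do this?
Answer 0 (score: 1)
The best I can suggest is to use a .NET HTML parser, or something similar, wrapped in a SQL CLR function. Alternatively, you can wrap a regular expression in SQL CLR if you want.
Note the limitations of regular expressions: http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html
Raw SQL won't do this well: it is not a string (or HTML) processing language.
Answer 1 (score: 0)
HTML is very complex, and trying to do this without an HTML parser is a really bad idea.
You may be interested in This Question. The accepted answer runs Lynx from the command line and dumps the output to a file. If you can do this outside of the user's page load, it is probably the best option.
Answer 2 (score: 0)
I recently had the same requirement (removing HTML tags and entities), so I developed this function in SQL Server.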
CREATE FUNCTION CTU_FN_StripHTML (@dirtyText NVARCHAR(MAX))
RETURNS NVARCHAR(MAX)
AS
BEGIN
    -- Cleaned text: starts as the trimmed input
    DECLARE @cleanText NVARCHAR(MAX) = RTRIM(LTRIM(@dirtyText));
    -- HTML tags: position of the first '<' that has a '>' somewhere after it.
    -- INT rather than SMALLINT, because positions in NVARCHAR(MAX) text can exceed 32,767.
    DECLARE @tagStart INT = PATINDEX('%<%>%', @cleanText);
    DECLARE @tagEnd INT;
    DECLARE @tagLength INT;
    -- HTML entities: position of the first '&' that has a ';' somewhere after it
    DECLARE @entityStart INT = PATINDEX('%&%;%', @cleanText);
    DECLARE @entityEnd INT;
    DECLARE @entityLength INT;
    -- Each pass removes at most one tag and one entity; stop when neither is found
    WHILE @tagStart > 0 OR @entityStart > 0
    BEGIN
        -- Remove HTML tag
        SET @tagStart = PATINDEX('%<%>%', @cleanText);
        IF @tagStart > 0
        BEGIN
            SET @tagEnd = CHARINDEX('>', @cleanText, @tagStart);
            SET @tagLength = (@tagEnd - @tagStart) + 1;
            SET @cleanText = STUFF(@cleanText, @tagStart, @tagLength, '');
        END;
        -- Remove HTML entity (it is deleted, not decoded)
        SET @entityStart = PATINDEX('%&%;%', @cleanText);
        IF @entityStart > 0
        BEGIN
            SET @entityEnd = CHARINDEX(';', @cleanText, @entityStart);
            SET @entityLength = (@entityEnd - @entityStart) + 1;
            SET @cleanText = STUFF(@cleanText, @entityStart, @entityLength, '');
        END;
    END;
    SET @cleanText = RTRIM(LTRIM(@cleanText));
    RETURN @cleanText;
END;
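
A short usage sketch may help show what the function does and does not do. It assumes the function was created in the `dbo` schema (scalar functions must be called with a schema-qualified name); the sample strings are made up for illustration. Note that entities are deleted rather than decoded, so quotes written as `&#39;` disappear, and any plain `&` followed later by a `;` is treated as an entity.

```sql
-- Assumption: CTU_FN_StripHTML lives in the dbo schema.

-- Tags and entities are both removed.
SELECT dbo.CTU_FN_StripHTML(
    N'<p>Kate had &#39;Baby Boy Winslet&#39; on <a href="#">Saturday</a>.</p>'
) AS CleanText;
-- Returns: Kate had Baby Boy Winslet on Saturday.

-- Limitation: a literal ampersand followed later by a semicolon
-- is mistaken for an entity, and everything in between is lost.
SELECT dbo.CTU_FN_StripHTML(N'AT&T rose 5%; Verizon fell') AS CleanText;
-- Returns: AT Verizon fell
```

If the quotes and non-breaking spaces need to survive as real characters, the entities would have to be decoded rather than deleted, which is where the HTML parser or SQL CLR approach suggested in the other answers is likely the better fit.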