How do I extract file names from a field containing HTML content in SQL Server?

Time: 2013-04-25 21:05:25

Tags: sql sql-server sql-server-2000

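We have a CMS that writes blocks of HTML content into a SQL Server database. I know the table name and the field name where these HTML blocks are stored. Some of the HTML contains links to PDF files. Here is a snippet: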

<p>A deferred tuition payment plan, 
or view the <a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf"
target="_blank">list</a>.</p>

I need to extract the PDF file names from all of these HTML content blocks. In the end, I need a list like this:

Tuition-Reimbursement-Deferred.pdf
Some-other-file.pdf

covering every PDF file name in that field.

Any help is appreciated. Thanks.

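Update

I received a lot of responses, thank you very much, but I forgot to mention that we are still using SQL Server 2000 here. So this has to be done with syntax that SQL Server 2000 supports.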

4 Answers:

Answer 0: (score: 3)

Create this function

create function dbo.extract_filenames_from_a_tags (@s nvarchar(max))
returns @res table (pdf nvarchar(max)) as
begin
-- assumes there are no single quotes or double quotes in the PDF filename
declare @i int, @j int, @k int, @tmp nvarchar(max);
set @i = charindex(N'.pdf', @s);
while @i > 0
begin
  select @tmp = left(@s, @i+3);
  select @j = charindex('/', reverse(@tmp)); -- directory delimiter
  select @k = charindex('"', reverse(@tmp)); -- start of href
  if @j = 0 or (@k > 0 and @k < @j) set @j = @k;
  select @k = charindex('''', reverse(@tmp)); -- start of href (single-quoted)
  if @j = 0 or (@k > 0 and @k < @j) set @j = @k;
  insert @res values (substring(@tmp, len(@tmp)-@j+2, len(@tmp)));
  select @s = stuff(@s, 1, @i+4, ''); -- remove everything through ".pdf" plus the character after it
  set @i = charindex(N'.pdf', @s);
end
return
end
GO

Demo using the function

declare @t table (html varchar(max));
insert @t values
  ('
<p>A deferred tuition payment plan, 
or view the <a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf"
target="_blank">list</a>.</p>'),
  ('
<p>A deferred tuition payment plan, 
or view the <a href="Two files here-Reimbursement-Deferred.pdf"
target="_blank">list</a>.</p>And I use single quotes
   <a href=''/look/path/The second file.pdf''
target="_blank">list</a>');

select t.*, p.pdf
from @t t
cross apply dbo.extract_filenames_from_a_tags(html) p;

Result:

|HTML                  |                                       PDF |
--------------------------------------------------------------------
|<p>A deferred tui.... |        Tuition-Reimbursement-Deferred.pdf |
|<p>A deferred tui.... | Two files here-Reimbursement-Deferred.pdf |
|<p>A deferred tui.... |                       The second file.pdf |

SQL Fiddle Demo
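
Since the question's update asks for SQL Server 2000, note that nvarchar(max), CROSS APPLY and multi-row VALUES are not available in that version. The following is a minimal sketch of the same scanning logic written as a cursor loop instead of a function call. It assumes the HTML fits in NVARCHAR(4000); the temp tables #content and #pdf and the sample rows are hypothetical stand-ins for the real table and column, which the question does not name.

```sql
-- Hypothetical sample data; replace #content/html with the real table and column.
CREATE TABLE #content (html NVARCHAR(4000))
INSERT #content (html) VALUES (N'<a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf">list</a>')
INSERT #content (html) VALUES (N'<a href=''/look/path/The second file.pdf''>list</a> <a href="Some-other-file.pdf">x</a>')

-- Collects one row per PDF link found
CREATE TABLE #pdf (pdf NVARCHAR(4000))

DECLARE @s NVARCHAR(4000), @tmp NVARCHAR(4000), @i INT, @j INT, @k INT

DECLARE c CURSOR LOCAL FAST_FORWARD FOR SELECT html FROM #content
OPEN c
FETCH NEXT FROM c INTO @s
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @i = CHARINDEX(N'.pdf', @s)
    WHILE @i > 0
    BEGIN
        SET @tmp = LEFT(@s, @i + 3)                    -- text up to and including ".pdf"
        -- distance from the end of @tmp back to the nearest /, " or ' delimiter
        SET @j = CHARINDEX(N'/', REVERSE(@tmp))
        SET @k = CHARINDEX(N'"', REVERSE(@tmp))
        IF @j = 0 OR (@k > 0 AND @k < @j) SET @j = @k
        SET @k = CHARINDEX(N'''', REVERSE(@tmp))
        IF @j = 0 OR (@k > 0 AND @k < @j) SET @j = @k
        IF @j > 0                                      -- skip matches with no delimiter in front
            INSERT #pdf (pdf) VALUES (RIGHT(@tmp, @j - 1))
        SET @s = STUFF(@s, 1, @i + 3, N'')             -- drop everything through ".pdf"
        SET @i = CHARINDEX(N'.pdf', @s)
    END
    FETCH NEXT FROM c INTO @s
END
CLOSE c
DEALLOCATE c

SELECT DISTINCT pdf FROM #pdf ORDER BY pdf

DROP TABLE #pdf
DROP TABLE #content
```

If the column is actually TEXT or NTEXT, this sketch would not apply as written, because REVERSE and variable assignment do not accept those types; the content would have to be read in chunks first.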

Answer 1: (score: 1)

Well, it's not pretty, but this works using standard Transact-SQL:

SELECT CASE WHEN CHARINDEX('.pdf', html) > 0
            THEN SUBSTRING(
                     html,
                     CHARINDEX('.pdf', html) -
                     PATINDEX(
                         '%["/]%',
                         REVERSE(SUBSTRING(html, 0, CHARINDEX('.pdf', html)))) + 1,
                     PATINDEX(
                         '%["/]%',
                         REVERSE(SUBSTRING(html, 0, CHARINDEX('.pdf', html)))) + 3)
            ELSE NULL
       END AS filename
FROM mytable

If you like, you can extend the list of delimiter characters that may precede the file name, ["/] (here a double quote or a slash).

See the SQL Fiddle demo
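
The expression above returns only the first .pdf match in each row. As a rough sketch of how it might be extended to every match while staying within SQL Server 2000 syntax, the query below joins against a numbers table and applies the same PATINDEX/REVERSE arithmetic at each position where ".pdf" starts. The temp tables #content and #n, the id column and the sample rows are assumptions made for illustration, and the column is assumed to be VARCHAR(8000) or shorter rather than TEXT.

```sql
-- Hypothetical sample table; the real table and column names are not given in the question.
CREATE TABLE #content (id INT IDENTITY(1,1), html VARCHAR(8000))
INSERT #content (html) VALUES ('<p>or view the <a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf" target="_blank">list</a>.</p>')
INSERT #content (html) VALUES ('<a href="a/Some-other-file.pdf">x</a> <a href=''/b/Third.pdf''>y</a>')

-- Numbers 1..8000, built without CTEs (IDENTITY() with SELECT INTO works in SQL Server 2000)
SELECT TOP 8000 IDENTITY(INT, 1, 1) AS n
INTO #n
FROM master.dbo.syscolumns a
CROSS JOIN master.dbo.syscolumns b

-- One output row per position n where ".pdf" begins.
-- PATINDEX on the reversed prefix gives the distance back to the nearest ", ' or / delimiter.
SELECT c.id,
       SUBSTRING(c.html,
                 n.n - PATINDEX('%["''/]%', REVERSE(LEFT(c.html, n.n - 1))) + 1,
                 PATINDEX('%["''/]%', REVERSE(LEFT(c.html, n.n - 1))) + 3) AS filename
FROM #content c
JOIN #n n
  ON n.n <= LEN(c.html) - 3
 AND SUBSTRING(c.html, n.n, 4) = '.pdf'
WHERE PATINDEX('%["''/]%', REVERSE(LEFT(c.html, n.n - 1))) > 0   -- ignore matches with no delimiter in front
ORDER BY c.id, n.n

DROP TABLE #n
DROP TABLE #content
```

Whether ".PDF" in upper case is matched depends on the column's collation; under a case-sensitive collation the comparison would need LOWER() around the SUBSTRING.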

Answer 2: (score: 1)

How about treating the HTML as XML?

declare @t table (html varchar(max));
insert @t 
    select '
    <p>A deferred tuition payment plan, 
    or view the <a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf"
    target="_blank">list</a>.</p>'
    union all
    select '
    <p>A deferred tuition payment plan, 
    or view the <a href="Two files here-Reimbursement-Deferred.pdf"
    target="_blank">list</a>.</p>And I use single quotes
       <a href=''/look/path/The second file.pdf''
    target="_blank">list</a>'

select  [filename] = reverse(left(reverse('/'+p.n.value('@href', 'varchar(100)')), charindex('/',reverse('/'+p.n.value('@href', 'varchar(100)')), 1) - 1))
from    (   select  cast(html as xml)
            from    @t
        ) x(doc)
cross
apply doc.nodes('//a') p(n);

Result:

filename
---------------------------------------------------------------
Tuition-Reimbursement-Deferred.pdf
Two files here-Reimbursement-Deferred.pdf
The second file.pdf

Answer 3: (score: 1)

Try this:

DECLARE @XML XML = 
'<p>A deferred tuition payment plan, 
or view the <a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf"
target="_blank">list</a>.</p>'

SELECT 
      ref_text = t.p.value('./a[1]', 'NVARCHAR(50)')
    , ref_filename = REVERSE(
                        LEFT(REVERSE(t.p.value('./a[1]/@href', 'NVARCHAR(50)')), 
                        CHARINDEX('/',REVERSE(t.p.value('./a[1]/@href', 'NVARCHAR(50)')), 1) - 1))
FROM @XML.nodes('/p') t(p)