Extract text between HTML tags

Time: 2014-08-05 15:45:58

Tags: sql postgresql

I am using PostgreSQL.

I have this query:

 SELECT 
 row_number() OVER (ORDER BY corresp.ID_CORRESP) as rNUM ,
 transfers.id_transfer AS TRANSFER_ID_TRANSFER, 
 corresp.id_corresp as  ID_CORRESP, 
 corresp.ORDERNBR_CORRESP as  ORDERNBR_CORRESP, 
 transfers.text_transfer AS TEXT 
FROM Transfers transfers 
 left outer join correspondence corresp  on corresp.id_corresp = transfers.id_corresp 
 left outer join tranf_corresp_tocc_employee on  tranf_corresp_tocc_employee.id_transfer = transfers.id_transfer
 left outer join employee on tranf_corresp_tocc_employee.id_employe = employee.id_employe 
 left outer join employee_lang on employee.id_employe = employee_lang.id_employe
 left outer join unit on employee.id_unit = unit.id_unit 
 left outer join unit_lang on unit_lang.id_unit =unit.id_unit
 left outer join action on action.id_action = transfers.id_action  
 left outer join action_lang on action_lang.id_action = action.id_action 
 WHERE transfers.status_transfer ='P' 

The problem is that transfers.text_transfer AS TEXT returns results like this:

<div align="right"><font color="3366FF"><b><font size="3">it's&nbsp;test</font></b></font></div>

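I am looking for a way to extract the actual data from this result, that is, to get just it's test.

So I would like to add some code to my query that extracts the data from inside the HTML tags. I think I should use the REGEXP_REPLACE function.

Updated:

When I try to run this query: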

CREATE LANGUAGE plperlu;

I get this error:

ERROR:  could not load library "C:/Program Files/PostgreSQL/9.2/lib/plperl.dll": %1 is not a valid Win32 application.


********** Error **********

ERROR: could not load library "C:/Program Files/PostgreSQL/9.2/lib/plperl.dll": %1 is not a valid Win32 application.
SQL state: 58P01

The file plperl.dll does exist under C:/Program Files/PostgreSQL/9.2/lib.

Updated:

I tried another approach, based on this example:

CREATE FUNCTION testFunction
(@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
DECLARE @Start INT
DECLARE @End INT
DECLARE @Length INT
SET @Start = CHARINDEX('<',@HTMLText)
SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
SET @Length = (@End - @Start) + 1
WHILE @Start > 0
AND @End > 0
AND @Length > 0
BEGIN
SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')
SET @Start = CHARINDEX('<',@HTMLText)
SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
SET @Length = (@End - @Start) + 1
END
RETURN LTRIM(RTRIM(@HTMLText))
END

But I get this error:

ERROR:  syntax error at or near "@"
LINE 2: (@HTMLText VARCHAR(MAX))
         ^

********** Error **********

ERROR: syntax error at or near "@"
SQL state: 42601
Character: 31
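
The syntax error arises because the function above is written in SQL Server's T-SQL dialect (@-prefixed variables, VARCHAR(MAX), CHARINDEX, STUFF), which PostgreSQL does not parse. A rough PL/pgSQL equivalent might look like the sketch below. The function name strip_tags_loop and its variable names are invented for illustration, and like the original it simply deletes everything between '<' and '>' without decoding entities such as &nbsp;.

```sql
-- Hypothetical PL/pgSQL port of the T-SQL loop; strip_tags_loop is an illustrative name.
-- STRICT makes the function return NULL for NULL input without entering the loop.
CREATE OR REPLACE FUNCTION strip_tags_loop(html_text text) RETURNS text
LANGUAGE plpgsql IMMUTABLE STRICT
AS $$
DECLARE
    result    text := html_text;
    start_pos int;
    tag_len   int;
BEGIN
    LOOP
        -- Position of the next '<' (0 when none is left).
        start_pos := strpos(result, '<');
        EXIT WHEN start_pos = 0;

        -- Offset of the first '>' at or after that '<', which equals the tag length.
        tag_len := strpos(substr(result, start_pos), '>');
        EXIT WHEN tag_len = 0;

        -- Cut the tag out, the counterpart of STUFF(..., '') in T-SQL.
        result := overlay(result placing '' from start_pos for tag_len);
    END LOOP;

    RETURN btrim(result);
END;
$$;
```

Called as SELECT strip_tags_loop(transfers.text_transfer), this would leave the &nbsp; entity in place, so it is only a partial solution for the sample row.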

1 Answer:

Answer 0: (score: 1)

If you want to do this inside the database, use PL/Perl, PL/Python, or a similar procedural language to do proper HTML stripping.

For example, if you install HTML::Strip from CPAN, or from the libhtml-strip-perl (Debian/Ubuntu) or perl-HTML-Strip (Fedora/RHEL) package:

CREATE LANGUAGE plperlu;

CREATE OR REPLACE FUNCTION striphtml(html text) RETURNS text
LANGUAGE plperlu
AS $$
use strict; use warnings; use 5.10.1;
use HTML::Strip;

my $hs = HTML::Strip->new(decode_entities => 1);
my $stripped = $hs->parse($_[0]);
$hs->eof;
return $stripped;
$$;

Then:

regress=> SELECT striphtml('<div align="right"><font color="3366FF"><b><font size="3">it''s&nbsp;test</font></b></font></div>');
 striphtml 
-----------
 it's test
(1 row)

Alternatively, you can use HTML::Parser to do the job more cleanly.

There are many other options. Pick an existing one and use it.
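
As one illustration of a lighter-weight option, the built-in regexp_replace mentioned in the question may be enough for simple, well-formed markup like the sample row. This is only a sketch under stated assumptions: the function name strip_tags_simple is made up, only the &nbsp; entity is decoded, and a regular expression is not a real HTML parser, so comments, script blocks, or a literal '>' inside an attribute value would confuse it.

```sql
-- Hypothetical helper; strip_tags_simple is an illustrative name, not a built-in.
-- Assumes PostgreSQL 9.2 or later, where SQL functions can refer to parameters by name.
CREATE OR REPLACE FUNCTION strip_tags_simple(html text) RETURNS text
LANGUAGE sql IMMUTABLE STRICT
AS $$
    SELECT btrim(
        replace(
            -- Remove every <...> run; the 'g' flag replaces all matches, not just the first.
            regexp_replace(html, '<[^>]*>', '', 'g'),
            -- Decode the one entity that appears in the sample data.
            '&nbsp;', ' '
        )
    );
$$;

SELECT strip_tags_simple('<div align="right"><font color="3366FF"><b><font size="3">it''s&nbsp;test</font></b></font></div>');
```

For the sample input this should return it's test. In the original query, the last column would then become strip_tags_simple(transfers.text_transfer) AS TEXT.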