SQL: hashtags regular expression to table

Date: 2012-05-01 16:17:58

Tags: sql regex postgresql tokenize

I have this table:

p_id   name   skills
1      Sam    #IT #communication #administration
2      Alex   #French #Trainer

I want a SQL query that outputs this:

  ID     p_fid   skill
   1      1       IT
   2      1       communication
   3      1       administration 
   4      2       French
   5      2       Trainer

I am using PostgreSQL.

Many thanks.

2 answers:

Answer 0: (score: 1)

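If you can use MS SQL Server as your RDBMS, and the skills column contains nothing other than hashtags separated by single spaces, you can convert the skills column to an XML string and then use SQL Server's built-in XML functions to split that string into separate rows.

Here is an approach that works for the sample data you specified in the question.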

create table people_skills
(
    p_id int identity(1, 1) primary key clustered,
    name nvarchar(200),
    skills nvarchar(1000)
)

go

insert into people_skills (name, skills) values ('Sam', '#IT #communication #administration')
insert into people_skills (name, skills) values ('Alex', '#French #Trainer')

go

select
    row_number() over (order by ps.p_id) as ID,
    ps.p_id as p_fid,
    cast(x.skill_node.query('text()') as nvarchar(100)) as skill
from
    (
        select
            *,
            -- Assuming that there are no leading and trailing spaces and that all hashtags are separated by single space.
            (cast('<skills>' + (replace(replace(skills, '#', '<skill>'), ' ', '</skill>')) + '</skill></skills>' as xml)) skills_xml
        from
            people_skills
    ) ps
cross apply
    ps.skills_xml.nodes('/skills/skill') as x(skill_node)

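If the skills column can contain anything besides hashtags and spaces, you will probably need a "smarter" algorithm for converting skills to XML than the one used above. For example, a value containing & or < would make the cast to xml fail.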
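As one possible alternative to the XML conversion, the split can be sketched with STRING_SPLIT. This is an assumption beyond the original answer: STRING_SPLIT only exists in SQL Server 2016 and later (database compatibility level 130 or higher), and it does not guarantee the order of the returned tokens. The sketch reuses the people_skills table defined above and ignores any token that does not start with #.

```sql
-- Assumption: SQL Server 2016+ (compatibility level 130 or higher), where STRING_SPLIT is available.
select
    row_number() over (order by ps.p_id) as ID,
    ps.p_id as p_fid,
    stuff(s.value, 1, 1, '') as skill   -- remove the leading '#'
from
    people_skills ps
cross apply
    string_split(ps.skills, ' ') s
where
    s.value like '#%'                   -- keep hashtag tokens only; skips empty strings and other words
```

Because the order of tokens within one person is not guaranteed here, the ID values for skills that share a p_id may not follow their position in the original string.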

Answer 1: (score: 1)

Something like this:

CREATE TABLE testbed (p_id int4,name varchar(50),skills text);
INSERT INTO testbed VALUES
    (1,'Sam','#IT #communication #administration'),
    (2,'Alex','#French #Trainer');

SELECT row_number() OVER () AS id,
       p_fid, skill
  FROM (SELECT
        p_id AS p_fid,
        regexp_split_to_table(
             regexp_replace(skills, '^#', ''),
             '[ ]+#') AS skill FROM testbed) AS s;

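See the PostgreSQL documentation for window functions, string manipulation functions, and array functions.

If you need to control the order of the skills, a more complex query is required: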
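If the skills column might contain text other than hashtags, a variant that extracts the tags instead of splitting on them may be more forgiving. The following is a minimal sketch, assuming the testbed table defined above and assuming that a hashtag is a # followed by a run of non-whitespace characters; it uses regexp_matches with the 'g' flag, which returns one row per match.

```sql
-- Assumes the testbed table from above.
-- regexp_matches(..., 'g') returns one text[] per match; [1] is the first capture group.
SELECT row_number() OVER () AS id,
       p_fid, skill
  FROM (SELECT p_id AS p_fid,
               (regexp_matches(skills, '#(\S+)', 'g'))[1] AS skill
          FROM testbed) AS s;
```

As with the query above, row_number() OVER () has no ORDER BY, so the numbering follows whatever order the rows happen to be produced in. Rows whose skills value contains no hashtag produce no output rows.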

WITH arrays AS (
    SELECT p_id,
           regexp_split_to_array(regexp_replace(skills, '^#', ''), '[ ]+#') arr
      FROM testbed
), series AS (
    SELECT p_id, generate_series(1, array_upper(arr, 1)) i
      FROM arrays
)
SELECT row_number() OVER (ORDER BY a.p_id, s.i) AS id,
       a.p_id AS p_fid,
       a.arr[s.i] AS skill
  FROM arrays a
  JOIN series s ON a.p_id = s.p_id
 ORDER BY a.p_id, s.i;
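On newer PostgreSQL versions the same ordered result can probably be written more compactly. The sketch below is not part of the original answer: it assumes PostgreSQL 9.4 or later, where a set-returning function in the FROM clause can be followed by WITH ORDINALITY to number its output rows. The alias names s, skill and ord are chosen here for illustration, and btrim is added to guard against stray leading or trailing spaces.

```sql
-- Assumption: PostgreSQL 9.4+ (LATERAL and WITH ORDINALITY), same testbed table as above.
SELECT row_number() OVER (ORDER BY t.p_id, s.ord) AS id,
       t.p_id AS p_fid,
       s.skill
  FROM testbed AS t
 CROSS JOIN LATERAL regexp_split_to_table(
           regexp_replace(btrim(t.skills), '^#', ''),  -- drop surrounding spaces and the first '#'
           '[ ]+#'                                     -- split on "spaces followed by #"
       ) WITH ORDINALITY AS s(skill, ord)
 ORDER BY t.p_id, s.ord;
```

Here ord carries the position of each skill within its source string, so it plays the role of the generate_series index i in the query above.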