Finding the number of occurrences of items across multiple columns

Time: 2015-11-15 22:23:40

Tags: sql postgresql

Suppose a table has the following columns:

source_document_id BIGINT NOT NULL,
target_document_id BIGINT NOT NULL,
similarity DOUBLE NOT NULL,

If I want to determine how often each document appears in the source_document_id column, my query looks like this:

select count(source_document_id) as c1, source_document_id from PROJECT_0622.SIMILARITY_RESULT group by source_document_id order by  c1 desc;

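But a document ID can appear in either the source_document_id or the target_document_id column. How can I find the frequency of document IDs across both columns?

For example, consider the data

source_document_id target_document_id similarity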
1 2 0.8
1 3 0.8
1 4 0.7
4 5 0.8
4 8 0.75
9 4 0.9      
2 4 0.99

I would like to get a result like

frequency in source/target doc id column, doc_id
5 4
3 1
2 2    
1 8    
1 3
1 9
1 5

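Is this possible? Thanks.

3 Answers:

Answer 0: (score: 1)

You can use union all to combine the two columns into a single column, then count on that combined column: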

select doc_id, count(*) as n_docs
from
(
select source_document_id as doc_id from similarity
union all
select target_document_id as doc_id from similarity
) as ids -- alias: PostgreSQL before version 16 requires one on a subquery in FROM
group by doc_id
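
As a quick sanity check, the union all approach can be run against the sample rows from the question. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, the table name similarity follows this answer, and the helper name doc_frequencies and the added order by clause are assumptions made for the example.

```python
import sqlite3

# Sample rows from the question:
# (source_document_id, target_document_id, similarity)
ROWS = [
    (1, 2, 0.8), (1, 3, 0.8), (1, 4, 0.7), (4, 5, 0.8),
    (4, 8, 0.75), (9, 4, 0.9), (2, 4, 0.99),
]

# Stack both ID columns into one column, then count per ID.
# The order by is an addition so the output order is deterministic.
QUERY = """
select doc_id, count(*) as n_docs
from (
    select source_document_id as doc_id from similarity
    union all
    select target_document_id as doc_id from similarity
) as ids
group by doc_id
order by n_docs desc, doc_id
"""


def doc_frequencies(rows):
    """Return (doc_id, n_docs) pairs for the given rows (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")  # in-memory SQLite, not PostgreSQL
    try:
        conn.execute(
            "create table similarity ("
            "source_document_id bigint not null, "
            "target_document_id bigint not null, "
            "similarity double not null)"
        )
        conn.executemany("insert into similarity values (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for doc_id, n_docs in doc_frequencies(ROWS):
        print(n_docs, doc_id)
```

With the sample data, document 4 is counted five times (twice as a source, three times as a target), document 1 three times, and document 2 twice.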

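Answer 1: (score: 1)

First, use a union (union all, so that duplicates are kept) to put source_document_id and target_document_id into a single column. Then simply group by that column.

[The following query works in SQL Server; you may need a few small changes to make it work in PostgreSQL]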

select id, count(*)
from
(
    select source_document_id as id
    from PROJECT_0622.SIMILARITY_RESULT
    union all
    select target_document_id as id
    from PROJECT_0622.SIMILARITY_RESULT
) ids
group by id

Answer 2: (score: 1)

Aggregating the data twice may perform better:

select id, sum(cnt) as n_docs
from ((select source_document_id as id, count(*) as cnt
       from similarity
       group by source_document_id
      ) union all
      (select target_document_id as id, count(*) as cnt
       from similarity
       group by target_document_id
      )
     ) t
group by id;

If performance matters, it is worth trying both approaches, especially if both columns are indexed.
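
Before comparing timings, it may help to confirm that both approaches return the same counts. The sketch below does that on the question's sample rows, again using Python's sqlite3 module as a stand-in for PostgreSQL (an assumption, so it says nothing about PostgreSQL performance or index usage). The helper name run_query and the constant names are hypothetical, and the parentheses around the inner selects are dropped because SQLite does not accept them in a compound select.

```python
import sqlite3

# Sample rows from the question:
# (source_document_id, target_document_id, similarity)
ROWS = [
    (1, 2, 0.8), (1, 3, 0.8), (1, 4, 0.7), (4, 5, 0.8),
    (4, 8, 0.75), (9, 4, 0.9), (2, 4, 0.99),
]

# Approach 1: stack the two columns, then aggregate once.
UNION_THEN_GROUP = """
select id, count(*) as n_docs
from (
    select source_document_id as id from similarity
    union all
    select target_document_id as id from similarity
) as ids
group by id
order by id
"""

# Approach 2: aggregate each column first, then sum the partial counts.
GROUP_THEN_UNION = """
select id, sum(cnt) as n_docs
from (
    select source_document_id as id, count(*) as cnt
    from similarity
    group by source_document_id
    union all
    select target_document_id as id, count(*) as cnt
    from similarity
    group by target_document_id
) as t
group by id
order by id
"""


def run_query(sql, rows):
    """Load rows into an in-memory table and run sql (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "create table similarity ("
            "source_document_id bigint not null, "
            "target_document_id bigint not null, "
            "similarity double not null)"
        )
        conn.executemany("insert into similarity values (?, ?, ?)", rows)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    first = run_query(UNION_THEN_GROUP, ROWS)
    second = run_query(GROUP_THEN_UNION, ROWS)
    print("same result:", first == second)
```

For real timings, running each query under explain analyze in PostgreSQL on the actual table is the more reliable test.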