Question

After some transformations I have the result of a cross join (of tables a and b) on which I want to do some analysis. The table looks like this:

+-----+------+------+------+------+-----+------+------+------+------+
| id  | 10_1 | 10_2 | 11_1 | 11_2 | id  | 10_1 | 10_2 | 11_1 | 11_2 |
+-----+------+------+------+------+-----+------+------+------+------+
| 111 |    1 |    0 |    1 |    0 | 222 |    1 |    0 |    1 |    0 |
| 111 |    1 |    0 |    1 |    0 | 333 |    0 |    0 |    0 |    0 |
| 111 |    1 |    0 |    1 |    0 | 444 |    1 |    0 |    1 |    1 |
| 112 |    0 |    1 |    1 |    0 | 222 |    1 |    0 |    1 |    0 |
+-----+------+------+------+------+-----+------+------+------+------+

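The IDs in the first column are different from the IDs in the sixth column. Each row always holds two different IDs that are matched with each other. The other columns always contain either 0 or 1.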

I am now trying to find out how many values the two IDs have in common on average (meaning both have "1" in 10_1, 10_2, etc.), but I don't really know how to do that.

I tried something like this as a start:

SELECT SUM(CASE WHEN a.10_1 = 1 AND b.10_1 = 1 then 1 end)

But this obviously only counts how often two IDs have 10_1 in common. I could write something similar for the different columns:

SELECT SUM(CASE WHEN (a.10_1 = 1 AND b.10_1 = 1) 
OR (a.10_2 = 1 AND b.10_1 = 1) OR [...] then 1 end)

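to count in general how often two IDs have one thing in common, but this would of course also count pairs that have two or more things in common. In addition, I would like to know how often two IDs have two things in common, three things in common, and so on.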

One "problem" in my case is that I want to look at ~30 columns, so I can hardly write down every possible combination for each case.

Does anyone know how I can approach my problem in a better way? Thanks in advance.

Edit: A possible result could look like this:

+-----------+---------+
| in_common |  count  |
+-----------+---------+
|         0 |     100 |
|         1 |     500 |
|         2 |    1500 |
|         3 |    5000 |
|         4 |    3000 |
+-----------+---------+

Answer 1

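With codes as column names, you will have to write some code that explicitly references each column name. To keep that to a minimum, you could write those references in a single union statement that normalises the data, such as: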

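-- your_table is a placeholder for the table that holds the 0/1 columns
select id, '10_1' as code from your_table where "10_1" = 1
union
select id, '10_2' as code from your_table where "10_2" = 1
union
select id, '11_1' as code from your_table where "11_1" = 1
union
select id, '11_2' as code from your_table where "11_2" = 1;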
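
As a rough sketch of what that union produces, the snippet below builds the same statement from a list of column names and runs it on the sample values from the question. It uses Python's sqlite3 module only so that it is self-contained; the table name wide and the variable names are assumptions made for illustration, not part of the original answer.

```python
import sqlite3

# Hypothetical wide table "wide" holding the question's 0/1 columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    'create table wide (id integer primary key, '
    '"10_1" int, "10_2" int, "11_1" int, "11_2" int)'
)
conn.executemany(
    "insert into wide values (?, ?, ?, ?, ?)",
    [
        (111, 1, 0, 1, 0),
        (112, 0, 1, 1, 0),
        (222, 1, 0, 1, 0),
        (333, 0, 0, 0, 0),
        (444, 1, 0, 1, 1),
    ],
)

codes = ["10_1", "10_2", "11_1", "11_2"]

# One select per column, glued together with "union", as in the answer.
unpivot_sql = "\nunion\n".join(
    f"select id, '{c}' as code from wide where \"{c}\" = 1" for c in codes
)

# Each resulting row is one (id, code) pair for a column that held a 1.
rows = conn.execute(
    f"select id, code from ({unpivot_sql}) order by id, code"
).fetchall()
for row in rows:
    print(row)
```

With ~30 columns only the codes list grows; id 333 yields no rows at all because none of its columns is 1.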

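This union query would need to be modified to include any other columns required to link different IDs. For purposes of illustration, I assume the following data model: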

create table p (
    id integer not null primary key,
    sex character(1) not null,
    age integer not null
    );

create table t1 (
    id integer not null,
    code character varying(4) not null,
    constraint pk_t1 primary key (id, code)
    );

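Although your data evidently does not currently resemble this structure, normalising your data into tables like these would let you apply the following solution to summarise the data in the desired form.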

select
    in_common,
    count(*) as count
from (
    select
        count(*) as in_common
    from (
        select
        a.id as a_id, a.code,
        b.id as b_id, b.code
        from
        (select p.*, t1.code
            from p left join t1 on p.id=t1.id
            ) as a
        inner join (select p.*, t1.code
            from p left join t1 on p.id=t1.id
            ) as b on b.sex <> a.sex and b.age between a.age-10 and a.age+10
        where
        a.id < b.id
        and a.code = b.code
        ) as c
    group by
        a_id, b_id
    ) as summ
group by
    in_common;
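
To see what this query returns, here is a small sketch that loads the assumed p and t1 tables and runs a lightly condensed version of it (the intermediate subquery is folded into the grouping step). It runs on SQLite through Python's sqlite3 module purely for convenience. The sex and age values are invented for illustration, since the question does not contain them; the t1 rows are the 1s from the question's sample table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table p (
    id integer not null primary key,
    sex character(1) not null,
    age integer not null
);
create table t1 (
    id integer not null,
    code character varying(4) not null,
    constraint pk_t1 primary key (id, code)
);
""")

# sex and age are made-up values so that the join condition has something to match.
conn.executemany("insert into p values (?, ?, ?)", [
    (111, "m", 30), (112, "m", 35),
    (222, "f", 32), (333, "f", 28), (444, "f", 38),
])
# One (id, code) row for every 1 in the question's sample table; 333 has none.
conn.executemany("insert into t1 values (?, ?)", [
    (111, "10_1"), (111, "11_1"), (112, "10_2"), (112, "11_1"),
    (222, "10_1"), (222, "11_1"),
    (444, "10_1"), (444, "11_1"), (444, "11_2"),
])

histogram = conn.execute("""
select in_common, count(*) as pairs
from (
    -- one row per pair of ids, counting the codes they share
    select count(*) as in_common
    from (select p.id, p.sex, p.age, t1.code
          from p left join t1 on p.id = t1.id) as a
    inner join
         (select p.id, p.sex, p.age, t1.code
          from p left join t1 on p.id = t1.id) as b
         on b.sex <> a.sex and b.age between a.age - 10 and a.age + 10
    where a.id < b.id and a.code = b.code
    group by a.id, b.id
) as summ
group by in_common
order by in_common
""").fetchall()
print(histogram)
```

On this sample the result has two pairs with one code in common and two pairs with two. Pairs that share nothing, such as any pair involving 333, do not appear, because the a.code = b.code filter removes them before counting; a 0 row like the one in the desired output would need separate handling.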

Answer 2

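The proposed solution first requires taking a step back from the cross-join table, because the identical column names are very annoying. Instead, we take the ids from the two tables and put them in a temporary table. The following query gets the result wanted in the question. It assumes that table_a and table_b from the question are the same table, called tbl, but this assumption is not required: tbl can be replaced by table_a and table_b in the two sub-SELECT queries. It looks complicated and uses a JSON trick to flatten the columns, but it works here: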

WITH idtable AS (
SELECT a.id as id_1, b.id as id_2 FROM
   -- put cross join of table a and table b here
)
SELECT in_common,
       count(*)
FROM
  (SELECT idtable.*,
          sum(CASE
                  WHEN meltedL.value::text = '1' AND meltedR.value::text = '1' THEN 1  -- both ids have 1 in this column
                  ELSE 0
              END) AS in_common
   FROM idtable
   JOIN
     (SELECT tbl.id,
             b.*
      FROM tbl,                         -- change here to table_a
           json_each(row_to_json(tbl)) b          -- and here too
      WHERE KEY<>'id' ) meltedL ON (idtable.id_1 = meltedL.id)
   JOIN
     (SELECT tbl.id,
             b.*
      FROM tbl,                         -- change here to table_b
           json_each(row_to_json(tbl)) b          -- and here too
      WHERE KEY<>'id' ) meltedR ON (idtable.id_2 = meltedR.id
                                    AND meltedL.key = meltedR.key)
   GROUP BY idtable.id_1,
            idtable.id_2) tt
GROUP BY in_common ORDER BY in_common;
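
A different way to avoid typing out ~30 columns, not taken from either answer, is to generate the per-pair expression from the table's column list. The sketch below assumes two tables named table_a and table_b with identical 0/1 columns, filled with the sample values from the question, and uses SQLite through Python's sqlite3 module so that it can be run as is; for 0/1 flags the product of two columns is 1 exactly when both are 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for name in ("table_a", "table_b"):
    conn.execute(
        f'create table {name} (id integer primary key, '
        '"10_1" int, "10_2" int, "11_1" int, "11_2" int)'
    )
conn.executemany(
    "insert into table_a values (?, ?, ?, ?, ?)",
    [(111, 1, 0, 1, 0), (112, 0, 1, 1, 0)],
)
conn.executemany(
    "insert into table_b values (?, ?, ?, ?, ?)",
    [(222, 1, 0, 1, 0), (333, 0, 0, 0, 0), (444, 1, 0, 1, 1)],
)

# Read the flag column names from the table instead of typing them out.
cols = [r[1] for r in conn.execute("pragma table_info(table_a)") if r[1] != "id"]

# a.col * b.col is 1 only when both flags are 1, so the sum is the overlap.
in_common_expr = " + ".join(f'a."{c}" * b."{c}"' for c in cols)

sql = f"""
select in_common, count(*) as pairs
from (
    select {in_common_expr} as in_common
    from table_a as a
    cross join table_b as b
) as per_pair
group by in_common
order by in_common
"""
histogram = conn.execute(sql).fetchall()
print(histogram)
```

Because every pair from the cross join gets a row, pairs with nothing in common are counted under in_common = 0. The column names are interpolated into the SQL text, which is acceptable here only because they come from the table definition itself.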
