Question

I have a time series table (in a Postgres database) with the columns

item_id,  country_id,  year,  month, value

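This table contains duplicated time series: they have the same country_id and the same time series dates/values, but were assigned different item_id values, for example 'Red Apples' and 'Apples, Red'.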

How can I identify these duplicated time series? I want (country_id, year, month, value) to match for all of an item's dates.

I'm a beginner, so please forgive any details I've left out. I'm mainly looking for a conceptual approach; I can implement it in Postgres or in Python/pandas.

So, for example, I'd like to be able to detect something like this:

item_id,     country_id,     year,     month,    value
-------------------------------------------------------
Red Apples   5               1996      1         300
Red Apples   5               1996      2         500
Red Apples   5               1996      3         370
Apples, Red  5               1996      1         300
Apples, Red  5               1996      2         500
Apples, Red  5               1996      3         370

I'd like the output to look like this:

item_id1,     item_id2,      country_id,     year,     month_range
-----------------------------------------------------------------
Red Apples    Apples, Red         5          1996       [1,3]

Something like this would also be fine:

item_id1,     item_id2,      country_id,     year,     time_month,   value
--------------------------------------------------------------------------
Red Apples    Apples, Red         5          1996         1           300
Red Apples    Apples, Red         5          1996         2           500
Red Apples    Apples, Red         5          1996         3           370

I thought about trying something like this:

select distinct A.country_id, A.item_id, B.item_id, A.year, A.month, A.value
                      from my_table as A,
                      my_table as B 
                      where
                      (A.country_id=B.country_id and 
                      A.item_id<>B.item_id and 
                      A.year=B.year and 
                      A.month=B.month and 
                      A.value=B.value )

Then I would check that each identified pair of item_ids shows up for all dates/values. But if possible, I'd like to check all dates/values at once.

I'm not sure whether a table join is the right approach...?

Answer 1

See the update below!

Unless you provide more details about the sample data and the expected results, I think the following query might help:

SELECT country_id,  year,  month, value
  FROM a_table
 GROUP BY country_id,  year,  month, value
HAVING count(*) > 1;

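This query shows every (country_id, year, month, value) combination that occurs more than once, without the item_id. If you want all rows that belong to the duplicated groups, use the following query: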

SELECT item_id, country_id,  year,  month, value
  FROM a_table
 WHERE (country_id,  year,  month, value)
    IN (
    SELECT country_id,  year,  month, value
      FROM a_table
     GROUP BY country_id,  year,  month, value
    HAVING count(*) > 1)
 ORDER BY country_id,  year,  month, value, item_id;

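I've put the item_id column last in the sort order, which should make the duplicates easier to spot. Adjust as needed. This query may take a while, depending on your data.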
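
For anyone who would rather prototype in Python, here is a minimal sketch of the same row-level grouping using only the standard library. The rows are the sample data from the question plus one hypothetical 'Pears' row added for contrast; the variable names are illustrative and not part of the original answer.

```python
from collections import defaultdict

# Sample rows from the question: (item_id, country_id, year, month, value)
rows = [
    ("Red Apples", 5, 1996, 1, 300),
    ("Red Apples", 5, 1996, 2, 500),
    ("Red Apples", 5, 1996, 3, 370),
    ("Apples, Red", 5, 1996, 1, 300),
    ("Apples, Red", 5, 1996, 2, 500),
    ("Apples, Red", 5, 1996, 3, 370),
    ("Pears", 5, 1996, 1, 120),  # hypothetical extra row with no duplicate
]

# Collect the item_ids seen for every (country_id, year, month, value) key.
items_by_key = defaultdict(list)
for item_id, country_id, year, month, value in rows:
    items_by_key[(country_id, year, month, value)].append(item_id)

# Keep only keys shared by more than one row, like HAVING count(*) > 1.
duplicated = {
    key: sorted(items)
    for key, items in items_by_key.items()
    if len(items) > 1
}

for key in sorted(duplicated):
    print(key, duplicated[key])
```

Like the SQL, this only flags individual rows that coincide; it does not yet check that two items agree on every date.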

To avoid this kind of situation (duplicate dates) in the future, you may want to create a unique constraint, like this:

ALTER TABLE a_table ADD CONSTRAINT u_cymv
    UNIQUE (country_id,  year,  month, value);

EDIT: After the comments were added, I came up with the following query to find duplicated item series:

WITH a_table(item_id, country_id, year, month, value) AS (
    VALUES ('Red Apples'::text, 5, 1996, 1, 300::numeric),
           ('Red Apples', 5, 1996, 2, 500),
           ('Red Apples', 5, 1996, 3, 370),
           ('Apples, Red', 5, 1996, 1, 300),
           ('Apples, Red', 5, 1996, 2, 500),
           ('Apples, Red', 5, 1996, 3, 370)
), dups AS (
    SELECT string_agg(item_id, '/') AS items, country_id, value,
           daterange(to_date(year::text||month, 'YYYYMM'),
                     (to_date(year::text||month, 'YYYYMM') + INTERVAL '1mon')::date,
                     '[)') AS range
      FROM a_table
     GROUP BY country_id, year, month, value
    HAVING count(*) > 1
)
SELECT grp, count(*), items, country_id,
       daterange(min(lower(range)), max(upper(range)), '[)') r,
       array_agg(value)
  FROM (
    SELECT items, country_id, range, value,
           sum(g) OVER (ORDER BY country_id, range) grp
      FROM (
        SELECT items, country_id, range, value,
               CASE WHEN lag(range) OVER (PARTITION BY country_id ORDER BY range) -|- range
                    THEN NULL ELSE 1 END g
          FROM dups) s
  ) s
 GROUP BY grp, country_id, items
HAVING count(*) >= 3
 ORDER BY country_id, r, items;

What it does:

a_table is a copy of the provided sample data;

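dups finds the duplicated records. I also convert the year, month columns into a daterange, because I don't see another way to correctly find series that cross the New Year boundary;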

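with the duplicates outlined, I compare the previous range (within country_id) with the current one, and if they are not adjacent (the -|- operator), the group marker g is set;

next, I use the running total effect of the sum() function, applied as a window function, to create the group identifier grp. For the sample data this yields just one group;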

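finally, I group the data by grp. I also added country_id and items to the GROUP BY, but only to avoid having to wrap them in aggregate functions; they will be unique per grp anyway. I also build a new daterange column, because range types have no built-in aggregate function.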
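
The marker-plus-running-total steps above can be sketched in plain Python. This is only an illustration under simplifying assumptions: a single country, a single pair of items, and a hypothetical list of duplicated months that includes a gap and a run crossing New Year. The month_index helper is made up for this sketch and stands in for the daterange adjacency test.

```python
from itertools import accumulate

# Hypothetical duplicated months for one country: (year, month, items).
# In the query, rows like these come out of the dups CTE.
dups = [
    (1996, 1, "Red Apples/Apples, Red"),
    (1996, 2, "Red Apples/Apples, Red"),
    (1996, 3, "Red Apples/Apples, Red"),
    (1996, 11, "Red Apples/Apples, Red"),
    (1996, 12, "Red Apples/Apples, Red"),
    (1997, 1, "Red Apples/Apples, Red"),
    (1997, 2, "Red Apples/Apples, Red"),
]


def month_index(year, month):
    """Months since year 0, so December and the following January differ by 1."""
    return year * 12 + (month - 1)


dups.sort(key=lambda row: month_index(row[0], row[1]))
indexes = [month_index(year, month) for year, month, _items in dups]

# Marker g: 1 when a row does not directly follow the previous one, else 0.
# The query uses NULL instead of 0; sum() skips NULLs, so the effect is the same.
g = [1] + [0 if cur - prev == 1 else 1 for prev, cur in zip(indexes, indexes[1:])]

# The running total of the markers is the group identifier grp.
grp = list(accumulate(g))

# Collapse each group into (first month, last month, number of rows).
runs = {}
for (year, month, _items), group_id in zip(dups, grp):
    first, _last, count = runs.get(group_id, ((year, month), None, 0))
    runs[group_id] = (first, (year, month), count + 1)

for group_id, (first, last, count) in runs.items():
    print(group_id, first, last, count)
```

Here the second run spans November 1996 to February 1997, which is the case the daterange conversion in the query is meant to handle.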

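Before running this query you might need to increase work_mem, up to 1GB I'd say (depending on the number of rows in the real table). Please try it and let me know whether it works for you. It would be great if you could share the EXPLAIN (analyze, buffers) output for it.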

Answer 2

SELECT country_id, year, month, value
FROM my_table
GROUP BY country_id, year, month, value
HAVING count(item_id) > 1;

This is untested!
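
A grouping like this flags single rows that coincide. To compare whole series at once, as the question asks, one possible approach (a sketch, not something proposed in the answers) is to build a fingerprint of every item's complete series and group items by that fingerprint. The 'Green Apples' rows below are hypothetical and only show that a partial overlap is not reported.

```python
from collections import defaultdict

# (item_id, country_id, year, month, value)
rows = [
    ("Red Apples", 5, 1996, 1, 300),
    ("Red Apples", 5, 1996, 2, 500),
    ("Red Apples", 5, 1996, 3, 370),
    ("Apples, Red", 5, 1996, 1, 300),
    ("Apples, Red", 5, 1996, 2, 500),
    ("Apples, Red", 5, 1996, 3, 370),
    ("Green Apples", 5, 1996, 1, 300),  # hypothetical: matches one month only
    ("Green Apples", 5, 1996, 2, 410),
]

# Step 1: gather the full set of (year, month, value) points per item and country.
series = defaultdict(set)
for item_id, country_id, year, month, value in rows:
    series[(country_id, item_id)].add((year, month, value))

# Step 2: group items by country and by the frozen (hashable) set of points.
items_by_series = defaultdict(list)
for (country_id, item_id), points in series.items():
    items_by_series[(country_id, frozenset(points))].append(item_id)

# Step 3: any fingerprint shared by two or more items is a duplicated series.
duplicate_series = [
    (country_id, sorted(items), sorted(points))
    for (country_id, points), items in items_by_series.items()
    if len(items) > 1
]

for country_id, items, points in duplicate_series:
    print(country_id, items, points)
```

Exact set equality is strict: two items that differ in a single month, or where one series is longer than the other, are treated as different series.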

Identifying duplicated time series in Postgres
