Cumulative file count in subfolders

Asked: 2012-11-08 09:19:51

Tags: sql postgresql common-table-expression recursive-query

I have a table containing a list of files, with the columns id_folder, id_parrent_folder and size (the file size):

create table sample_data (
    id_folder bigint ,
    id_parrent_folder bigint,
    size bigint
);

For each folder, starting from a given one, I would like to know how many files it contains, including the files in all of its subfolders. Given the sample data posted below, I expect the following result:

id_folder     files
100623           35
100624           14

Sample data:

insert into sample_data values (100623,58091,60928);
insert into sample_data values (100623,58091,59904);
insert into sample_data values (100623,58091,54784);
insert into sample_data values (100623,58091,65024);
insert into sample_data values (100623,58091,25600);
insert into sample_data values (100623,58091,31744);
insert into sample_data values (100623,58091,27648);
insert into sample_data values (100623,58091,39424);
insert into sample_data values (100623,58091,30720);
insert into sample_data values (100623,58091,71168);
insert into sample_data values (100623,58091,68608);
insert into sample_data values (100623,58091,34304);
insert into sample_data values (100623,58091,46592);
insert into sample_data values (100623,58091,35328);
insert into sample_data values (100623,58091,29184);
insert into sample_data values (100623,58091,38912);
insert into sample_data values (100623,58091,38400);
insert into sample_data values (100623,58091,49152);
insert into sample_data values (100623,58091,14444);
insert into sample_data values (100623,58091,33792);
insert into sample_data values (100623,58091,14789);
insert into sample_data values (100624,100623,16873);
insert into sample_data values (100624,100623,32768);
insert into sample_data values (100624,100623,104920);
insert into sample_data values (100624,100623,105648);
insert into sample_data values (100624,100623,31744);
insert into sample_data values (100624,100623,16431);
insert into sample_data values (100624,100623,46592);
insert into sample_data values (100624,100623,28160);
insert into sample_data values (100624,100623,58650);
insert into sample_data values (100624,100623,162);
insert into sample_data values (100624,100623,162);
insert into sample_data values (100624,100623,162);
insert into sample_data values (100624,100623,162);
insert into sample_data values (100624,100623,162);

I have tried the example from the postgresql docs, but it (obviously) does not work this way. Any help is appreciated.

EDIT

I have tried the following query:

WITH RECURSIVE included_files(id_folder, parrent_folder, dist_last_change) AS (
SELECT 
    id_folder, 
    id_parrent_folder, 
    size
FROM 
    sample_data p 
WHERE 
    id_folder = 100623
UNION ALL
SELECT 
    p.id_folder, 
    p.id_parrent_folder, 
    p.size
FROM 
    included_files if, 
    sample_data p
WHERE 
    p.id_parrent_folder = if.id_folder
)
select * from included_files

This does not work, because every child folder has many parent rows (one row per file), so the number of rows returned for the subfolders multiplies.
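
For example, replacing the final "select * from included_files" with a per-folder count (just a diagnostic against the sample data above) makes the multiplication visible:

select id_folder, count(*) as rows_returned
from included_files
group by id_folder;

-- 100623 comes back 21 times (its own rows), but 100624 comes back
-- 21 * 14 = 294 times instead of 14, because each of its 14 rows is
-- joined to every one of the 21 rows of its parent folder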

2 answers:

Answer 0 (score: 2)

With your sample data this returns what you want. I am not 100% sure it covers all possible anomalies in your tree, though:

with recursive folder_sizes as (
   -- number of files stored directly in each folder
   select id_folder, id_parrent_folder, count(*) as num_files
   from sample_data
   group by id_folder, id_parrent_folder
),
folder_tree as (

   -- start with the direct children of folder 100623 ...
   select id_folder, id_parrent_folder, num_files as total_files
   from folder_sizes
   where id_parrent_folder = 100623

   union all

   -- ... then walk up to the parent, adding the parent's own files to the running total
   select c.id_folder, c.id_parrent_folder, c.num_files + p.total_files as total_files
   from folder_sizes c
     join folder_tree p on p.id_parrent_folder = c.id_folder

)
select id_folder, id_parrent_folder, total_files
from folder_tree;
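
Against the sample data this returns the numbers asked for in the question (row order may differ, since the statement has no ORDER BY):

 id_folder | id_parrent_folder | total_files
-----------+-------------------+-------------
    100624 |            100623 |          14
    100623 |             58091 |          35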

Here is a SQLFiddle demo: http://sqlfiddle.com/#!12/bb942/2

This only covers a single-level hierarchy (because of the id_parrent_folder = 100623 condition). To cover an arbitrary number of levels, the only thing I can think of is a two-step approach: first collect all subfolders, then walk that tree again to calculate the total number of files.

Something like this:

with recursive folder_sizes as (
   -- number of files stored directly in each folder
   select id_folder, id_parrent_folder, count(*) as num_files
   from sample_data
   group by id_folder, id_parrent_folder
),
folder_tree_down as (
   -- walk down: for every folder, collect its descendants, tagged with that folder as root_folder
   select id_folder, id_parrent_folder, num_files, id_folder as root_folder, 1 as level
   from folder_sizes

   union all

   select c.id_folder, c.id_parrent_folder, c.num_files, p.root_folder, p.level + 1 as level
   from folder_sizes c
     join folder_tree_down p on p.id_folder = c.id_parrent_folder
),
folder_tree_up as (

   -- walk back up from the requested root folder, accumulating the file counts
   select id_folder, id_parrent_folder, num_files as total_files, level
   from folder_tree_down
   where root_folder = 100623

   union all

   select c.id_folder, c.id_parrent_folder, c.num_files + p.total_files as total_files, p.level
   from folder_tree_down c
     join folder_tree_up p on p.id_parrent_folder = c.id_folder

)
select id_folder, id_parrent_folder, total_files
from folder_tree_up
where level > 1;

It produces the same output as the first statement, but I think it should work with an unlimited number of levels.

Answer 1 (score: 1)

Nice question to think about, I upvoted it!

I think there are two cases to consider:

  1. multi-level paths, and
  2. nodes with multiple children.

So far I have come up with the following query:

    WITH RECURSIVE tree AS (
        SELECT id_folder id, array[id_folder] arr
          FROM sample_data sd
         WHERE NOT EXISTS (SELECT 1 FROM sample_data s
                            WHERE s.id_parrent_folder=sd.id_folder)
        UNION ALL
        SELECT sd.id_folder,t.arr||sd.id_folder
          FROM tree t
          JOIN sample_data sd ON sd.id_folder IN (
            SELECT id_parrent_folder FROM sample_data WHERE id_folder=t.id))
    ,ids AS (SELECT DISTINCT id, unnest(arr) ua FROM tree)
    ,agg AS (SELECT id_folder id,count(*) cnt FROM sample_data GROUP BY 1)
    SELECT ids.id, sum(agg.cnt)
      FROM ids JOIN agg ON ids.ua=agg.id
     GROUP BY 1
     ORDER BY 1;
    

    I have added the following rows to sample_data:

    INSERT INTO sample_data VALUES (100625,100623,123);
    INSERT INTO sample_data VALUES (100625,100623,456);
    INSERT INTO sample_data VALUES (100625,100623,789);
    INSERT INTO sample_data VALUES (100626,100625,1);
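
    With these extra rows the query above should come back with cumulative counts like these (each folder counts its own files plus everything underneath it):

     id     | sum
    --------+-----
     100623 |  39
     100624 |  14
     100625 |   4
     100626 |   1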
    

    This query is not optimal, and it slows down as the number of rows grows.


    Full-scale test

    To model the original situation, I wrote a small python script that scans a filesystem and stores the result in the database (hence the delay in answering, I am not that good at python scripting yet).

    The following tables were created:

    CREATE TABLE fs_file(file_id bigserial, name text, type char(1), level int4);
    CREATE TABLE fs_tree(file_id int8, parent_id int8, size int8);
    

    Scanning the whole filesystem of my MBP took 7.5 minutes and produced 870k entries in the fs_tree table, which is quite close to the original task. After the upload I ran the following:

    CREATE INDEX i_fs_tree_1 ON fs_tree(file_id);
    CREATE INDEX i_fs_tree_2 ON fs_tree(parent_id);
    VACUUM ANALYZE fs_file;
    VACUUM ANALYZE fs_tree;
    

    I tried running my first query against this data and had to kill it after approximately 1 hour. The improved version takes 2 minutes (on my MBP) to do the job for the whole filesystem. Here it comes:

    WITH RECURSIVE descent AS (
        SELECT fs.file_id grp, fs.file_id, fs.size, 1 k, 0 AS lvl
          FROM fs_tree fs
         WHERE fs.parent_id = (SELECT file_id FROM fs_file WHERE name = '/')
        UNION ALL
        SELECT DISTINCT CASE WHEN k.k=0 THEN d.grp ELSE fs.file_id END AS grp,
               fs.file_id, fs.size, k.k, d.lvl+1
          FROM descent d
          JOIN fs_tree fs ON d.file_id=fs.parent_id
          CROSS JOIN generate_series(0,1) k(k))
    /* the query */
    SELECT grp, file_id, size, k, lvl
      FROM descent
     ORDER BY 1,2,3;
    

    The query uses my table names, but it should not be difficult to adapt. It builds a group for every file_id found in fs_tree. To get the output you are looking for, you can do the following:

    SELECT grp AS file_id, count(*), sum(size)
      FROM descent GROUP BY 1;
    

    A few notes:

    1. The query will only work if there are no duplicates. I consider this a correct assumption, as it is not possible to have two entries with the same name within a single directory;
    2. The query does not care about the depth of the tree or the number of siblings, although both do have an impact on performance;
    3. This was a good exercise for me, as similar functionality is also needed for a task-planning system I am currently working on;
    4. While thinking about the task, I assumed that a single entry can have several parents (but not the other way round); the query still works in that case;
    5. This problem can also be solved in other ways, like walking the tree in ascending order, or using pre-calculated values to avoid the final grouping step, but that is a bit more than a simple question, so I leave it as an exercise for you.

      Recommendation

      To make this query work for your data, you should prepare it by aggregating first:

      WITH RECURSIVE
      fs_tree AS (
          SELECT id_folder file_id, id_parrent_folder parent_id,
                 sum(size) AS size, count(*) AS cnt
            FROM sample_data GROUP BY 1,2)
      ,descent AS (
          SELECT fs.file_id grp, fs.file_id, fs.size, fs.cnt, 1 k, 0 AS lvl
            FROM fs_tree fs
           WHERE fs.parent_id = 58091
          UNION ALL
          SELECT DISTINCT CASE WHEN k.k=0 THEN d.grp ELSE fs.file_id END AS grp,
                 fs.file_id, fs.size, fs.cnt, k.k, d.lvl+1
            FROM descent d
            JOIN fs_tree fs ON d.file_id=fs.parent_id
            CROSS JOIN generate_series(0,1) k(k))
      /* the query */
      SELECT grp file_id, sum(size) size, sum(cnt) cnt
        FROM descent
       GROUP BY 1
       ORDER BY 1,2,3;
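
      Against the sample data from the question, this should return the cumulative counts asked for, plus the cumulative sizes:

       file_id |  size   | cnt
      ---------+---------+-----
        100623 | 1313045 |  35
        100624 |  442596 |  14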
      

      To speed things up, you could implement materialized views and pre-calculate some of the figures.
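
      As a rough sketch of that idea (assuming PostgreSQL 9.3 or later, where CREATE MATERIALIZED VIEW is available; on older versions a plain table refreshed by a scheduled job serves the same purpose), the per-folder aggregation could be materialized once and reused by the recursive part:

      -- hypothetical pre-aggregated view (the name folder_totals is just an example);
      -- the recursive "descent" CTE could read from it instead of re-aggregating
      -- sample_data on every run
      CREATE MATERIALIZED VIEW folder_totals AS
          SELECT id_folder AS file_id, id_parrent_folder AS parent_id,
                 sum(size) AS size, count(*) AS cnt
            FROM sample_data
           GROUP BY 1, 2;

      -- refresh whenever the underlying data changes
      REFRESH MATERIALIZED VIEW folder_totals;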


      Sample data

      Here is a small dump that shows how the data looks in my tables:

      INSERT INTO fs_file VALUES (1, '/Users/viy/prj/logs', 'D', 0),
          (2, 'jobs', 'D', 1),
          (3, 'pg_csv_load', 'F', 2),
          (4, 'pg_logs', 'F', 2),
          (5, 'logs.sql', 'F', 1),
          (6, 'logs.sql~', 'F', 1),
          (7, 'pgfouine-1.2.tar.gz', 'F', 1),
          (8, 'u.sql', 'F', 1),
          (9, 'u.sql~', 'F', 1);
      
      INSERT INTO fs_tree VALUES (1, NULL, 0),
          (2, 1, 0),
          (3, 2, 936),
          (4, 2, 706),
          (5, 1, 4261),
          (6, 1, 4261),
          (7, 1, 793004),
          (8, 1, 491),
          (9, 1, 491);
      

      Note that I have slightly updated the create statements.

      Here is the script I used to scan the filesystem:

      #!/usr/bin/python
      
      import os
      import psycopg2
      import sys
      from stat import *
      
      def walk_tree(full, parent, level, call_back):
          '''recursively descend the directory tree rooted at full,
             calling the callback function for every entry found'''
      
          if not os.access(full, os.R_OK):
              return
      
          for f in os.listdir(full):
              path = os.path.join(full, f)
              if os.path.islink(path):
                  # It's a link, register and continue
                  e = entry(f, "L", level)
                  call_back(parent, e, 0)
                  continue
      
              mode = os.stat(path).st_mode
              if S_ISDIR(mode):
                  e = entry(f, "D", level)
                  call_back(parent, e, 0)
                  # It's a directory, recurse into it
                  try:
                      walk_tree(path, e, level+1, call_back)
                  except OSError:
                      pass
      
              elif S_ISREG(mode):
                  # It's a file, call the callback function
                  call_back(parent, entry(f, "F", level), os.stat(path).st_size)
              else:
                  # It's unknown, just register
                  e = entry(f, "U", level)
                  call_back(parent, e, 0)
      
      def register(parent, entry, size):
          db_cur.execute("INSERT INTO fs_tree VALUES (%s,%s,%s)",
                         (entry, parent, size))
      
      def entry(name, type, level):
          db_cur.execute("""INSERT INTO fs_file(name,type, level)
                         VALUES (%s, %s, %s) RETURNING file_id""",
                         (name, type, level))
          return db_cur.fetchone()[0]
      
      db_con=psycopg2.connect("dbname=postgres")
      db_cur=db_con.cursor()
      
      if len(sys.argv) != 2:
          raise SyntaxError("Root directory expected!")
      
      if not S_ISDIR(os.stat(sys.argv[1]).st_mode):
          raise SyntaxError("A directory is wanted!")
      
      e=entry(sys.argv[1], "D", 0)
      register(None, e, 0)
      walk_tree(sys.argv[1], e, 1, register)
      
      db_con.commit()
      
      db_cur.close()
      db_con.close()
      

      This script works with Python 3.2 and is based on an example from the official python documentation.

      Hope this clarifies things for you.