I have a table that contains a list of files. It has the columns id_folder, id_parrent_folder and size (the file size):
create table sample_data (
id_folder bigint ,
id_parrent_folder bigint,
size bigint
);
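Starting from a given folder, I would like to know how many files each folder holds in total, counting the folder itself and all of its subfolders. Given the sample data posted below, I expect the following result: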
id_folder files
100623 35
100624 14
Sample data:
insert into sample_data values (100623,58091,60928);
insert into sample_data values (100623,58091,59904);
insert into sample_data values (100623,58091,54784);
insert into sample_data values (100623,58091,65024);
insert into sample_data values (100623,58091,25600);
insert into sample_data values (100623,58091,31744);
insert into sample_data values (100623,58091,27648);
insert into sample_data values (100623,58091,39424);
insert into sample_data values (100623,58091,30720);
insert into sample_data values (100623,58091,71168);
insert into sample_data values (100623,58091,68608);
insert into sample_data values (100623,58091,34304);
insert into sample_data values (100623,58091,46592);
insert into sample_data values (100623,58091,35328);
insert into sample_data values (100623,58091,29184);
insert into sample_data values (100623,58091,38912);
insert into sample_data values (100623,58091,38400);
insert into sample_data values (100623,58091,49152);
insert into sample_data values (100623,58091,14444);
insert into sample_data values (100623,58091,33792);
insert into sample_data values (100623,58091,14789);
insert into sample_data values (100624,100623,16873);
insert into sample_data values (100624,100623,32768);
insert into sample_data values (100624,100623,104920);
insert into sample_data values (100624,100623,105648);
insert into sample_data values (100624,100623,31744);
insert into sample_data values (100624,100623,16431);
insert into sample_data values (100624,100623,46592);
insert into sample_data values (100624,100623,28160);
insert into sample_data values (100624,100623,58650);
insert into sample_data values (100624,100623,162);
insert into sample_data values (100624,100623,162);
insert into sample_data values (100624,100623,162);
insert into sample_data values (100624,100623,162);
insert into sample_data values (100624,100623,162);
I tried using the example from the PostgreSQL docs, but it (obviously) does not work this way. Any help is appreciated.
Edit:
I have tried the following query:
WITH RECURSIVE included_files(id_folder, parrent_folder, dist_last_change) AS (
SELECT
id_folder,
id_parrent_folder,
size
FROM
sample_data p
WHERE
id_folder = 100623
UNION ALL
SELECT
p.id_folder,
p.id_parrent_folder,
p.size
FROM
included_files if,
sample_data p
WHERE
p.id_parrent_folder = if.id_folder
)
select * from included_files
This does not work: every child row matches many parent rows, so the number of rows returned for the subfolders gets multiplied.
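To see where the extra rows come from, the join in the recursive step can be imitated in a few lines of Python. This is only a sketch: the sizes are left out because only the row counts matter here, and the names level0 and level1 are made up for illustration.

```python
# Rows of sample_data as (id_folder, id_parrent_folder); sizes omitted.
sample_data = [(100623, 58091)] * 21 + [(100624, 100623)] * 14

# Non-recursive term: every file row that belongs to the start folder.
level0 = [row for row in sample_data if row[0] == 100623]

# Recursive term: each row found so far is joined with every row
# whose parent folder is that row's folder.
level1 = [child
          for found in level0
          for child in sample_data
          if child[1] == found[0]]

print(len(level0))  # 21
print(len(level1))  # 294 = 21 * 14, although the subfolder holds only 14 files
```

Collapsing the table to one row per folder before recursing, as the answers below do, removes this multiplication.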
Answer 0 (score: 2)
With your sample data, this returns what you want. I am not 100% sure that it covers all possible anomalies in your tree, though:
with recursive folder_sizes as (
   -- one row per folder with the number of files stored directly in it
   select id_folder, id_parrent_folder, count(*) as num_files
   from sample_data
   group by id_folder, id_parrent_folder
),
folder_tree as (
   -- start with the direct children of folder 100623
   select id_folder, id_parrent_folder, num_files as total_files
   from folder_sizes
   where id_parrent_folder = 100623
   union all
   -- walk up to the parent and add the child's running total
   select c.id_folder, c.id_parrent_folder, c.num_files + p.total_files as total_files
   from folder_sizes c
     join folder_tree p on p.id_parrent_folder = c.id_folder
)
select id_folder, id_parrent_folder, total_files
from folder_tree;
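Here is a SQLFiddle demo: http://sqlfiddle.com/#!12/bb942/2
This only covers a single-level hierarchy (because of the condition that pins the parent folder to 100623). To cover any number of levels, I can only think of a two-step approach: first collect all subfolders, then walk back up that tree to calculate the total number of files.
Something like this: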
with recursive folder_sizes as (
   -- one row per folder with the number of files stored directly in it
   select id_folder, id_parrent_folder, count(*) as num_files
   from sample_data
   group by id_folder, id_parrent_folder
),
folder_tree_down as (
   -- step 1: walk down from every folder, remembering where the walk started
   select id_folder, id_parrent_folder, num_files, id_folder as root_folder, 1 as level
   from folder_sizes
   union all
   select c.id_folder, c.id_parrent_folder, c.num_files, p.root_folder, p.level + 1 as level
   from folder_sizes c
     join folder_tree_down p on p.id_folder = c.id_parrent_folder
),
folder_tree_up as (
   -- step 2: walk back up inside the tree rooted at 100623 and add up the counts
   select id_folder, id_parrent_folder, num_files as total_files, level
   from folder_tree_down
   where root_folder = 100623
   union all
   select c.id_folder, c.id_parrent_folder, c.num_files + p.total_files as total_files, p.level
   from folder_tree_down c
     join folder_tree_up p on p.id_parrent_folder = c.id_folder
)
select id_folder, id_parrent_folder, total_files
from folder_tree_up
where level > 1;
It produces the same output as the first statement, but I think it should work with an unlimited number of levels.
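The two-step idea (collect the subfolders going down, then add the counts going up) can also be checked outside the database. The following Python sketch is not a translation of the SQL; it is a hypothetical helper, total_files, that assumes the data forms a tree and that every folder in the subtree has at least one file row, since a folder without rows does not appear in the table at all.

```python
from collections import Counter

# Rows of sample_data as (id_folder, id_parrent_folder); sizes omitted.
sample_data = [(100623, 58091)] * 21 + [(100624, 100623)] * 14

def total_files(rows, root):
    """Return {folder: files in the folder and all of its subfolders}."""
    files_in = Counter(folder for folder, _ in rows)   # one count per folder
    parent_of = {folder: parent for folder, parent in rows}

    # Step 1: walk down from the root and collect every folder below it.
    subtree, frontier = {root}, {root}
    while frontier:
        frontier = {f for f, p in parent_of.items() if p in frontier} - subtree
        subtree |= frontier

    # Step 2: walk up from each folder and add its files to every ancestor
    # that is still inside the collected subtree.
    totals = Counter()
    for folder in subtree:
        node = folder
        while node in subtree:
            totals[node] += files_in[folder]
            node = parent_of.get(node)
    return dict(totals)

print(sorted(total_files(sample_data, 100623).items()))
# [(100623, 35), (100624, 14)]
```

Because the counts are aggregated per folder before any walking happens, the number of file rows in a folder no longer influences how many intermediate rows are produced.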
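Answer 1 (score: 1)
A nice question to think about, I upvoted it!
As I see it, there are 2 cases to consider:
So far, I have come up with the following query: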
WITH RECURSIVE tree AS (
SELECT id_folder id, array[id_folder] arr
FROM sample_data sd
WHERE NOT EXISTS (SELECT 1 FROM sample_data s
WHERE s.id_parrent_folder=sd.id_folder)
UNION ALL
SELECT sd.id_folder,t.arr||sd.id_folder
FROM tree t
JOIN sample_data sd ON sd.id_folder IN (
SELECT id_parrent_folder FROM sample_data WHERE id_folder=t.id))
,ids AS (SELECT DISTINCT id, unnest(arr) ua FROM tree)
,agg AS (SELECT id_folder id,count(*) cnt FROM sample_data GROUP BY 1)
SELECT ids.id, sum(agg.cnt)
FROM ids JOIN agg ON ids.ua=agg.id
GROUP BY 1
ORDER BY 1;
I have added the following rows to sample_data:
INSERT INTO sample_data VALUES (100625,100623,123);
INSERT INTO sample_data VALUES (100625,100623,456);
INSERT INTO sample_data VALUES (100625,100623,789);
INSERT INTO sample_data VALUES (100626,100625,1);
This query is not optimal and slows down as the number of rows grows.
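The idea behind the query (build the path from every leaf folder up to the top, then pair each folder with everything below it on such a path) can be sketched in Python. The function name totals_via_paths is made up, the data is assumed to be a tree, and the sketch works on one entry per folder, whereas the SQL above joins the raw file rows, which is likely a large part of why it slows down.

```python
from collections import Counter

# Original sample rows plus the four extra rows, as (id_folder, id_parrent_folder).
sample_data = ([(100623, 58091)] * 21 + [(100624, 100623)] * 14
               + [(100625, 100623)] * 3 + [(100626, 100625)])

def totals_via_paths(rows):
    files_in = Counter(folder for folder, _ in rows)
    parent_of = {folder: parent for folder, parent in rows}
    # Leaf folders are never the parent of another folder.
    leaves = set(parent_of) - set(parent_of.values())

    pairs = set()  # (folder, folder on a path at or below it)
    for leaf in leaves:
        path, node = [], leaf
        while node in parent_of:
            path.append(node)
            pairs.update((node, below) for below in path)
            node = parent_of[node]

    totals = Counter()
    for top, below in pairs:
        totals[top] += files_in[below]
    return dict(totals)

print(sorted(totals_via_paths(sample_data).items()))
# [(100623, 39), (100624, 14), (100625, 4), (100626, 1)]
```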
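Full-scale test
To simulate the original situation, I wrote a small Python script that scans the file system and stores the result in the database (hence the delay, I am not good at Python scripting yet).
The following tables were created: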
CREATE TABLE fs_file(file_id bigserial, name text, type char(1), level int4);
CREATE TABLE fs_tree(file_id int8, parent_id int8, size int8);
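Scanning the whole file system of my MBP took 7.5 minutes and left me with 870k entries in the fs_tree table, which is quite similar to the original task. After the upload, I ran the following: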
CREATE INDEX i_fs_tree_1 ON fs_tree(file_id);
CREATE INDEX i_fs_tree_2 ON fs_tree(parent_id);
VACUUM ANALYZE fs_file;
VACUUM ANALYZE fs_tree;
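I tried to run my first query on this data and had to kill it after approx. 1 hour. The improved one takes around 2 minutes (on my MBP) to do the job on the whole file system. Here it comes: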
WITH RECURSIVE descent AS (
SELECT fs.file_id grp, fs.file_id, fs.size, 1 k, 0 AS lvl
FROM fs_tree fs
WHERE fs.parent_id = (SELECT file_id FROM fs_file WHERE name = '/')
UNION ALL
SELECT DISTINCT CASE WHEN k.k=0 THEN d.grp ELSE fs.file_id END AS grp,
fs.file_id, fs.size, k.k, d.lvl+1
FROM descent d
JOIN fs_tree fs ON d.file_id=fs.parent_id
CROSS JOIN generate_series(0,1) k(k))
/* the query */
SELECT grp, file_id, size, k, lvl
FROM descent
ORDER BY 1,2,3;
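The query uses my table names, but it should not be hard to change them. It builds a set of groups, one for each file_id found in fs_tree. To get the desired output, you can do something like: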
SELECT grp AS file_id, count(*), sum(size)
FROM descent GROUP BY 1;
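The generate_series(0,1) trick emits every child twice: once under the group it inherits from its parent (k = 0) and once as the start of its own group (k = 1). A rough Python imitation, using the rows of the small fs_tree dump shown later in this answer, may make that easier to follow. The function names are made up, and the sketch drops the k and lvl columns and assumes the data is a tree.

```python
from collections import defaultdict

# (file_id, parent_id, size), copied from the fs_tree dump in this answer.
fs_tree = [(1, None, 0), (2, 1, 0), (3, 2, 936), (4, 2, 706), (5, 1, 4261),
           (6, 1, 4261), (7, 1, 793004), (8, 1, 491), (9, 1, 491)]

def descent(rows, root_id):
    """Return a set of (grp, file_id, size) rows, like the descent CTE."""
    children = defaultdict(list)
    for file_id, parent_id, size in rows:
        children[parent_id].append((file_id, size))

    # Non-recursive term: direct children of the root, each its own group.
    frontier = {(fid, fid, size) for fid, size in children[root_id]}
    found = set()
    while frontier:
        found |= frontier
        next_frontier = set()
        for grp, file_id, _size in frontier:
            for child_id, child_size in children[file_id]:
                next_frontier.add((grp, child_id, child_size))       # k = 0
                next_frontier.add((child_id, child_id, child_size))  # k = 1
        frontier = next_frontier
    return found

def summarize(found):
    """Group by grp and return {grp: (count, total size)}."""
    totals = defaultdict(lambda: [0, 0])
    for grp, _file_id, size in found:
        totals[grp][0] += 1
        totals[grp][1] += size
    return {grp: tuple(v) for grp, v in totals.items()}

summary = summarize(descent(fs_tree, 1))
print(summary[2])  # (3, 1642): the directory itself plus its two files
```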
Some notes:
Recommendations
To make this query work efficiently, you should prepare the data by aggregating it first:
WITH RECURSIVE
fs_tree AS (
SELECT id_folder file_id, id_parrent_folder parent_id,
sum(size) AS size, count(*) AS cnt
FROM sample_data GROUP BY 1,2)
,descent AS (
SELECT fs.file_id grp, fs.file_id, fs.size, fs.cnt, 1 k, 0 AS lvl
FROM fs_tree fs
WHERE fs.parent_id = 58091
UNION ALL
SELECT DISTINCT CASE WHEN k.k=0 THEN d.grp ELSE fs.file_id END AS grp,
fs.file_id, fs.size, fs.cnt, k.k, d.lvl+1
FROM descent d
JOIN fs_tree fs ON d.file_id=fs.parent_id
CROSS JOIN generate_series(0,1) k(k))
/* the query */
SELECT grp file_id, sum(size) size, sum(cnt) cnt
FROM descent
GROUP BY 1
ORDER BY 1,2,3;
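To speed things up further, you could implement Materialized Views and pre-calculate some of the metrics.
Sample data
Here is a small dump that shows what the data in the tables looks like: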
INSERT INTO fs_file VALUES (1, '/Users/viy/prj/logs', 'D', 0),
(2, 'jobs', 'D', 1),
(3, 'pg_csv_load', 'F', 2),
(4, 'pg_logs', 'F', 2),
(5, 'logs.sql', 'F', 1),
(6, 'logs.sql~', 'F', 1),
(7, 'pgfouine-1.2.tar.gz', 'F', 1),
(8, 'u.sql', 'F', 1),
(9, 'u.sql~', 'F', 1);
INSERT INTO fs_tree VALUES (1, NULL, 0),
(2, 1, 0),
(3, 2, 936),
(4, 2, 706),
(5, 1, 4261),
(6, 1, 4261),
(7, 1, 793004),
(8, 1, 491),
(9, 1, 491);
Note that I have slightly updated the create statements.
And this is the script I used to scan the file system:
#!/usr/bin/python
import os
import psycopg2
import sys
from stat import *

def walk_tree(full, parent, level, call_back):
    '''recursively descend the directory tree rooted at top,
    calling the callback function for each regular file'''
    if not os.access(full, os.R_OK):
        return
    for f in os.listdir(full):
        path = os.path.join(full, f)
        if os.path.islink(path):
            # It's a link, register and continue
            e = entry(f, "L", level)
            call_back(parent, e, 0)
            continue
        mode = os.stat(path).st_mode
        if S_ISDIR(mode):
            e = entry(f, "D", level)
            call_back(parent, e, 0)
            # It's a directory, recurse into it
            try:
                walk_tree(path, e, level+1, call_back)
            except OSError:
                pass
        elif S_ISREG(mode):
            # It's a file, call the callback function
            call_back(parent, entry(f, "F", level), os.stat(path).st_size)
        else:
            # It's unknown, just register
            e = entry(f, "U", level)
            call_back(parent, e, 0)

def register(parent, entry, size):
    db_cur.execute("INSERT INTO fs_tree VALUES (%s,%s,%s)",
                   (entry, parent, size))

def entry(name, type, level):
    db_cur.execute("""INSERT INTO fs_file(name,type, level)
                      VALUES (%s, %s, %s) RETURNING file_id""",
                   (name, type, level))
    return db_cur.fetchone()[0]

db_con=psycopg2.connect("dbname=postgres")
db_cur=db_con.cursor()

if len(sys.argv) != 2:
    raise SyntaxError("Root directory expected!")
if not S_ISDIR(os.stat(sys.argv[1]).st_mode):
    raise SyntaxError("A directory is wanted!")

e=entry(sys.argv[1], "D", 0)
register(None, e, 0)
walk_tree(sys.argv[1], e, 1, register)

db_con.commit()
db_cur.close()
db_con.close()
This script works with Python 3.2 and is based on the example from the official Python documentation.
Hope this clears things up for you.