Question

I have a large dictionary in the following format:

dict["randomKey"]=[dict1,dict2,int,string]

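There may be thousands of keys, and dict1 itself has about 100 keys.

The problem: I need to store this dictionary on a server and have it read by multiple machines. What is the best format for this?

I'm currently using shelve, which is very easy to use. However, I need to get all keys of the main dictionary (dict) for which a certain key in dict1 or dict2 has a specific value. This already takes a while, and I'm afraid that when the dictionary gets bigger, say 50k keys, it will take far too long. I've read about sqlite3, and it seems like a good option, but I don't know whether it fits my needs.

I don't really need the database to be accessible to programs other than Python ones (although that would be nice), but I need it to be fast, stable, and readable by many machines at the same time. Thanks!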
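
The lookup described above amounts to a linear scan over every entry. A minimal sketch, assuming the shelf behaves like a plain mapping; `find_keys` and the sample data are hypothetical names used only for illustration:

```python
def has_value(inner, inner_key, wanted):
    """True if the inner dictionary maps inner_key to exactly `wanted`."""
    return inner_key in inner and inner[inner_key] == wanted


def find_keys(store, inner_key, wanted):
    """Return the outer keys whose dict1 or dict2 maps inner_key to wanted.

    `store` is any mapping shaped like dict["randomKey"] = [dict1, dict2, int, string];
    an open shelve.Shelf can be iterated the same way as a plain dict.
    """
    matches = []
    for outer_key, (dict1, dict2, _number, _text) in store.items():
        # Every entry is loaded and inspected, so the cost grows with the number of keys.
        if has_value(dict1, inner_key, wanted) or has_value(dict2, inner_key, wanted):
            matches.append(outer_key)
    return matches


# Hypothetical sample data in the same shape as the real dictionary.
sample = {
    "a": [{"k6": "v134"}, {"k1": "x"}, 1, "one"],
    "b": [{"k6": "v7"}, {"k6": "v134"}, 2, "two"],
    "c": [{"k6": "v7"}, {"k2": "y"}, 3, "three"],
}
print(find_keys(sample, "k6", "v134"))
```

With a shelf, each iteration also unpickles the whole entry from disk, which is probably why the scan slows down as the number of keys grows.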

Answer 1

I'd choose a database with native JSON support, one that can search JSON dictionaries efficiently. I like PostgreSQL:

A table for your data:

create table dict (
  key text primary key,
  dict1 jsonb not null default '{}',
  dict2 jsonb not null default '{}',
  intval integer not null,
  strval text not null
);
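
To move the existing Python dictionary into a table like this, each entry has to be flattened into one row. A minimal sketch, assuming the dictionary has the shape from the question; `to_rows` and `INSERT_SQL` are hypothetical names, and the `%s` placeholders assume the psycopg2 driver:

```python
import json


def to_rows(big_dict):
    """Yield one (key, dict1_json, dict2_json, intval, strval) tuple per entry."""
    for key, (dict1, dict2, intval, strval) in big_dict.items():
        # The inner dictionaries are serialized to JSON text for the jsonb columns.
        yield (key, json.dumps(dict1), json.dumps(dict2), intval, strval)


# Placeholder style (%s) is the one psycopg2 uses; other drivers may differ.
INSERT_SQL = (
    "insert into dict (key, dict1, dict2, intval, strval) "
    "values (%s, %s::jsonb, %s::jsonb, %s, %s)"
)

# Hypothetical sample entry in the question's format.
sample = {"randomKey": [{"k1": "v1"}, {"k2": "v2"}, 7, "seven"]}
rows = list(to_rows(sample))
print(rows)
# With an open psycopg2 connection `conn` (not created here), the rows could be loaded with:
#     with conn.cursor() as cur:
#         cur.executemany(INSERT_SQL, rows)
#     conn.commit()
```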

Fill it with some sample values:

insert into dict
select
  i::text,
  (select
    jsonb_object(
      array_agg('k'||v::text),
      array_agg('v'||(v+i)::text)
    ) from generate_series(1,1000) as v
  ),
  (select
    jsonb_object(
      array_agg('k'||v::text),
      array_agg('v'||(v+i)::text)
    ) from generate_series(1,1000) as v
  ),
  i,
  i::text
from generate_series(1,10000) as i;

Get the keys of the rows where dict1 has the value v134 under key k6:

select key from dict where dict1 @> '{"k6":"v134"}';
 key 
-----
 128
(1 row)

Time: 232.843 ms
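
From a Python client, the same containment lookup can be parameterized instead of hard-coding the JSON literal. A minimal sketch, assuming psycopg2-style `%s` placeholders; `containment_query` and `SEARCHABLE_COLUMNS` are hypothetical names, not part of any library:

```python
import json

# Column names cannot be passed as query parameters, so only these
# known names may be interpolated into the SQL text.
SEARCHABLE_COLUMNS = ("dict1", "dict2")


def containment_query(column, inner_key, value):
    """Build (sql, params) selecting rows whose <column> contains {inner_key: value}."""
    if column not in SEARCHABLE_COLUMNS:
        raise ValueError(f"unknown column: {column!r}")
    sql = f"select key from dict where {column} @> %s::jsonb"
    return sql, (json.dumps({inner_key: value}),)


sql, params = containment_query("dict1", "k6", "v134")
print(sql)
print(params)
# With an open psycopg2 cursor `cur` (not created here):
#     cur.execute(sql, params)
#     keys = [row[0] for row in cur.fetchall()]
```

To search both dictionaries, as the question asks, the query can be run once per column and the resulting keys merged.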

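If your table is very large, you can also index the dictionary columns for faster searching. These indexes can be larger than the table itself, though, and the query planner may decide not to use them anyway: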

create index dict_dict1_idx on dict using gin(dict1);
create index dict_dict2_idx on dict using gin(dict2);

If you know the index helps, you can steer the planner toward it by disabling sequential scans for the session:

set enable_seqscan=off;
select key from dict where dict1 @> '{"k6":"v134"}';
 key 
-----
 128
(1 row)

Time: 8.955 ms
