Question

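For analysis across the entities users, user_profiles and user_custom_profiles I need an inner join, which results in one big entity of about 500 columns. The relationship between these tables is 1 to 1.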

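So I turned users into a flattened table that uses SET USING for about 350 columns, taking the data from the other two tables. I didn't use DEFAULT because all of these tables are updated daily, so the SET USING columns need to be refreshed every day. The CREATE statement for the users table looks like this: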

CREATE TABLE public.users
(
    user_id varchar(100) NOT NULL,
    tenant_id int NOT NULL,
    user_domain varchar(100) not null,
    import_file_id int DEFAULT NULL::int,
    target_id int DEFAULT NULL::int,
    customer_id varchar(100) DEFAULT NULL,
    loyalty_id varchar(100) DEFAULT NULL,
    [...]
    -- columns from user_profiles table
    customer_base varchar(100) SET USING (
      select customer_base 
      from user_profiles 
      where users.tenant_id = user_profiles.tenant_id 
        and users.user_id = user_profiles.user_id 
        and users.user_domain = user_profiles.user_domain
    ),
    purchases int SET USING (
      select purchases 
      from user_profiles 
      where users.tenant_id = user_profiles.tenant_id 
      and users.user_id = user_profiles.user_id 
      and users.user_domain = user_profiles.user_domain
    ),
    customer_type INT SET USING (
      select customer_type 
      from user_profiles 
      where users.tenant_id = user_profiles.tenant_id 
      and users.user_id = user_profiles.user_id 
      and users.user_domain = user_profiles.user_domain
    ),
    [...]
    -- columns from user_custom_profiles table
    ucp_custom_11 VARCHAR(100) SET USING (
      select custom_11 
      from user_custom_profiles 
      where users.tenant_id = user_custom_profiles.tenant_id 
      and users.user_id = user_custom_profiles.user_id 
      and users.user_domain = user_custom_profiles.user_domain
    ),
    ucp_custom_12 VARCHAR(100) SET USING (
      select custom_12 from user_custom_profiles 
      where users.tenant_id = user_custom_profiles.tenant_id 
      and users.user_id = user_custom_profiles.user_id 
      and users.user_domain = user_custom_profiles.user_domain
    ),
    ucp_custom_13 VARCHAR(100) SET USING (
      select custom_13 from user_custom_profiles 
      where users.tenant_id = user_custom_profiles.tenant_id 
      and users.user_id = user_custom_profiles.user_id 
      and users.user_domain = user_custom_profiles.user_domain
    ),
    [...]
);

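Up to this point everything works. The problem is that when I run SELECT REFRESH_COLUMNS('users_7', '', 'REBUILD'); to refresh all columns, the function seems to need a lot of memory and fails with the following error: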

SQL Error [3815] [53200]: [Vertica][VJDBC](3815) ROLLBACK: 
Join inner did not fit in memory [(public.users_super x public.user_custom_profiles) 
using previous join and subquery (PATH ID: 2)]

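I have tested this with only a few columns to refresh, and it runs fine. But I would like to do it in a simpler way. I don't know what Vertica does behind the scenes, but it seems to be trying to load the result of the join between users, user_profiles and user_custom_profiles into memory. I have already created projections for the joins between users and user_profiles and between users and user_custom_profiles.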

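What really bothers me is that those tables don't hold much data. I used the query provided here: table-size to find the compressed size of those tables, and they are not that big.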

users: 0.4 GB (2.3 million rows)
user_profiles: 0.2 GB (2.2 million rows)
user_custom_profiles: 0.01 GB (2.2 million rows)

I'm using Vertica CE 9.1 on a single node with 6 cores and 60 GB of RAM.

Is there a way to improve this so that it doesn't use so much memory?

Answer 1

Your join columns are always:

user_id varchar(100) NOT NULL,
tenant_id int NOT NULL,
user_domain varchar(100) not null,

With this type of join, you have to expect that all join columns will be materialized.

I expect a hash join for each of the 350 columns to be refreshed. Try to EXPLAIN a SELECT with one of the joins, and post the plan here...
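A minimal sketch of such an EXPLAIN, rewriting one of the SET USING lookups from the question as a plain join. The choice of custom_11 is arbitrary, and the aliases u and ucp are only for illustration:

```sql
-- Show the plan for one of the 350 lookups, written as an ordinary inner join.
-- The plan output should indicate whether Vertica chooses a hash join or a
-- merge join for this three-column key.
EXPLAIN
SELECT u.user_id,
       u.tenant_id,
       u.user_domain,
       ucp.custom_11
FROM public.users u
JOIN public.user_custom_profiles ucp
  ON  u.tenant_id   = ucp.tenant_id
  AND u.user_id     = ucp.user_id
  AND u.user_domain = ucp.user_domain;
```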

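Even though a VARCHAR can contain anything from zero bytes up to its maximum declared length, Vertica does not know in advance how long each VARCHAR value will be. So it will allocate a hash table for each of the 350 necessary joins, using the maximum possible length for every row to be joined.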

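That will be: 350 joins * (100 bytes for user_id + 8 bytes for tenant_id + 100 bytes for user_domain) * 2.2 million rows.

If I do the maths right, that amounts to 160.16 GB of memory. That's almost three times the RAM you have available.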
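The estimate can be reproduced with plain arithmetic. This is only a back-of-the-envelope sketch: it counts 1 GB as 10^9 bytes and considers nothing but the join key widths, so it is not a measurement of what Vertica actually allocates. The second column shows the same calculation for a single 8-byte integer key, for comparison:

```sql
SELECT
    -- 350 joins * 208 bytes per key * 2.2 million rows: about 160.16
    350 * (100 + 8 + 100) * 2200000 / 1000000000.0 AS gb_with_wide_keys,
    -- 350 joins * 8 bytes per key * 2.2 million rows: about 6.16
    350 * 8 * 2200000 / 1000000000.0               AS gb_with_integer_key;
```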

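My suggestions:

Avoid tables with hundreds of columns wherever you can.
If you join the tables often (and 350 derived columns counts as often enough), redesign your model to allow equi-joins on integers. Either use HASH(user_id,tenant_id,user_domain) to get a surrogate integer key (the risk of hash collisions is low enough to do that), or create a helper table as I show below and get the surrogate key from it into each of the 3 tables. Then you can join using an equi-join on INTEGERs. Each entry in the 350 hash tables will need 8 bytes instead of 208.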

Here is the design and population of your helper table:

CREATE TABLE helper(
  surrkey     IDENTITY
, user_id     VARCHAR(100) -- does it really have to be that big?
, tenant_id   INT
, user_domain VARCHAR(100)
)
ORDER BY user_id,tenant_id,user_domain,surrkey
SEGMENTED BY HASH(surrkey) ALL NODES;

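-- surrkey is an IDENTITY column and is generated automatically,
-- so only the three natural key columns are listed and inserted.
INSERT /*+DIRECT */ INTO helper (user_id, tenant_id, user_domain)
SELECT DISTINCT user_id,tenant_id,user_domain FROM users;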
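One way the surrogate key could then be used is sketched below. The surrkey columns on users and user_profiles and the column name purchases_by_key are hypothetical and not part of the original schema; the same steps would be repeated for user_custom_profiles. Treat this as an outline to adapt and check against your Vertica version, not as a tested migration:

```sql
-- Hypothetical surrogate key columns on the tables to be joined.
ALTER TABLE public.users         ADD COLUMN surrkey INT;
ALTER TABLE public.user_profiles ADD COLUMN surrkey INT;

-- Copy the surrogate key from the helper table, matching once on the
-- wide three-column natural key.
UPDATE public.users u
SET surrkey = h.surrkey
FROM helper h
WHERE u.user_id     = h.user_id
  AND u.tenant_id   = h.tenant_id
  AND u.user_domain = h.user_domain;

UPDATE public.user_profiles up
SET surrkey = h.surrkey
FROM helper h
WHERE up.user_id     = h.user_id
  AND up.tenant_id   = h.tenant_id
  AND up.user_domain = h.user_domain;

COMMIT;

-- A flattened column can now look up its value through the 8-byte key
-- instead of the 208-byte three-column key.
ALTER TABLE public.users ADD COLUMN purchases_by_key INT
  SET USING (
    SELECT purchases
    FROM public.user_profiles
    WHERE user_profiles.surrkey = users.surrkey
  );
```

Because the helper table is populated from users only, rows in the other tables that have no matching user would keep a NULL surrkey; with the 1 to 1 relationship described in the question that should not occur.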

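In a nutshell, and independently of the DBMS: if there is any way to avoid it, don't JOIN or GROUP BY on columns that take more than a dozen or so bytes once materialized. In your case, in the Vertica flattened table context, that cost is multiplied by 350.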

Vertica REFRESH_COLUMNS fails with "Join inner did not fit in memory" error
