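For streaming inserts, I would like to use template tables (with a user ID suffix) where the template table is itself a partitioned table. That way the tables are smaller than they would be with partitioning alone, which makes queries more cost effective. It also keeps my per-user query cost constant no matter how many users are in my system. According to the documentation at {{3}}:
To create smaller sets of data by date, use time-partitioned tables. To create smaller tables that are not date based, use template tables and BigQuery will create the tables for you.
That sounds as if a table can be either a time-partitioned table or a template table. Can't it be both? If not, is there another architecture I should consider?
Another question related to the architecture proposed above concerns the 4000 limit that I see at https://cloud.google.com/bigquery/streaming-data-into-bigquery. Does it mean that my partitioned table cannot cover more than 4000 days? In that case, do I have to delete old partitions, or will the last partition keep storing any subsequent streaming data?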
Answer 0 (score: 2)
You should look into Clustered Tables on partitioned tables.
With that you can have ONE table with all users in it, partitioned by time and clustered by user_id, the same value you would otherwise use as the template table suffix.
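As a rough sketch of what such a table could look like, the snippet below builds a BigQuery standard SQL DDL statement for a single table partitioned by day and clustered by user_id. The table name `my_dataset.events`, the columns `event_ts` and `payload`, and the helper `build_create_table_ddl` are hypothetical and only serve the illustration; the statement is printed, not sent to BigQuery.

```python
def build_create_table_ddl(table, schema, partition_column, cluster_columns):
    """Build a CREATE TABLE statement for a day-partitioned, clustered table.

    `schema` maps column names to BigQuery types. The partition column is
    assumed to be a TIMESTAMP, so it is wrapped in DATE() for daily partitions.
    """
    for column in [partition_column, *cluster_columns]:
        if column not in schema:
            raise ValueError(f"unknown column: {column}")
    column_defs = ",\n  ".join(f"{name} {type_}" for name, type_ in schema.items())
    return (
        f"CREATE TABLE `{table}` (\n  {column_defs}\n)\n"
        f"PARTITION BY DATE({partition_column})\n"
        f"CLUSTER BY {', '.join(cluster_columns)}"
    )


# Hypothetical table and columns, chosen for illustration only.
ddl = build_create_table_ddl(
    "my_dataset.events",
    {"user_id": "STRING", "event_ts": "TIMESTAMP", "payload": "STRING"},
    partition_column="event_ts",
    cluster_columns=["user_id"],
)
print(ddl)
```

The generated statement can be pasted into the BigQuery console or passed to whichever client you already use; adapt the schema to your own streaming rows.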
Introduction to Clustered Tables
When you create a clustered table in BigQuery, the table data is automatically organized based on the contents of one or more columns in the table’s schema. The columns you specify are used to colocate related data. When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.
Clustering can improve the performance of certain types of queries such as queries that use filter clauses and queries that aggregate data. When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns. These values are used to organize the data into multiple blocks in BigQuery storage. When you submit a query containing a clause that filters data based on the clustering columns, BigQuery uses the sorted blocks to eliminate scans of unnecessary data.
Similarly, when you submit a query that aggregates data based on the values in the clustering columns, performance is improved because the sorted blocks colocate rows with similar values.
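To illustrate the kind of query the quoted passage describes, here is a minimal sketch that builds a per-user query filtering on both the partitioning column and the clustering column. The table `my_dataset.events`, the columns `event_ts` and `user_id`, and the helper `build_user_day_query` are assumptions made for the example; `@day` and `@user_id` are named query parameters to be supplied when the query is run.

```python
def build_user_day_query(table, partition_column, cluster_column):
    """Build a parameterized query for one user on one day.

    The date filter targets the partitioning column and the equality filter
    targets the clustering column, which is the shape of query that the
    clustering documentation says can skip unnecessary blocks.
    """
    return (
        f"SELECT *\n"
        f"FROM `{table}`\n"
        f"WHERE DATE({partition_column}) = @day\n"
        f"  AND {cluster_column} = @user_id"
    )


# Hypothetical table and column names, for illustration only.
query = build_user_day_query("my_dataset.events", "event_ts", "user_id")
print(query)
```

How much data is actually skipped depends on the table and its contents, so it is worth checking the bytes processed that BigQuery reports for a real query.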
Clustered table pricing
When you create and use clustered tables in BigQuery, your charges are based on how much data is stored in the tables and on the queries you run against the data. Clustered tables help you to reduce query costs by pruning data so it is not processed by the query.