Question

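I need to know which database model is better for performance.

First database model

Three tables: features, products, and feature_values.

The features table: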

+----+-----------+
| id | name      |
+----+-----------+
|  1 | Brand     |
|  2 | Color     |
|  3 | Dimension |
|  4 | Model     |
+----+-----------+

The feature_values table:

+----+---------+------------+
| id | name    | feature_id |
+----+---------+------------+
|  1 | Sony    |          1 |
|  2 | Samsung |          1 |
|  3 | Red     |          2 |
|  4 | Blue    |          2 |
|  5 | 20"     |          3 |
|  6 | 30"     |          3 |
|  7 | Model A |          4 |
|  8 | Model B |          4 |
+----+---------+------------+

And the products table:

+----+--------------------+----------+
| id | product_name       | features |
+----+--------------------+----------+
|  1 | Sony Television    | 1-3-5-7  |
|  2 | Samsung Television | 2-4-6-8  |
+----+--------------------+----------+

As you can see, with this structure, if a user wants to search for products by feature, I have to use REGEXP or full-text search in the query.
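To illustrate what that search looks like, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL so it can run on its own, and it swaps REGEXP for a delimiter-wrapped LIKE, which does the same job for this data. The function name products_with_feature_value is made up for the example.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, product_name TEXT, features TEXT)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(1, "Sony Television", "1-3-5-7"), (2, "Samsung Television", "2-4-6-8")],
)


def products_with_feature_value(conn, value_id):
    """Hypothetical helper: find products whose features string contains value_id."""
    # Wrap both the column and the id in '-' so that id 3 cannot match 13 or 30.
    # SQLite concatenates with ||; in MySQL this would be written with CONCAT().
    pattern = f"%-{int(value_id)}-%"
    rows = conn.execute(
        "SELECT product_name FROM products "
        "WHERE '-' || features || '-' LIKE ? ORDER BY id",
        (pattern,),
    )
    return [name for (name,) in rows]


matches = products_with_feature_value(conn, 3)  # 3 is the id of "Red"
```

Because the pattern starts with a wildcard, an ordinary B-tree index on features cannot narrow the search, so every row has to be examined.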

Second database model

In the second database model, I remove the features column from the products table and add a new table named product_features.

+----+--------------------+
| id | product_name       |
+----+--------------------+
|  1 | Sony Television    |
|  2 | Samsung Television |
+----+--------------------+

The new product_features table:

+----+------------+------------+
| id | feature_id | product_id |
+----+------------+------------+
|  1 |          1 |          1 |
|  2 |          3 |          1 |
|  3 |          5 |          1 |
|  4 |          7 |          1 |
|  5 |          2 |          2 |
|  6 |          4 |          2 |
|  7 |          6 |          2 |
|  8 |          8 |          2 |
+----+------------+------------+

Now, if a user wants to search for products by feature, I have to search product_features and then join products.
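A rough sketch of that search, again using Python's sqlite3 module as a stand-in for MySQL so it runs standalone. The helper products_with_all is a hypothetical name, and the "has every requested value" logic (GROUP BY with HAVING COUNT) is one common way to write it, not the only one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for MySQL (assumption)
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE product_features (
    id INTEGER PRIMARY KEY,
    feature_id INTEGER NOT NULL,  -- holds a feature value id, as in the table above
    product_id INTEGER NOT NULL REFERENCES products(id)
);
INSERT INTO products VALUES (1, 'Sony Television'), (2, 'Samsung Television');
INSERT INTO product_features (feature_id, product_id) VALUES
    (1, 1), (3, 1), (5, 1), (7, 1),
    (2, 2), (4, 2), (6, 2), (8, 2);
""")


def products_with_all(conn, value_ids):
    """Hypothetical helper: names of products that have every id in value_ids."""
    value_ids = sorted(set(value_ids))
    if not value_ids:
        return []
    placeholders = ", ".join("?" for _ in value_ids)
    sql = f"""
        SELECT p.product_name
        FROM product_features AS pf
        JOIN products AS p ON p.id = pf.product_id
        WHERE pf.feature_id IN ({placeholders})
        GROUP BY p.id, p.product_name
        HAVING COUNT(DISTINCT pf.feature_id) = ?
        ORDER BY p.id
    """
    return [name for (name,) in conn.execute(sql, (*value_ids, len(value_ids)))]


sony_red = products_with_all(conn, [1, 3])     # Sony (1) and Red (3)
samsung_red = products_with_all(conn, [2, 3])  # no product has both
```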

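Problem

My concern is this: if I use the second model and have more than 2 billion rows in the products table, and assume each product has at least 10 features, then the product_features table will have more than 20 billion rows. Queries by feature might be slow.

If I use the first model, then when a user searches by feature I have to run a full-text search or REGEXP over 2 billion rows.

I don't know which way is better. What do you suggest?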

Answer 1

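First model

It is not even in 1NF, because the features attribute holds non-atomic values. Also, adding, updating, or deleting a feature in the products table would be very difficult. So it is not feasible at all.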
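To make the non-atomic point concrete, here is a small sketch of what it takes to turn one features string into atomic rows, which is exactly the shape of the product_features table in the question. The function name explode_features is invented for this example.

```python
def explode_features(product_id, features):
    """Hypothetical helper: split a '1-3-5-7' style string into
    (feature_id, product_id) rows, one feature value per row."""
    return [(int(value_id), product_id) for value_id in features.split("-") if value_id]


rows = explode_features(1, "1-3-5-7")
```

With the string form, removing a single feature means parsing and rewriting the whole column; with one row per value it is a plain single-row DELETE.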

Second model

It is normalized to 5NF and looks fine. To optimize searches, use subqueries and put indexes on product_id and feature_id. Try to avoid JOINs on data this large.
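One way the subquery-plus-index suggestion could look is sketched below, with Python's sqlite3 module standing in for MySQL. The index name idx_pf_feature_product, the choice of a single composite index on (feature_id, product_id), and the helper name are assumptions of this sketch, not something stated in the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for MySQL (assumption)
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE product_features (
    id INTEGER PRIMARY KEY,
    feature_id INTEGER NOT NULL,
    product_id INTEGER NOT NULL
);
-- Composite index: look rows up by feature_id and read product_id
-- straight from the index entry.
CREATE INDEX idx_pf_feature_product ON product_features (feature_id, product_id);
INSERT INTO products VALUES (1, 'Sony Television'), (2, 'Samsung Television');
INSERT INTO product_features (feature_id, product_id) VALUES
    (1, 1), (3, 1), (5, 1), (7, 1),
    (2, 2), (4, 2), (6, 2), (8, 2);
""")


def products_with_feature(conn, value_id):
    """Hypothetical helper: products having one feature value, via an IN subquery."""
    sql = """
        SELECT product_name
        FROM products
        WHERE id IN (SELECT product_id FROM product_features WHERE feature_id = ?)
        ORDER BY id
    """
    return [name for (name,) in conn.execute(sql, (value_id,))]


blue = products_with_feature(conn, 4)  # 4 is the id of "Blue"
```

Whether the subquery form actually beats a JOIN depends on the optimizer, so it is worth checking the plan with EXPLAIN on real data before committing to either.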

Answer 2

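As Rockse said, you should stick with the second model. With large data sets, once you grow past the size that a single database instance can comfortably handle, you need to start scaling "horizontally" (across several instances). A common way to scale such very large data sets is called "sharding": split the data set into subsets and store them on different database servers. Then come up with an algorithm that tells your application which database to go to for information about a given product.

For example, let's split the data set into 4 chunks of roughly 5 billion rows each. Then use "product_id % 4" (that is, modulo 4) as the "key" that tells you which database instance holds the information about that particular product. Very rough pseudocode might look like this: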

// Pseudocode: one connection (or pool) per shard.
connections = []

function initConnections() {
  // ... connect to 4 different databases or create pools ...
  connections = [conn1, conn2, conn3, conn4];
}

// Pick the shard that holds this product: product_id modulo 4.
function getProductDbConnection(productId) {
  return connections[productId % 4];
}

function getProductFeatures(productId) {
  conn = getProductDbConnection(productId);
  // ... run whatever queries you need to get features ...
}
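For anyone who wants to try the routing idea, here is a runnable sketch of the pseudocode above. It is only an illustration: four in-memory SQLite databases stand in for four separate database servers, and all function names are made up for the example.

```python
import sqlite3

NUM_SHARDS = 4


def init_connections(num_shards=NUM_SHARDS):
    """Open one connection per shard (in-memory SQLite stands in for real servers)."""
    conns = []
    for _ in range(num_shards):
        conn = sqlite3.connect(":memory:")
        conn.execute(
            "CREATE TABLE product_features (feature_id INTEGER, product_id INTEGER)"
        )
        conns.append(conn)
    return conns


def shard_for(product_id, num_shards=NUM_SHARDS):
    """The shard key from the answer: product_id modulo the number of shards."""
    return product_id % num_shards


def add_feature(conns, product_id, feature_id):
    conn = conns[shard_for(product_id, len(conns))]
    conn.execute("INSERT INTO product_features VALUES (?, ?)", (feature_id, product_id))


def get_product_features(conns, product_id):
    """Only the single shard that owns this product is queried."""
    conn = conns[shard_for(product_id, len(conns))]
    rows = conn.execute(
        "SELECT feature_id FROM product_features WHERE product_id = ? ORDER BY feature_id",
        (product_id,),
    )
    return [fid for (fid,) in rows]


def products_with_feature(conns, feature_id):
    """The shard key is product_id, so a search by feature has to ask every shard."""
    found = []
    for conn in conns:
        rows = conn.execute(
            "SELECT product_id FROM product_features WHERE feature_id = ?", (feature_id,)
        )
        found.extend(pid for (pid,) in rows)
    return sorted(found)


conns = init_connections()
for pid, fids in {1: [1, 3, 5, 7], 2: [2, 4, 6, 8]}.items():
    for fid in fids:
        add_feature(conns, pid, fid)
```

Note the trade-off this makes visible: lookups by product touch one shard, while searches by feature fan out to all of them and merge the results in the application.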

This article discusses how Instagram sharded its data to meet demand: http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram

E-commerce database structure for product features in MySQL (InnoDB)
