Question

TL;DR:

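1. Partitioning the database by primary key.
2. Index size problem.
3. The database grows by 1-3 GB per day.
4. RAID setup.
5. Do you have any experience with Hypertable?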

Long version:

I just built/bought a home server:

Xeon E3-1245 3.4 GHz HT
32 GB RAM
6x 1.5 TB WD Caviar Black 7200 rpm

I will use the onboard RAID of the Intel S1200BTL server board (no money left for a RAID controller). http://ark.intel.com/products/53557/Intel-Server-Board-S1200BTL

The board has 4 SATA 3 Gb/s ports and 2 SATA 6 Gb/s ports.

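I'm not sure yet whether I can put all 6 HDDs into a single RAID 10.

If that isn't possible, I'm thinking of 4 HDDs in RAID 10 (MySQL DB) and 2 HDDs in RAID 0 (OS / MySQL indexes).

(If the RAID 0 breaks, that's fine with me; I only need to protect the DB.)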
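To make the trade-off between the two layouts concrete, here is a small arithmetic sketch of the usable capacity in each case. It assumes six identical 1.5 TB disks as listed above and ignores filesystem and controller overhead; the function names are only illustrative.

```python
# Usable capacity for the two layouts discussed above.
# Assumption: 6 identical 1.5 TB disks, no filesystem/controller overhead.

DISK_TB = 1.5


def raid10_capacity(disks: int, disk_tb: float = DISK_TB) -> float:
    """RAID 10 mirrors pairs of disks, so half of the raw space is usable."""
    if disks < 4 or disks % 2:
        raise ValueError("RAID 10 needs an even number of disks, at least 4")
    return disks * disk_tb / 2


def raid0_capacity(disks: int, disk_tb: float = DISK_TB) -> float:
    """RAID 0 only stripes: all raw space is usable, nothing is redundant."""
    if disks < 2:
        raise ValueError("RAID 0 needs at least 2 disks")
    return disks * disk_tb


# Layout A: all six disks in one RAID 10.
layout_a = {"raid10_all": raid10_capacity(6)}
# Layout B: four disks in RAID 10 for the DB, two in RAID 0 for OS/indexes.
layout_b = {"raid10_db": raid10_capacity(4), "raid0_os_index": raid0_capacity(2)}

print(layout_a)
print(layout_b)
```

Layout B offers more raw space in total, but only the RAID 10 part survives a disk failure.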

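About the database:

It is a web crawler database that stores domains, URLs, links and similar data. So I thought I would partition the database by the primary key of each table: (1-1000000), (1000001-2000000), and so on.

When I run search/insert/select queries against the database, I need to scan the whole table, because one thing could be in row 1 and another in row 1000000000000.

If I partition like this by primary key (auto_increment), will all CPU cores be used, so that the partitions are scanned in parallel? Or should I stick with one huge database without partitions?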
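The range scheme described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the partition size of 1,000,000 rows is taken from the example ranges, and the generated statement uses MySQL's `PARTITION BY RANGE` syntax without having been run against the real schema.

```python
ROWS_PER_PARTITION = 1_000_000  # taken from the (1-1000000) example above


def partition_for_id(row_id: int, rows_per_partition: int = ROWS_PER_PARTITION) -> int:
    """Return the 0-based partition index for an auto_increment id (ids start at 1)."""
    if row_id < 1:
        raise ValueError("auto_increment ids start at 1")
    return (row_id - 1) // rows_per_partition


def range_partition_ddl(table: str, column: str, partitions: int,
                        rows_per_partition: int = ROWS_PER_PARTITION) -> str:
    """Build an ALTER TABLE ... PARTITION BY RANGE statement as a string."""
    parts = [
        # Partition i holds ids up to (i + 1) * rows_per_partition inclusive.
        f"  PARTITION p{i} VALUES LESS THAN ({(i + 1) * rows_per_partition + 1})"
        for i in range(partitions)
    ]
    # Catch-all partition for ids beyond the last explicit boundary.
    parts.append("  PARTITION pmax VALUES LESS THAN MAXVALUE")
    return (f"ALTER TABLE {table}\nPARTITION BY RANGE ({column}) (\n"
            + ",\n".join(parts) + "\n);")


print(range_partition_ddl("extract", "extract_id", partitions=3))
print(partition_for_id(1_000_000), partition_for_id(1_000_001))
```

A lookup by `extract_id` only needs the one partition that `partition_for_id` points to, while a lookup by any other column still has to visit every partition. Note also that MySQL requires every unique key on a partitioned table to include the partitioning column, so the unique `link` index listed further down would have to be reworked before such a statement is accepted.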

The database will be very large; on my home system right now it looks like this:

Table extract:  25,034,072 Rows
Data    2,058.7     MiB
Index   2,682.8     MiB
Total   4,741.5     MiB

Table Structure:
Field        Type                  Null  Key  Default  Extra
extract_id   bigint(20) unsigned   NO    PRI  NULL     auto_increment
url_id       bigint(20)            NO    MUL  NULL
extern_link  varchar(2083)         NO    MUL  NULL
anchor_text  varchar(500)          NO         NULL
http_status  smallint(2) unsigned  NO         0

Indexes:
Keyname     Type   Unique  Packed  Column             Cardinality
PRIMARY     BTREE  Yes     No      extract_id         25034072
link        BTREE  Yes     No      url_id
                                   extern_link (400)  25034072
externlink  BTREE  No      No      extern_link (400)  1788148


Table urls: 21,889,542 Rows
Data    2,402.3     MiB
Index   3,456.2     MiB
Total   5,858.4     MiB

Table Structure:
Field         Type                  Null  Key  Default  Extra
url_id        bigint(20)            NO    PRI  NULL     auto_increment
domain_id     bigint(20)            NO    MUL  NULL
url           varchar(2083)         NO         NULL
added         date                  NO         NULL
last_crawl    date                  NO         NULL
extracted     tinyint(2) unsigned   NO    MUL  0
extern_links  smallint(5) unsigned  NO         0
crawl_status  tinyint(11) unsigned  NO         0
status        smallint(2) unsigned  NO         0


Indexes:
Keyname           Type   Unique  Packed  Column     Cardinality
PRIMARY           BTREE  Yes     No      url_id     21889542
domain_id         BTREE  Yes     No      domain_id  0
                                         url (330)  21889542
extracted_status  BTREE  No      No      extracted  2
                                         status     31

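I can see that I could clean up the externlink and link indexes; I only added externlink because I need to query that field and could not use the link index for it. Do you see anything I could tune in the indexes? My new system will have 32 GB of RAM, but if the DB keeps growing at this rate, I will be using 90% of the RAM within a few weeks or months.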
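The "few weeks or months" estimate can be checked with a rough calculation. This is a back-of-the-envelope sketch, not a measurement: it assumes the indexes keep their current share of table size, that the whole daily growth lands in these two tables (pessimistic, since the page source table is the largest), and it treats the quoted GB as GiB.

```python
GIB_IN_MIB = 1024

# Figures copied from the table statistics above (MiB).
index_mib = {"extract": 2682.8, "urls": 3456.2}
total_mib = {"extract": 4741.5, "urls": 5858.4}

ram_mib = 32 * GIB_IN_MIB
budget_mib = 0.9 * ram_mib  # the "90% of RAM" mark

current_index = sum(index_mib.values())
# Assumption: indexes keep the same share of table size they have today.
index_share = current_index / sum(total_mib.values())


def days_until_budget(growth_gib_per_day: float) -> float:
    """Days until the two indexes alone reach the RAM budget.

    Assumption: all of the daily growth goes into these two tables.
    """
    index_growth = growth_gib_per_day * GIB_IN_MIB * index_share
    return (budget_mib - current_index) / index_growth


for growth in (1, 3):
    print(f"{growth} GiB/day -> about {days_until_budget(growth):.0f} days")
```

Under these assumptions the two indexes alone reach the 90% mark in roughly two to six weeks, which matches the concern above.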

Would a packed index help? (How much does performance drop?)
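Independent of packing, one commonly used way to shrink an index over long URL strings is to index a fixed-width hash of the URL instead of a 400-character prefix. The sketch below only compares key sizes; the extra `BINARY(16)` hash column it implies is an assumption and not part of the schema above, and the function names are hypothetical.

```python
import hashlib

PREFIX_CHARS = 400  # prefix length of the existing extern_link indexes
HASH_BYTES = 16     # an MD5 digest would fit a hypothetical BINARY(16) column


def url_hash(url: str) -> bytes:
    """Fixed-width digest of a URL, meant to be indexed instead of the text."""
    return hashlib.md5(url.encode("utf-8")).digest()


def index_key_bytes(url: str) -> tuple:
    """Return (bytes of the prefix key, bytes of the hash key) for one URL."""
    prefix = url[:PREFIX_CHARS].encode("utf-8")
    return len(prefix), len(url_hash(url))


url = "http://example.com/some/long/path?" + "x=1&" * 60
print(index_key_bytes(url))
```

Because two different URLs can in principle share a digest, a query on the hash column should still compare the full `extern_link` value as well.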

The other important tables are under 500 MB.

Only the URL Source table is huge: 48.6 GiB 
Structure: 

    url_id      BIGINT
    pagesource  MEDIUMBLOB  (data is packed with gzip, high compression)

    The only index is on url_id (unique).

Once I have extracted everything I need, the data can be deleted from this table.
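For reference, packing the page source as described above could look like the following minimal sketch. It assumes the page is held as text and encoded as UTF-8 before compression; a real crawler may well store the raw response bytes instead, and the function names are only illustrative.

```python
import gzip


def pack_page(html: str) -> bytes:
    """Compress page source with gzip at the highest compression level."""
    # Assumption: pages are text and are stored UTF-8 encoded.
    return gzip.compress(html.encode("utf-8"), compresslevel=9)


def unpack_page(blob: bytes) -> str:
    """Inverse of pack_page: restore the original page source."""
    return gzip.decompress(blob).decode("utf-8")


page = "<html><body>" + "<a href='http://example.com/'>link</a>" * 200 + "</body></html>"
blob = pack_page(page)
print(len(page.encode("utf-8")), "->", len(blob), "bytes")
assert unpack_page(blob) == page
```

The resulting bytes are what would go into the `pagesource` column; a MEDIUMBLOB holds up to 16 MiB per value.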

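Do you have any experience with Hypertable? http://hypertable.org/ <= Google's Bigtable. If I move to Hypertable, would that help me with performance (extracting data / searching / inserting / selecting) and with database size? I have read through the site, but I'm still somewhat clueless, because you can't directly compare MySQL with Hypertable. I will try it out soon; first I have to read the documentation.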

What I need is a solution that fits my setup, since I have no money left for any other hardware.

Thanks for your help.

Answer 1

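Hypertable is an excellent choice for a crawl database. Hypertable is an open-source, high-performance, scalable database modeled after Google's Bigtable. Google developed Bigtable specifically for its crawl database. I recommend reading the Bigtable paper, since it uses the crawl database as a running example.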

Answer 2

Regarding #4 (RAID setup): RAID 5 is not recommended for production servers. A good article about it -> http://www.dbasquare.com/2012/04/02/should-raid-5-be-used-in-a-mysql-server/

Optimal MySQL configuration (partitioning) & indexes / Hypertable / RAID configuration (huge database)
