How can we replicate a Greenplum DB to another data center for DR purposes?

Asked: 2014-09-22 07:52:23

Tags: greenplum

We are planning a large Greenplum DB (growing from 10 TB to 100 TB within the first 18 months). Traditional backup and restore tools won't help, since we have a 24-hour RPO/RTO to meet. Is there any way to replicate the database to our DR site without resorting to block replication (i.e., putting the segments on a SAN and mirroring)?

2 answers:

Answer 0 (score: 1)

You have several options to choose from:

  1. Dual ETL. Duplicate the input data and run the same ETL on both sites, re-syncing with a backup/restore every week or so.
  2. Backup and restore. A plain backup/restore may not be efficient. However, if you use Data Domain, it can deduplicate at the block level and store only the changed blocks, and it can offload the deduplication work to run on the Greenplum cluster (DD Boost). Also, when replicating to a remote site it copies only the changed blocks, which greatly reduces replication time. In my experience, if a clean backup to DD takes 12 hours, subsequent DD Boost backups take about 4 hours, plus 4 hours to replicate the data.
  3. Custom solution. I know of cases where data is replicated to the remote site as part of the ETL process. The ETL jobs know which tables have changed, so those tables are added to a replication queue and moved to the remote site using external tables. Analysts work in a special sandbox, and their sandboxes are backed up and restored daily.

Currently Greenplum has no built-in WAN replication solution, so these are pretty much all the options to choose from.
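The external-table transfer in the custom solution can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline described above: the table and host names (`sales_fact`, `dr-etl-host`) are hypothetical, and it assumes a `gpfdist` file server is running on the DR side to receive and serve the data:

```sql
-- On the source cluster: a writable external table that streams rows
-- out to a gpfdist process reachable at the DR site (hypothetical host/port).
CREATE WRITABLE EXTERNAL TABLE sales_fact_out (LIKE sales_fact)
    LOCATION ('gpfdist://dr-etl-host:8081/sales_fact.csv')
    FORMAT 'CSV';

-- Push the changed table's contents to the DR site.
INSERT INTO sales_fact_out SELECT * FROM sales_fact;

-- On the DR cluster: a readable external table over the same file,
-- used to reload the data into the local copy of the table.
CREATE EXTERNAL TABLE sales_fact_in (LIKE sales_fact)
    LOCATION ('gpfdist://dr-etl-host:8081/sales_fact.csv')
    FORMAT 'CSV';

TRUNCATE sales_fact;
INSERT INTO sales_fact SELECT * FROM sales_fact_in;
```

A real replication queue would also need change detection and incremental loads rather than a full `TRUNCATE`/reload, but the external-table mechanics are the same.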

Answer 1 (score: 0)

I did some investigation into this. Here are my results.

I. Using EMC Symmetrix VMAX SAN (Storage Area Network) mirroring and SRDF (Symmetrix Remote Data Facility) remote replication software
Please refer to h12079-vnx-replication-technologies-overview-wp.pdf for details.
  Preconditions
             1. EMC Symmetrix VMAX SAN installed
             2. SRDF software

  Advantages of the 3 different modes

    1. Symmetrix Remote Data Facility / Synchronous (SRDF/S)
    Provides a no-data-loss solution (zero RPO). No server resource
    contention for the remote mirroring operation. Can perform restoration
    of the primary site with minimal impact to the application or to
    performance on the remote site. An enterprise disaster recovery
    solution. Supports replicating over IP and Fibre Channel protocols.

    2. Symmetrix Remote Data Facility / Asynchronous (SRDF/A)
    Extended-distance data replication that supports longer distances
    than SRDF/S. SRDF/A does not affect host performance, because host
    activity is decoupled from the remote copy process. Efficient link
    utilization results in lower link-bandwidth requirements. Provides
    facilities to invoke failover and restore operations. Supports
    replicating over IP and Fibre Channel protocols.

    3. Symmetrix Remote Data Facility / Data Mobility (SRDF/DM)
II. Using Backup Tools

Please refer to http://gpdb.docs.pivotal.io/4350/admin_guide/managing/backup.html for details
Parallel backup
The parallel backup utility is gpcrondump.

Non-parallel backup
Not recommended; it is mainly used to migrate PostgreSQL databases to Greenplum databases.

Parallel restore
Supports systems with the same configuration as the source Greenplum database, as well as systems with a different configuration.

Non-parallel restore
pg_restore requires modifying the CREATE TABLE statements to add a DISTRIBUTED BY clause.
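For reference, typical invocations of the parallel utilities look roughly like this. This is a sketch based on the Greenplum 4.3 documentation, not commands run against this environment; the database name `mydb` and the restore timestamp are placeholders, and the `--ddboost` flag assumes a Data Domain system has already been configured for the cluster:

```shell
# Parallel backup of database "mydb" across all segments, no prompts (-a).
gpcrondump -x mydb -a

# The same backup sent to a configured Data Domain appliance via DD Boost,
# so only changed blocks are stored after the first full backup.
gpcrondump -x mydb -a --ddboost

# Parallel restore of a specific backup set, identified by its timestamp key.
gpdbrestore -t 20140922120000 -a
```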

 
Disadvantages
1.  The backup process locks tables: it takes an EXCLUSIVE lock on the catalog table pg_class, so only read access is allowed during that period.
2.  After releasing the EXCLUSIVE lock on pg_class, it takes an ACCESS SHARE lock on all the tables being dumped; only read access is allowed while that lock is held.


III.    Replay DDL statements
In PostgreSQL, there is a parameter to log all SQL statements to a file:
in data/postgresql.conf, set log_statement to 'all'.
Write an application that extracts the DML and DDL statements from the log and runs them on the DR servers.
Advantages
1.  Easy to configure and maintain
2.  No decrease in performance
Disadvantage
1.  Needs additional storage for the statement logging
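A minimal sketch of the extraction side of such a replay application, assuming the default stderr log format in which statements appear after a "LOG:  statement: " marker (the exact prefix depends on the server's log_line_prefix setting, and actually shipping the statements to the DR cluster would use a database client driver, which is omitted here):

```python
import re

# Matches the statement marker in a PostgreSQL/Greenplum log line produced
# with log_statement = 'all'. Anything before "LOG:" (timestamp, PID, ...)
# depends on the server's log_line_prefix setting.
STMT_RE = re.compile(r"LOG:\s+statement:\s+(.*)$")

def extract_statements(log_lines):
    """Pull the SQL text out of statement log lines.

    Tab-indented continuation lines (multi-line statements) are appended
    to the statement that precedes them.
    """
    statements = []
    for line in log_lines:
        m = STMT_RE.search(line)
        if m:
            statements.append(m.group(1))
        elif statements and line.startswith("\t"):
            # Multi-line statements are logged with indented continuations.
            statements[-1] += "\n" + line.lstrip("\t").rstrip("\n")
    return statements

if __name__ == "__main__":
    sample = [
        "2014-09-22 07:52:23 UTC LOG:  statement: CREATE TABLE t (a int);",
        "2014-09-22 07:52:24 UTC LOG:  statement: INSERT INTO t",
        "\tVALUES (1);",
    ]
    for stmt in extract_statements(sample):
        # A real replay tool would execute each statement on the DR
        # cluster here instead of printing it.
        print(stmt)
```

A production version would also have to track which statements have already been replayed and handle session-dependent statements (e.g. temporary tables), which this sketch ignores.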

IV. Parse the WAL of PostgreSQL
Parse the WAL to extract the DDL statements, then run the generated DDL statements on the DR Greenplum.
Advantage
1.  Does not impact the source Greenplum database
Disadvantages
1.  Requires writing code to parse the WAL
2.  The WAL is not easy to parse; there is little documentation of its format.
3.  It is unclear whether this is feasible for Greenplum, since it is a solution for PostgreSQL.