How to export data from Google Spanner with multiple parallel readers?

Asked: 2017-06-30 17:00:15

Tags: google-cloud-platform google-cloud-spanner

External Backups/Snapshots for Google Cloud Spanner recommends using queries with timestamp bounds to create snapshots for export. At the bottom of the Timestamp Bounds documentation it states:

  Cloud Spanner continuously garbage collects deleted and overwritten data in the background to reclaim storage. This process is called version GC. By default, version GC reclaims versions after they are one hour old. Because of this, Cloud Spanner cannot perform reads at a read timestamp more than one hour in the past.

So any export needs to complete within an hour. A single reader (i.e. select * from table; using timestamp X) will not be able to export the entire table within an hour.
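To make the setup concrete, a single reader with a fixed read timestamp might look roughly like the sketch below using the Java client library (project, instance, database and table names are placeholders):

    import com.google.cloud.Timestamp;
    import com.google.cloud.spanner.*;

    // Minimal sketch of the single-reader approach: one query, one timestamp bound.
    // "my-project", "my-instance", "my-database" and "my_table" are placeholders.
    public class SingleReaderExport {
        public static void main(String[] args) {
            Spanner spanner = SpannerOptions.getDefaultInstance().getService();
            try {
                DatabaseClient client = spanner.getDatabaseClient(
                        DatabaseId.of("my-project", "my-instance", "my-database"));
                // Timestamp X: every row is read as of this timestamp, which must
                // stay within the one-hour version GC window.
                Timestamp snapshotTime = Timestamp.now();
                try (ResultSet rs = client
                        .singleUse(TimestampBound.ofReadTimestamp(snapshotTime))
                        .executeQuery(Statement.of("SELECT * FROM my_table"))) {
                    while (rs.next()) {
                        // write the row to the export destination
                    }
                }
            } finally {
                spanner.close();
            }
        }
    }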

How can multiple parallel readers be implemented in Spanner?

Note: One of the comments mentions that Apache Beam support is coming, but it looks like it uses a single reader:

  /** A simplest read function implementation. Parallelism support is coming. */

https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/NaiveSpannerReadFn.java#L26

Is there a way to do the parallel reads that Beam would need using the existing APIs today? Or does Beam need to use something that has not yet been released on Google Spanner?

2 answers:

Answer 0: (score: 1)

It is possible to read data from Cloud Spanner in parallel using the BatchClient class. Follow read_data_in_parallel for more information.
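As a minimal sketch (using the Java client library; project, instance, database and table names are placeholders), a partitioned read with BatchClient could look like this:

    import com.google.cloud.spanner.*;
    import java.util.List;

    // Minimal sketch of a partitioned query with BatchClient. All identifiers
    // ("my-project", "my_table", ...) are placeholders.
    public class BatchReadSketch {
        public static void main(String[] args) {
            Spanner spanner = SpannerOptions.getDefaultInstance().getService();
            try {
                BatchClient batchClient = spanner.getBatchClient(
                        DatabaseId.of("my-project", "my-instance", "my-database"));
                // All partitions share a single read timestamp, so together they
                // form a consistent snapshot of the database.
                try (BatchReadOnlyTransaction txn =
                        batchClient.batchReadOnlyTransaction(TimestampBound.strong())) {
                    // Split the query into partitions that can be read independently.
                    List<Partition> partitions = txn.partitionQuery(
                            PartitionOptions.getDefaultInstance(),
                            Statement.of("SELECT * FROM my_table"));
                    // Each partition could be handed to its own worker thread (or even
                    // another machine, together with txn.getBatchTransactionId()).
                    for (Partition partition : partitions) {
                        try (ResultSet rs = txn.execute(partition)) {
                            while (rs.next()) {
                                // export the row
                            }
                        }
                    }
                }
            } finally {
                spanner.close();
            }
        }
    }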

If you are looking to export data from Cloud Spanner, I'd recommend using Cloud Dataflow (see the integration details here), as it provides higher-level abstractions and takes care of data processing details such as scaling and failure handling.
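Assuming a Beam SDK version that includes SpannerIO, a Dataflow pipeline reading from Cloud Spanner could be sketched roughly as follows (instance, database and query are placeholders):

    import com.google.cloud.spanner.Struct;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.spanner.SpannerIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    // Rough sketch of a Beam pipeline reading from Cloud Spanner; run it with the
    // Dataflow runner to let the service handle scaling and retries.
    public class SpannerExportPipelineSketch {
        public static void main(String[] args) {
            Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
            PCollection<Struct> rows = p.apply(
                    SpannerIO.read()
                            .withInstanceId("my-instance")
                            .withDatabaseId("my-database")
                            .withQuery("SELECT * FROM my_table"));
            // 'rows' can then be transformed and written out, e.g. as Avro or CSV files.
            p.run().waitUntilFinish();
        }
    }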

Answer 1: (score: 0)

Edit 2018-03-30 - The example project has been updated to use the BatchClient offered by Google Cloud Spanner

After the release of the BatchClient for reading/downloading large amounts of data, the example project below has been updated to use the new batch client instead of the standard database client. The basic idea behind the project is still the same: copy data to or from Cloud Spanner and any other database using standard jdbc functionality. The following code snippet sets the jdbc connection to batch read mode:

if (source.isWrapperFor(ICloudSpannerConnection.class))
{
    ICloudSpannerConnection con = source.unwrap(ICloudSpannerConnection.class);
    // Make sure no transaction is running
    if (!con.isBatchReadOnly())
    {
        if (con.getAutoCommit())
        {
            con.setAutoCommit(false);
        }
        else
        {
            con.commit();
        }
        con.setBatchReadOnly(true);
    }
}

When the connection is in 'batch read-only mode', it will use the BatchClient of Google Cloud Spanner instead of the standard database client. When one of the Statement#execute(String) or PreparedStatement#execute() methods is then called (as these methods allow multiple result sets to be returned), the jdbc driver creates a partitioned query instead of a normal query. The results of this partitioned query are a number of result sets (one per partition) that can be fetched with the Statement#getResultSet() and Statement#getMoreResults(int) methods.

Statement statement = source.createStatement();
boolean hasResults = statement.execute(select);
int workerNumber = 0;
while (hasResults)
{
    ResultSet rs = statement.getResultSet();
    PartitionWorker worker = new PartitionWorker("PartitionWorker-" + workerNumber, config, rs, tableSpec, table, insertCols);
    workers.add(worker);
    hasResults = statement.getMoreResults(Statement.KEEP_CURRENT_RESULT);
    workerNumber++;
}

The result sets returned by Statement#execute(String) are not executed immediately, but only after the first call to ResultSet#next(). Handing these result sets to separate worker threads ensures that the data is downloaded and copied in parallel.
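As a purely hypothetical sketch of that idea (the real worker class lives in the linked project and has a different constructor), such a worker thread could look something like this:

    // Hypothetical sketch only: drains one partition's result set and copies the
    // rows to the destination. The real worker in the linked project also handles
    // configuration, batching and error reporting.
    public class ResultSetCopyWorker extends Thread
    {
        private final java.sql.ResultSet rs;
        private final java.sql.PreparedStatement insert;

        ResultSetCopyWorker(String name, java.sql.ResultSet rs, java.sql.PreparedStatement insert)
        {
            super(name);
            this.rs = rs;
            this.insert = insert;
        }

        @Override
        public void run()
        {
            try
            {
                // The partitioned query is only executed on the first call to next().
                while (rs.next())
                {
                    int columnCount = rs.getMetaData().getColumnCount();
                    for (int i = 1; i <= columnCount; i++)
                    {
                        insert.setObject(i, rs.getObject(i));
                    }
                    insert.executeUpdate();
                }
            }
            catch (java.sql.SQLException e)
            {
                throw new RuntimeException(e);
            }
        }
    }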

Original answer:

This project was originally created for conversion in the other direction (from a local database to Cloud Spanner), but as it uses JDBC for both source and destination, it can also be used the other way around: converting a Cloud Spanner database to a local PostgreSQL database. Large tables are converted in parallel using a thread pool.

The project uses this open source JDBC driver instead of the JDBC driver supplied by Google. The source Cloud Spanner JDBC connection is set to read-only mode with autocommit = false. This ensures that the connection automatically creates a read-only transaction, using the current time as the timestamp, the first time a query is executed. All subsequent queries within the same (read-only) transaction use that same timestamp, giving you a consistent snapshot of your Google Cloud Spanner database.

It works as follows:

  1. Set the source database to read-only transaction mode.
  2. The convert(String catalog, String schema) method iterates over all the tables in the source database (Cloud Spanner).
  3. For each table the number of records is determined, and depending on the size of the table, the table is copied using either the main thread of the application or a worker pool.
  4. The class UploadWorker takes care of the parallel copying. Each worker is assigned a range of records from the table (for example rows 1 to 2,400). The range is selected by a select statement in this format: 'SELECT * FROM $TABLE ORDER BY $PK_COLUMNS LIMIT $BATCH_SIZE OFFSET $CURRENT_OFFSET'
  5. After all tables have been converted, the read-only transaction is committed on the source database.
  6. Below is a code snippet of the most important parts.

    public void convert(String catalog, String schema) throws SQLException
    {
        int batchSize = config.getBatchSize();
        destination.setAutoCommit(false);
        // Set the source connection to transaction mode (no autocommit) and read-only
        source.setAutoCommit(false);
        source.setReadOnly(true);
        try (ResultSet tables = destination.getMetaData().getTables(catalog, schema, null, new String[] { "TABLE" }))
        {
            while (tables.next())
            {
                String tableSchema = tables.getString("TABLE_SCHEM");
                if (!config.getDestinationDatabaseType().isSystemSchema(tableSchema))
                {
                    String table = tables.getString("TABLE_NAME");
                    // Check whether the destination table is empty.
                    int destinationRecordCount = getDestinationRecordCount(table);
                    if (destinationRecordCount == 0 || config.getDataConvertMode() == ConvertMode.DropAndRecreate)
                    {
                        if (destinationRecordCount > 0)
                        {
                            deleteAll(table);
                        }
                        int sourceRecordCount = getSourceRecordCount(getTableSpec(catalog, tableSchema, table));
                        if (sourceRecordCount > batchSize)
                        {
                            convertTableWithWorkers(catalog, tableSchema, table);
                        }
                        else
                        {
                            convertTable(catalog, tableSchema, table);
                        }
                    }
                    else
                    {
                        if (config.getDataConvertMode() == ConvertMode.ThrowExceptionIfExists)
                            throw new IllegalStateException("Table " + table + " is not empty");
                        else if (config.getDataConvertMode() == ConvertMode.SkipExisting)
                            log.info("Skipping data copy for table " + table);
                    }
                }
            }
        }
        source.commit();
    }
    
    private void convertTableWithWorkers(String catalog, String schema, String table) throws SQLException
    {
        String tableSpec = getTableSpec(catalog, schema, table);
        Columns insertCols = getColumns(catalog, schema, table, false);
        Columns selectCols = getColumns(catalog, schema, table, true);
        if (insertCols.primaryKeyCols.isEmpty())
        {
            log.warning("Table " + tableSpec + " does not have a primary key. No data will be copied.");
            return;
        }
        log.info("About to copy data from table " + tableSpec);
    
        int batchSize = config.getBatchSize();
        int totalRecordCount = getSourceRecordCount(tableSpec);
        int numberOfWorkers = calculateNumberOfWorkers(totalRecordCount);
        int numberOfRecordsPerWorker = totalRecordCount / numberOfWorkers;
        if (totalRecordCount % numberOfWorkers > 0)
            numberOfRecordsPerWorker++;
        int currentOffset = 0;
        ExecutorService service = Executors.newFixedThreadPool(numberOfWorkers);
        for (int workerNumber = 0; workerNumber < numberOfWorkers; workerNumber++)
        {
            int workerRecordCount = Math.min(numberOfRecordsPerWorker, totalRecordCount - currentOffset);
            UploadWorker worker = new UploadWorker("UploadWorker-" + workerNumber, selectFormat, tableSpec, table,
                    insertCols, selectCols, currentOffset, workerRecordCount, batchSize, source,
                    config.getUrlDestination(), config.isUseJdbcBatching());
            service.submit(worker);
            currentOffset = currentOffset + numberOfRecordsPerWorker;
        }
        service.shutdown();
        try
        {
            service.awaitTermination(config.getUploadWorkerMaxWaitInMinutes(), TimeUnit.MINUTES);
        }
        catch (InterruptedException e)
        {
            log.severe("Error while waiting for workers to finish: " + e.getMessage());
            throw new RuntimeException(e);
        }
    
    }
    
    public class UploadWorker implements Runnable
    {
    private static final Logger log = Logger.getLogger(UploadWorker.class.getName());
    
    private final String name;
    
    private String selectFormat;
    
    private String sourceTable;
    
    private String destinationTable;
    
    private Columns insertCols;
    
    private Columns selectCols;
    
    private int beginOffset;
    
    private int numberOfRecordsToCopy;
    
    private int batchSize;
    
    private Connection source;
    
    private String urlDestination;
    
    private boolean useJdbcBatching;
    
    UploadWorker(String name, String selectFormat, String sourceTable, String destinationTable, Columns insertCols,
            Columns selectCols, int beginOffset, int numberOfRecordsToCopy, int batchSize, Connection source,
            String urlDestination, boolean useJdbcBatching)
    {
        this.name = name;
        this.selectFormat = selectFormat;
        this.sourceTable = sourceTable;
        this.destinationTable = destinationTable;
        this.insertCols = insertCols;
        this.selectCols = selectCols;
        this.beginOffset = beginOffset;
        this.numberOfRecordsToCopy = numberOfRecordsToCopy;
        this.batchSize = batchSize;
        this.source = source;
        this.urlDestination = urlDestination;
        this.useJdbcBatching = useJdbcBatching;
    }
    
    @Override
    public void run()
    {
        // Connection source = DriverManager.getConnection(urlSource);
        try (Connection destination = DriverManager.getConnection(urlDestination))
        {
            log.info(name + ": " + sourceTable + ": Starting copying " + numberOfRecordsToCopy + " records");
    
            destination.setAutoCommit(false);
            String sql = "INSERT INTO " + destinationTable + " (" + insertCols.getColumnNames() + ") VALUES \n";
            sql = sql + "(" + insertCols.getColumnParameters() + ")";
            PreparedStatement statement = destination.prepareStatement(sql);
    
            int lastRecord = beginOffset + numberOfRecordsToCopy;
            int recordCount = 0;
            int currentOffset = beginOffset;
            while (true)
            {
                int limit = Math.min(batchSize, lastRecord - currentOffset);
                String select = selectFormat.replace("$COLUMNS", selectCols.getColumnNames());
                select = select.replace("$TABLE", sourceTable);
                select = select.replace("$PRIMARY_KEY", selectCols.getPrimaryKeyColumns());
                select = select.replace("$BATCH_SIZE", String.valueOf(limit));
                select = select.replace("$OFFSET", String.valueOf(currentOffset));
                try (ResultSet rs = source.createStatement().executeQuery(select))
                {
                    while (rs.next())
                    {
                        int index = 1;
                        for (Integer type : insertCols.columnTypes)
                        {
                            Object object = rs.getObject(index);
                            statement.setObject(index, object, type);
                            index++;
                        }
                        if (useJdbcBatching)
                            statement.addBatch();
                        else
                            statement.executeUpdate();
                        recordCount++;
                    }
                    if (useJdbcBatching)
                        statement.executeBatch();
                }
                destination.commit();
                log.info(name + ": " + sourceTable + ": Records copied so far: " + recordCount + " of "
                        + numberOfRecordsToCopy);
                currentOffset = currentOffset + batchSize;
                if (recordCount >= numberOfRecordsToCopy)
                    break;
            }
        }
        catch (SQLException e)
        {
            log.severe("Error during data copy: " + e.getMessage());
            throw new RuntimeException(e);
        }
        log.info(name + ": Finished copying");
    }
    
    }