I am creating a data pipeline using Kafka source and sink connectors. The source connector consumes from a SQL database and publishes to a topic, and the sink connector subscribes to that topic and writes into another SQL database. The table holds 16 GB of data. The problem is that the data is not being transferred from one database to the other. However, if the table is small (say 1000 rows), the data is transferred successfully.
Source connector config:
"config": {
"connector.class":
"io.confluent.connect.jdbc.JdbcSourceConnector",
"tasks.max": "1",
"connection.url": "",
"mode": "incrementing",
"incrementing.column.name": "ID",
"topic.prefix": "migration_",
"name": "jdbc-source",
"validate.non.null": false,
"batch.max.rows":5
}
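The sink connector config has the same general shape; as a sketch (connection.url and the topic name below are placeholders, not the exact settings):
"config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "1",
    "connection.url": "",
    "topics": "migration_myTable",
    "insert.mode": "insert",
    "auto.create": "true"
}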
Source connector logs:
INFO WorkerSourceTask{id=cmc-migration-source-0} flushing 0 outstanding messages for offset commit
[2019-03-08 16:48:45,402] INFO WorkerSourceTask{id=cmc-migration-source-0} Committing offsets
[2019-03-08 16:48:45,402] INFO WorkerSourceTask{id=cmc-migration-source-0} flushing 0 outstanding messages for offset commit
[2019-03-08 16:48:55,403] INFO WorkerSourceTask{id=cmc-migration-source-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSourceTask:397)
Can someone guide me on how to tune my Kafka source connector so it can transfer large amounts of data?
Answer 0 (score: 0)
I managed to overcome this problem by limiting the number of records returned by a single query to the database, e.g. to 5000 at a time.
The solution depends on the database and the SQL dialect. The examples below work correctly and manage offsets for a single table. The incrementing ID column and the timestamp column must be set up as described here: https://docs.confluent.io/kafka-connect-jdbc/current/source-connector/index.html#incremental-query-modes
The example table myTable has the following columns:
- id - incremented every time a new record is added
- lastUpdatedTimestamp - updated every time a record is updated

Together, id and lastUpdatedTimestamp must uniquely identify a record in the dataset.
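As a sketch, a MySQL table meeting these requirements could be defined as follows (the name and age columns are taken from the later examples; exact DDL varies by database - on PostgreSQL, for instance, refreshing the timestamp on update requires a trigger):
CREATE TABLE myTable (
    id                   BIGINT AUTO_INCREMENT PRIMARY KEY,
    name                 VARCHAR(100),
    age                  INT,
    -- set on insert and refreshed automatically on every update
    lastUpdatedTimestamp TIMESTAMP NOT NULL
        DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);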
The connector constructs its query as follows:
config.query
+ Kafka Connect WHERE clause for a selected mode
+ config.query.suffix
PostgreSQL / MySQL
"config": {
...
"poll.interval.ms" : 10000,
"mode":"timestamp+incrementing",
"incrementing.column.name": "id",
"timestamp.column.name": "lastUpdatedTimestamp",
"table.whitelist": "myTable",
"query.suffix": "LIMIT 5000"
...
}
will result in:
SELECT *
FROM "myTable"
WHERE "myTable"."lastUpdatedTimestamp" < ?
AND (
("myTable"."lastUpdatedTimestamp" = ? AND "myTable"."id" > ?)
OR
"myTable"."lastUpdatedTimestamp" > ?
)
ORDER BY
"myTable"."lastUpdatedTimestamp",
"myTable"."id" ASC
LIMIT 5000
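As a usage sketch, the finished config can be submitted to the Kafka Connect REST API; the worker address localhost:8083 and the connector name jdbc-source-limited below are assumptions:
curl -X POST http://localhost:8083/connectors \
    -H "Content-Type: application/json" \
    -d '{
          "name": "jdbc-source-limited",
          "config": {
            "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
            "connection.url": "",
            "mode": "timestamp+incrementing",
            "incrementing.column.name": "id",
            "timestamp.column.name": "lastUpdatedTimestamp",
            "table.whitelist": "myTable",
            "topic.prefix": "migration_",
            "poll.interval.ms": "10000",
            "query.suffix": "LIMIT 5000"
          }
        }'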
If you want to add additional conditions to the WHERE clause, the following approach can be used.
"config": {
...
"poll.interval.ms" : 10000,
"mode":"timestamp+incrementing",
"incrementing.column.name": "id",
"timestamp.column.name": "lastUpdatedTimestamp",
"query": "SELECT * FROM ( SELECT id, lastUpdatedTimestamp, name, age FROM myTable WHERE Age > 18) myQuery",
"query.suffix": "LIMIT 5000"
...
}
will result in:
SELECT *
FROM (
SELECT id, lastUpdatedTimestamp, name, age
FROM myTable
WHERE Age > 18
) myQuery
WHERE "myTable"."lastUpdatedTimestamp" < ?
AND (
("myTable"."lastUpdatedTimestamp" = ? AND "myTable"."id" > ?)
OR
"myTable"."lastUpdatedTimestamp" > ?
)
ORDER BY
"myTable"."lastUpdatedTimestamp",
"myTable"."id" ASC
LIMIT 5000
SQL Server
"config": {
...
"poll.interval.ms" : 10000,
"mode":"timestamp+incrementing",
"incrementing.column.name": "id",
"timestamp.column.name": "lastUpdatedTimestamp",
"query": "SELECT TOP 5000 * FROM (SELECT id, lastUpdatedTimestamp, name, age FROM myTable) myQuery",
...
}
will result in:
SELECT TOP 5000 *
FROM (
SELECT id, lastUpdatedTimestamp, name, age
FROM myTable
WHERE Age > 18
) myQuery
WHERE "myTable"."lastUpdatedTimestamp" < ?
AND (
("myTable"."lastUpdatedTimestamp" = ? AND "myTable"."id" > ?)
OR
"myTable"."lastUpdatedTimestamp" > ?
)
ORDER BY
"myTable"."lastUpdatedTimestamp",
"myTable"."id" ASC
Oracle
"config": {
...
"poll.interval.ms" : 10000,
"mode":"timestamp+incrementing",
"incrementing.column.name": "id",
"timestamp.column.name": "lastUpdatedTimestamp",
"query": "SELECT * FROM (SELECT id, lastUpdatedTimestamp, name, age FROM myTable WHERE ROWNUM <= 5000) myQuery",
...
}
will result in:
SELECT *
FROM (
SELECT id, lastUpdatedTimestamp, name, age
FROM myTable
WHERE ROWNUM <= 5000
) myQuery
WHERE "myTable"."lastUpdatedTimestamp" < ?
AND (
("myTable"."lastUpdatedTimestamp" = ? AND "myTable"."id" > ?)
OR
"myTable"."lastUpdatedTimestamp" > ?
)
ORDER BY
"myTable"."lastUpdatedTimestamp",
"myTable"."id" ASC
This approach will not work in bulk mode. It works in timestamp+incrementing mode, and may also work in timestamp or incrementing mode, depending on the characteristics of the table (for example, incrementing alone only captures newly inserted rows, so it is sufficient for insert-only tables).
Joining many tables - an idea I have not tested!
If the query performs joins across multiple tables, it becomes more complicated. It would require the following:
Due to the Java long data type used by Kafka Connect, i.e. 9,223,372,036,854,775,807