You can read about my setup here. I solved the problem described there, but now I have a new one.
I am reading data from 3 tables. I have a problem with one (the largest) table. I read a lot of data from it at a rate of about 300,000 rows/sec, but after roughly 10 hours (once the other two tables have been read completely) the rate drops to ~20,000 rows/sec. After 24 hours the job still hasn't finished.
There are a lot of suspicious lines in the log:
I Proposing dynamic split of work unit cybrmt;2018-01-17_22_54_11-12138573770170126316;3251780906818434621 at {"fractionConsumed":0.5}
I Rejecting split request because custom reader returned null residual source.
I Proposing dynamic split of work unit cybrmt;2018-01-17_22_54_11-12138573770170126316;3251780906818434621 at {"fractionConsumed":0.5}
I Rejecting split request because custom reader returned null residual source.
I Proposing dynamic split of work unit cybrmt;2018-01-17_22_54_11-12138573770170126316;3251780906818434621 at {"fractionConsumed":0.5}
I Rejecting split request because custom reader returned null residual source.
I Proposing dynamic split of work unit cybrmt;2018-01-17_22_54_11-12138573770170126316;3251780906818434621 at {"fractionConsumed":0.5}
I Rejecting split request because custom reader returned null residual source.
I Proposing dynamic split of work unit cybrmt;2018-01-17_22_54_11-12138573770170126316;3251780906818434621 at {"fractionConsumed":0.5}
I Rejecting split request because custom reader returned null residual source.
I Proposing dynamic split of work unit cybrmt;2018-01-17_22_54_11-12138573770170126316;3251780906818434621 at {"fractionConsumed":0.5}
I Rejecting split request because custom reader returned null residual source.
I Proposing dynamic split of work unit cybrmt;2018-01-17_22_54_11-12138573770170126316;3251780906818434621 at {"fractionConsumed":0.5}
I Rejecting split request because custom reader returned null residual source.
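As far as I understand, these messages come from Dataflow's dynamic work rebalancing: the service proposes a split of a running work unit and the reader refuses it. Below is a minimal sketch (my illustration, not CassandraIO's actual code) of the Beam BoundedReader contract involved; splitAtFraction returning null (which is also the default in the SDK) means "no residual source", which produces exactly this rejection and leaves the rest of the work unit stuck on one worker:

import org.apache.beam.sdk.io.BoundedSource;

// Sketch of org.apache.beam.sdk.io.BoundedSource.BoundedReader.splitAtFraction.
// Dataflow calls it when it proposes a dynamic split (here at fractionConsumed 0.5).
abstract class NonSplittableReader<T> extends BoundedSource.BoundedReader<T> {
    @Override
    public BoundedSource<T> splitAtFraction(double fraction) {
        // Returning null means the work unit cannot be split;
        // the service then logs the rejection seen above.
        return null;
    }
}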
The job finished with an exception:
(f000632be487340d): Workflow failed. Causes: (844d65bb40eb132b): S14:Read from Cassa table/Read(CassandraSource)+Transform to KV by id+CoGroupByKey id/MakeUnionTable0+CoGroupByKey id/GroupByKey/Reify+CoGroupByKey id/GroupByKey/Write failed., (c07ceebe5d95f668): A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on:
starterpipeline-sosenko19-01172254-4260-harness-wrdk,
starterpipeline-sosenko19-01172254-4260-harness-xrkd,
starterpipeline-sosenko19-01172254-4260-harness-hvfd,
starterpipeline-sosenko19-01172254-4260-harness-0pf5
There are two tables. One has about 2 billion rows, each with a unique key (1 row per key). The second has about 20 billion rows, with 10 or fewer rows per key.
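For illustration, here is roughly how the two entities are mapped with the DataStax object mapper annotations that CassandraIO.withEntity() relies on. This is a hedged sketch: the real classes have many more columns, and the partition keys shown are my assumption based on the table names; only the row counts and rows-per-join-key come from the description above:

import java.io.Serializable;
import com.datastax.driver.mapping.annotations.Column;
import com.datastax.driver.mapping.annotations.PartitionKey;
import com.datastax.driver.mapping.annotations.Table;

// ~2 billion rows, exactly one row per match_id
@Table(keyspace = "cybermates", name = "opendota_match")
class OpendotaMatch implements Serializable {
    @PartitionKey
    @Column(name = "match_id")
    Long match_id;
    // ... the remaining columns are cut out ...
}

// ~20 billion rows, at most 10 rows (players) per match_id
@Table(keyspace = "cybermates", name = "opendota_player_match_by_account_id2")
class OpendotaPlayerMatch implements Serializable {
    @PartitionKey
    @Column(name = "account_id")
    Long account_id;

    @Column(name = "match_id")
    Long match_id;
    // ... the remaining columns are cut out ...
}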
Here is the code, including the CoGroupByKey match_id block:
// Create the pipeline
Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

// Read data from Cassandra table opendota_player_match_by_account_id2
PCollection<OpendotaPlayerMatch> player_matches = p.apply("Read from Cassa table opendota_player_match_by_account_id2",
        CassandraIO.<OpendotaPlayerMatch>read()
                .withHosts(Arrays.asList("10.132.9.101", "10.132.9.102", "10.132.9.103", "10.132.9.104")).withPort(9042)
                .withKeyspace("cybermates").withTable(CASSA_OPENDOTA_PLAYER_MATCH_BY_ACCOUNT_ID_TABLE_NAME)
                .withEntity(OpendotaPlayerMatch.class).withCoder(SerializableCoder.of(OpendotaPlayerMatch.class))
                .withConsistencyLevel(CASSA_CONSISTENCY_LEVEL));

// Transform player_matches to KV by match_id
PCollection<KV<Long, OpendotaPlayerMatch>> opendota_player_matches_by_match_id = player_matches
        .apply("Transform player_matches to KV by match_id", ParDo.of(new DoFn<OpendotaPlayerMatch, KV<Long, OpendotaPlayerMatch>>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                // LOG.info(c.element().match_id.toString());
                c.output(KV.of(c.element().match_id, c.element()));
            }
        }));

// Read data from Cassandra table opendota_match
PCollection<OpendotaMatch> opendota_matches = p.apply("Read from Cassa table opendota_match",
        CassandraIO.<OpendotaMatch>read()
                .withHosts(Arrays.asList("10.132.9.101", "10.132.9.102", "10.132.9.103", "10.132.9.104")).withPort(9042)
                .withKeyspace("cybermates").withTable(CASSA_OPENDOTA_MATCH_TABLE_NAME).withEntity(OpendotaMatch.class)
                .withCoder(SerializableCoder.of(OpendotaMatch.class))
                .withConsistencyLevel(CASSA_CONSISTENCY_LEVEL));

// Read data from Cassandra table match and adapt it to the OpendotaMatch structure
PCollection<OpendotaMatch> matches = p.apply("Read from Cassa table match",
        CassandraIO.<Match>read()
                .withHosts(Arrays.asList("10.132.9.101", "10.132.9.102", "10.132.9.103", "10.132.9.104")).withPort(9042)
                .withKeyspace("cybermates").withTable(CASSA_MATCH_TABLE_NAME).withEntity(Match.class)
                .withCoder(SerializableCoder.of(Match.class))
                .withConsistencyLevel(CASSA_CONSISTENCY_LEVEL))
        .apply("Adopt match for uniform structure", ParDo.of(new DoFn<Match, OpendotaMatch>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                // LOG.info(c.element().match_id.toString());
                OpendotaMatch m = new OpendotaMatch();
                // The opendota_match and match tables have slightly different schemas.
                // The field-by-field conversion is cut out here because it's long and mechanical.
                c.output(m);
            }
        }));

// Union match and opendota_match
PCollectionList<OpendotaMatch> matches_collections = PCollectionList.of(opendota_matches).and(matches);
PCollection<OpendotaMatch> all_matches = matches_collections.apply("Union match and opendota_match", Flatten.<OpendotaMatch>pCollections());

// Transform matches to KV by match_id
PCollection<KV<Long, OpendotaMatch>> matches_by_match_id = all_matches
        .apply("Transform matches to KV by match_id", ParDo.of(new DoFn<OpendotaMatch, KV<Long, OpendotaMatch>>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                // LOG.info(c.element().players.toString());
                c.output(KV.of(c.element().match_id, c.element()));
            }
        }));

// CoGroupByKey match_id, then replicate the match data to every player of the match
final TupleTag<OpendotaPlayerMatch> player_match_tag = new TupleTag<OpendotaPlayerMatch>();
final TupleTag<OpendotaMatch> match_tag = new TupleTag<OpendotaMatch>();
PCollection<KV<Long, PMandM>> joined_matches = KeyedPCollectionTuple
        .of(player_match_tag, opendota_player_matches_by_match_id).and(match_tag, matches_by_match_id)
        .apply("CoGroupByKey match_id", CoGroupByKey.<Long>create())
        .apply("Replicate data", ParDo.of(new DoFn<KV<Long, CoGbkResult>, KV<Long, PMandM>>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                try {
                    OpendotaMatch m = c.element().getValue().getAll(match_tag).iterator().next();
                    Iterable<OpendotaPlayerMatch> pms = c.element().getValue().getAll(player_match_tag);
                    for (OpendotaPlayerMatch pm : pms) {
                        if (0 <= pm.account_id && pm.account_id < MAX_UINT) {
                            for (OpendotaPlayerMatch pm2 : pms) {
                                c.output(KV.of(pm.account_id, new PMandM(pm2, m)));
                            }
                        }
                    }
                } catch (NoSuchElementException e) {
                    // No match row was found for this match_id
                    LOG.error(c.element().getValue().getAll(player_match_tag).iterator().next().match_id.toString() + " " + e.toString());
                }
            }
        }));

// Transform to byte array and write to BQ, one table per account_id
joined_matches
        .apply("Transform to byte array, Write to BQ", BigQueryIO.<KV<Long, PMandM>>write().to(new DynamicDestinations<KV<Long, PMandM>, String>() {
            @Override
            public String getDestination(ValueInSingleWindow<KV<Long, PMandM>> element) {
                return element.getValue().getKey().toString();
            }

            @Override
            public TableDestination getTable(String account_id_str) {
                return new TableDestination("cybrmt:" + BQ_DATASET_NAME + ".player_match_" + account_id_str,
                        "Table for user " + account_id_str);
            }

            @Override
            public TableSchema getSchema(String account_id_str) {
                List<TableFieldSchema> fields = new ArrayList<>();
                fields.add(new TableFieldSchema().setName("value").setType("BYTES"));
                return new TableSchema().setFields(fields);
            }
        }).withFormatFunction(new SerializableFunction<KV<Long, PMandM>, TableRow>() {
            @Override
            public TableRow apply(KV<Long, PMandM> element) {
                OpendotaPlayerMatch pm = element.getValue().pm;
                OpendotaMatch m = element.getValue().m;
                TableRow tr = new TableRow();
                ByteBuffer bb = ByteBuffer.allocate(114);
                // The transform into the byte buffer is cut out here because it's long and mechanical.
                tr.set("value", bb.array());
                return tr;
            }
        }));

p.run();
I tried to read the problem table from above on its own. The pipeline contains only a CassandraIO.Read transform and a dummy ParDo transform with some log output. It now behaves just like the full pipeline: there is one split (I believe the last one) that cannot complete:
I Proposing dynamic split of work unit cybrmt;2018-01-20_21_28_01-3451798636786921663;1617811313034836533 at {"fractionConsumed":0.5}
I Rejecting split request because custom reader returned null residual source.
Here is the pipeline graph:
Here is the code:
// Create the pipeline
Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

// Read data from Cassandra table opendota_player_match_by_account_id2
PCollection<OpendotaPlayerMatch> player_matches = p.apply("Read from Cassa table opendota_player_match_by_account_id2",
        CassandraIO.<OpendotaPlayerMatch>read()
                .withHosts(Arrays.asList("10.132.9.101", "10.132.9.102", "10.132.9.103", "10.132.9.104")).withPort(9042)
                .withKeyspace("cybermates").withTable(CASSA_OPENDOTA_PLAYER_MATCH_BY_ACCOUNT_ID_TABLE_NAME)
                .withEntity(OpendotaPlayerMatch.class).withCoder(SerializableCoder.of(OpendotaPlayerMatch.class))
                .withConsistencyLevel(CASSA_CONSISTENCY_LEVEL));

// Print my matches
player_matches.apply("Print my matches", ParDo.of(new DoFn<OpendotaPlayerMatch, Long>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        if (c.element().account_id == 114688838) {
            LOG.info(c.element().match_id.toString());
            c.output(c.element().match_id);
        }
    }
}));

p.run();
The small pipeline (CassandraIO.Read and ParDo) finished successfully in 23 hours. For the first 4 hours it ran with the maximum number of workers (40) at a high read rate (~300,000 rows/sec). After that the worker count autoscaled down to 1 and the read rate dropped to ~15,000 rows/sec. Here are the graphs:
Here is the end of the log:
I Proposing dynamic split of work unit cybrmt;2018-01-20_21_28_01-3451798636786921663;1617811313034836533 at {"fractionConsumed":0.5}
I Rejecting split request because custom reader returned null residual source.
I Proposing dynamic split of work unit cybrmt;2018-01-20_21_28_01-3451798636786921663;1617811313034836533 at {"fractionConsumed":0.5}
I Rejecting split request because custom reader returned null residual source.
I Success processing work item cybrmt;2018-01-20_21_28_01-3451798636786921663;1617811313034836533
I Finished processing stage s01 with 0 errors in 75268.681 seconds
Answer 0 (score: 0)
In the end I followed @jkff's suggestion and read the data from a table with a different partition key, one whose values are distributed more uniformly (in fact my data model has two tables with the same data but different partition keys).
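For completeness, a rough sketch of the change (the table name opendota_player_match_by_match_id is made up for illustration; the point is only that the second copy of the data is partitioned by a key whose values spread uniformly over the token ring, so CassandraIO can split the read evenly):

// Read the same data from the copy with the more uniform partition key.
PCollection<OpendotaPlayerMatch> player_matches = p.apply(
        "Read from Cassa table opendota_player_match_by_match_id",
        CassandraIO.<OpendotaPlayerMatch>read()
                .withHosts(Arrays.asList("10.132.9.101", "10.132.9.102", "10.132.9.103", "10.132.9.104"))
                .withPort(9042)
                .withKeyspace("cybermates")
                .withTable("opendota_player_match_by_match_id") // hypothetical name
                .withEntity(OpendotaPlayerMatch.class)
                .withCoder(SerializableCoder.of(OpendotaPlayerMatch.class))
                .withConsistencyLevel(CASSA_CONSISTENCY_LEVEL));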