We can't work out why DataflowAssert fails when we run this simple test:
@Test
@Category(RunnableOnService.class)
public void testTableRow() throws Exception {
  Pipeline p = TestPipeline.create();
  PCollection<TableRow> pCollectionTable1 = p.apply("a", Create.of(TABLEROWS_ARRAY_1));
  PCollection<TableRow> pCollectionTable2 = p.apply("b", Create.of(TABLEROWS_ARRAY_2));
  PCollection<TableRow> joinedTables = Table.join(pCollectionTable1, pCollectionTable2);
  DataflowAssert.that(joinedTables).containsInAnyOrder(TABLEROW_TEST);
  p.run();
}
We get the following exception:
Sep 25, 2015 10:42:50 AM com.google.cloud.dataflow.sdk.testing.DataflowAssert$TwoSideInputAssert$CheckerDoFn processElement
SEVERE: DataflowAssert failed expectations.
java.lang.AssertionError:
Expected: iterable over [<{id=x}>] in any order
but: Not matched: <{id=x}>
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
at org.junit.Assert.assertThat(Assert.java:865)
at org.junit.Assert.assertThat(Assert.java:832)
at ...
To simplify the DataflowAssert test, we hard-coded the output of Table.join so that it should match the assertion:
private static final TableRow TABLEROW_TEST = new TableRow()
    .set("id", "x");

static PCollection<TableRow> join(PCollection<TableRow> pCollectionTable1,
    PCollection<TableRow> pCollectionTable2) throws Exception {
  final TupleTag<String> pCollectionTable1Tag = new TupleTag<String>();
  final TupleTag<String> pCollectionTable2Tag = new TupleTag<String>();
  PCollection<KV<String, String>> table1Data = pCollectionTable1
      .apply(ParDo.of(new ExtractTable1DataFn()));
  PCollection<KV<String, String>> table2Data = pCollectionTable2
      .apply(ParDo.of(new ExtractTable2DataFn()));
  PCollection<KV<String, CoGbkResult>> kvpCollection = KeyedPCollectionTuple
      .of(pCollectionTable1Tag, table1Data).and(pCollectionTable2Tag, table2Data)
      .apply(CoGroupByKey.<String> create());
  PCollection<KV<String, String>> resultCollection = kvpCollection
      .apply(ParDo.named("Process join")
          .of(new DoFn<KV<String, CoGbkResult>, KV<String, String>>() {
            private static final long serialVersionUID = 0;
            @Override
            public void processElement(ProcessContext c) {
              // System.out.println(c);
              KV<String, CoGbkResult> e = c.element();
              String key = e.getKey();
              String value = null;
              for (String table1Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
                for (String table2Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
                  value = table1Value + "," + table2Value;
                }
              }
              c.output(KV.of(key, value));
            }
          }));
  PCollection<TableRow> formattedResults = resultCollection.apply(
      ParDo.named("Format join").of(new DoFn<KV<String, String>, TableRow>() {
        private static final long serialVersionUID = 0;
        public void processElement(ProcessContext c) {
          TableRow row = new TableRow().set("id", "x");
          c.output(row);
        }
      }));
  return formattedResults;
}
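Does anyone know what we are doing wrong?
Answer 0 (score: 1)
I think the error message is telling you that the actual collection contains more copies of the element than the assertion expects.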
Expected: iterable over [<{id=x}>] in any order
but: Not matched: <{id=x}>
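This is hamcrest saying that you asked for an iterable over a single element, but the actual collection contained an item that was not matched. Because every item coming out of "Format join" has the same value, the message is harder to read than it would otherwise be.
Specifically, this is the message I get when I run the following test, which checks that a collection holding two copies of row contains exactly one copy of row: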
@Category(RunnableOnService.class)
@Test
public void testTableRow() throws Exception {
  Pipeline p = TestPipeline.create();
  TableRow row = new TableRow().set("id", "x");
  PCollection<TableRow> rows = p.apply(Create.<TableRow>of(row, row));
  DataflowAssert.that(rows).containsInAnyOrder(row);
  p.run();
}
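As a rough model of why an extra copy produces "Not matched", here is a plain-Java sketch of an any-order check. It is not Hamcrest's code: the class name AnyOrderSketch and the method mismatch are hypothetical, elements are compared with equals() instead of matchers, and rows are represented as strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of an "contains in any order" check.
final class AnyOrderSketch {

    /** Returns null when the lists match as multisets, else a short mismatch description. */
    static <T> String mismatch(List<T> expected, List<T> actual) {
        List<T> remaining = new ArrayList<>(expected);
        for (T item : actual) {
            // Each actual item must consume one still-unused expected item.
            if (!remaining.remove(item)) {
                return "Not matched: <" + item + ">";
            }
        }
        // Any expected item left over was never seen in the actual collection.
        if (!remaining.isEmpty()) {
            return "No item matches: " + remaining;
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> expected = Arrays.asList("{id=x}");
        List<String> actual = Arrays.asList("{id=x}", "{id=x}");
        // The second copy has nothing left to match against.
        System.out.println(mismatch(expected, actual)); // prints "Not matched: <{id=x}>"
    }
}
```

The first copy of the row uses up the only expected element, so the second copy is the one reported as not matched, even though it looks identical to what was expected.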
To reproduce that failure with your code, I had to take advantage of the fact that you only iterate over the entries in table2. Specifically:
// Use these as the input tables.
table1 = [("keyA", "A1a"), ("keyA", "A1b")]
table2 = [("keyA", "A2a"), ("keyA", "A2b"), ("keyB", "B2")]
// The CoGroupByKey returns
[("keyA", (["A1a", "A1b"], ["A2a", "A2b"])),
("keyB", ([], ["B2"]))]
// When run through "Process join" this produces.
// For details on why see the next section.
["A2b,A2b",
"B2,B2"]
// When run through "Format join" this becomes the following.
[{id=x}, {id=x}]
Note that the DoFn in "Process join" probably does not produce the results you expect, as annotated below:
String key = e.getKey();
String value = null;
// NOTE: Both table1Value and table2Value iterate over pCollectionTable2Tag
for (String table1Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
  for (String table2Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
    // NOTE: this updates value, and doesn't output it. So for each
    // key there will be a single output with the *last* value
    // rather than one for each pair.
    value = table1Value + "," + table2Value;
  }
}
c.output(KV.of(key, value));
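The two notes above can be checked without a pipeline. The sketch below is plain Java, not Dataflow SDK code: the class ProcessJoinSketch and the methods buggyJoin and fixedJoin are hypothetical, and the two lists stand in for what getAll(pCollectionTable1Tag) and getAll(pCollectionTable2Tag) would return for one key. It assumes the intent was one output per (table1, table2) pair.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the per-key logic of "Process join".
final class ProcessJoinSketch {

    /** Mirrors the original loop: table1's values are ignored, only the last pair survives. */
    static List<String> buggyJoin(List<String> table1Values, List<String> table2Values) {
        String value = null;
        for (String table1Value : table2Values) { // bug kept on purpose: reads table2
            for (String table2Value : table2Values) {
                value = table1Value + "," + table2Value; // overwritten on every iteration
            }
        }
        return Collections.singletonList(value); // exactly one output per key
    }

    /** Assumed intent: emit every pairing of a table1 value with a table2 value. */
    static List<String> fixedJoin(List<String> table1Values, List<String> table2Values) {
        List<String> out = new ArrayList<>();
        for (String table1Value : table1Values) {
            for (String table2Value : table2Values) {
                out.add(table1Value + "," + table2Value); // output inside the inner loop
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> a1 = Arrays.asList("A1a", "A1b");
        List<String> a2 = Arrays.asList("A2a", "A2b");
        List<String> none = Collections.emptyList();
        List<String> b2 = Arrays.asList("B2");

        System.out.println(buggyJoin(a1, a2));   // one element: "A2b,A2b"
        System.out.println(buggyJoin(none, b2)); // one element: "B2,B2"
        System.out.println(fixedJoin(a1, a2).size());   // 4 pairs for keyA
        System.out.println(fixedJoin(none, b2).size()); // 0 pairs for keyB
    }
}
```

In the real DoFn the equivalent change would be to read the outer loop from pCollectionTable1Tag and call c.output inside the inner loop. With the sample tables that yields four rows for keyA and none for keyB, so the expected values passed to containsInAnyOrder would need to change to match.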