Question

I wrote a UDF (extends EvalFunc<Tuple>) whose output tuple contains inner tuples (nested).

For example, a dump looks like this:

(((photo,photos,photo)))
(((wedg,wedge),(audusd,audusd)))
(((quantum,quantum),(mind,mind)))
(((cassi,cassie),(cancion,canciones)))
(((calda,caldas),(nova,novas),(rodada,rodada)))
(((fingerprint,fingerprint),(craft,craft),(easter,easter)))

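Now I want to process each term, make the terms distinct, and give each one an id (RANK). To do that, I need to get rid of the parentheses. A simple FLATTEN does not help in this case.

The final output should be: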

1 photo
2 photos
3 wedg
4 wedge
5 audusd
6 quantum
7 mind
....

My code (excluding the UDF and the raw parsing):

tags = FOREACH raw GENERATE FLATTEN(tags) AS tag;
tags_distinct = DISTINCT tags;
tags_sorted = RANK tags_distinct BY tag;
DUMP tags_sorted;

Answer 1

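I think your UDF's return type is not the best fit for your workflow. Instead of returning a tuple with a variable number of fields (each of which is itself a tuple), it would be more convenient to return a bag of tuples.

Instead of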

(((wedg,wedge),(audusd,audusd)))

you would have

({(wedg,wedge),(audusd,audusd)})

You will then be able to FLATTEN that bag in order to: 1. apply DISTINCT 2. RANK the tags

To do this, update your UDF:

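import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class MyUDF extends EvalFunc<DataBag> {
    @Override
    public DataBag exec(Tuple input) throws IOException {
        // Create the output bag; add one tuple per result here
        DataBag bag = BagFactory.getInstance().newDefaultBag();
        return bag;
    }
}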
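
To make the flatten, DISTINCT, RANK sequence concrete, here is a minimal sketch in plain Java. It does not use the Pig API: nested lists stand in for bags and tuples, and the class and method names (TermRanker, rankTerms) are hypothetical, chosen only for illustration. It treats every field of every inner tuple as one term, and it numbers the terms in sorted order, as RANK ... BY tag would, rather than in the order of first appearance shown in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Plain-Java model of the Pig steps (hypothetical names, not Pig API).
class TermRanker {

    // Each row models a bag (list) of tuples; each tuple is a list of terms.
    static List<String> rankTerms(List<List<List<String>>> rows) {
        // TreeSet drops duplicates (DISTINCT) and keeps terms sorted (the BY tag ordering).
        TreeSet<String> distinct = new TreeSet<>();
        for (List<List<String>> bag : rows) {
            for (List<String> tuple : bag) {
                distinct.addAll(tuple); // un-nest: every field becomes one term
            }
        }
        // Number the sorted, distinct terms starting at 1 (RANK).
        List<String> ranked = new ArrayList<>();
        int rank = 1;
        for (String term : distinct) {
            ranked.add(rank + " " + term);
            rank++;
        }
        return ranked;
    }

    public static void main(String[] args) {
        List<List<List<String>>> rows = Arrays.asList(
            Arrays.asList(Arrays.asList("photo", "photos", "photo")),
            Arrays.asList(Arrays.asList("wedg", "wedge"), Arrays.asList("audusd", "audusd")));
        for (String line : rankTerms(rows)) {
            System.out.println(line);
        }
    }
}
```

In Pig itself, flattening a bag of two-field tuples such as (wedg,wedge) would produce two-column rows, so to get one term per row the UDF would probably need to put each term into its own single-field tuple inside the bag; that is an assumption about the intended design, not something the answer states.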
