Question

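I'm stuck on a Pig query. I have a data file containing customer information, and two files in which the customers' gender data is available.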

The data file might be

CustomerId Age
100 27
101 17
102 25
103 21

File1 might be

CustomerId Gender
100 M
102 F

File2 might be

CustomerId Gender
101 F
102 M
103 M

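Now, I want the output to work as follows. If a customer ID exists in File1, the gender should be taken from File1. If it does not, the gender should be taken from File2.

So, I want the output to be

CustomerId Age Gender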
100 27 M
101 17 F
102 25 F (taken from File1, since File1 has priority)
103 21 M

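If I do a left outer join of the data file with File1 on CustomerId, I get NULL for the gender of CustomerIds 101 and 103. I now want to fill in the gender for those two IDs from File2, but I can't get this to work. Also, do we need to do the left outer join first at all?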

Answer 1

Assuming you have already loaded the data:

DESCRIBE file1;
file1: {(id:int, gender:chararray)}
DESCRIBE file2;
file2: {(id:int, gender:chararray)}
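
For reference, the loading step might look like the minimal sketch below. The file paths and the space delimiter are assumptions (the question does not state them), and the sketch assumes the files contain no header row.

```pig
-- Hypothetical paths and delimiter; adjust both to match your data.
-- Assumes the files have no header line.
file1 = LOAD 'file1.txt' USING PigStorage(' ') AS (id:int, gender:chararray);
file2 = LOAD 'file2.txt' USING PigStorage(' ') AS (id:int, gender:chararray);
```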

You join them like this:

joined = JOIN file1 BY id FULL OUTER, file2 BY id;
DESCRIBE joined;
joined: {(file1::id:int, file1::gender:chararray, file2::id:int, file2::gender:chararray)}

Giving file1 priority while still ensuring a non-null gender takes only the ternary operator:

genders =
    FOREACH joined
    GENERATE
        ((file1::id IS NOT NULL) ? file1::id : file2::id) AS id,
        ((file1::gender IS NOT NULL) ? file1::gender : file2::gender) AS gender;

Now that you have a master list of genders for each customer ID, you can join it with your main data file and go on from there however you like.
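
That final join could be sketched as follows. The relation names `data`, `with_gender`, and `result`, the file path, and the delimiter are illustrative assumptions; `genders` is the relation built above.

```pig
-- Hypothetical path and delimiter for the main data file; assumes no header row.
data = LOAD 'data.txt' USING PigStorage(' ') AS (id:int, age:int);

-- LEFT OUTER keeps every customer from the data file, even one that
-- appears in neither gender file (its gender would then be null).
with_gender = JOIN data BY id LEFT OUTER, genders BY id;

-- Keep a single id column and drop the duplicate from the join.
result =
    FOREACH with_gender
    GENERATE
        data::id AS id,
        data::age AS age,
        genders::gender AS gender;

DUMP result;
```

A left outer join is used here rather than an inner join so that customers with no gender in either file are not silently dropped. If such rows are unwanted, a plain `JOIN data BY id, genders BY id` should work instead.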

Pig query with two files
