Question

我的列表1包含以下架构

{customerId：int，storeId：int，products：{（prodId：int，name：chararray）}}

具有以下架构的客户列表

{uniqueId：int，customerId：int，name：chararray}

使用以下架构存储列表

{uniqueId：int，storeNum：int，name：chararray}

和带有架构的产品清单

{uniqueId：int，sku：int，productName：chararray}

现在我想搜索列表1中每个项目的customerId，storeId和prodId以及其他列表，以检查ID是否有效。有效的项目必须存储在文件中，而无效的项目存储在另一个文件中。

由于PIG对我来说很新，我觉得这很复杂。请给我一个很好的逻辑来使用Apache PIG来完成这项工作。

Answer 1

首先加载所有数据，将它们视为表格

cust_data = LOAD '\your\path\to\customer\data' USING PigStorage() as (uniqueId: int, customerId: int, name: chararray);

store_data = LOAD '\your\path\to\store\data' USING PigStorage() as (uniqueId: int, storeNum: int, name: chararray);

product_data = LOAD '\your\path\to\product\data' USING PigStorage() as (uniqueId: int, sku: int, productName: chararray);

您可以通过

检查加载的数据架构

DESCRIBE cust_data;
DESCRIBE store_data;
DESCRIBE product_data;

首先使用uniqueId（我们正在进行等值连接）

来加入客户和商店数据

cust_store_join = JOIN cust_data BY uniqueId, store_data BY uniqueId;

然后生成列

cust_store = FOREACH cust_store_join GENERATE cust_data::uniqueId as uniqueId, cust_data::customerId as customerId, cust_data::name as cust_name, store_data::storeNum as storeNum, store_data::name as store_name;

现在使用uniqueId加入客户商店和产品（我们正在进行等值连接）

cust_store_product_join = JOIN cust_store BY uniqueId, product_data BY uniqueId;

最终生成所有您想要的列

customer_store_product = FOREACH cust_store_product_join GENERATE cust_store::uniqueId as uniqueId, cust_store::customerId as customerId, cust_store::cust_name as cust_name, cust_store::storeNum as storeNum, product_data::sku as sku, product_data::productName as productName;

现在将您想要的列存储在local / hdfs目录中 below store命令将存储来自所有三个表的所有匹配的uniqueId，即客户，商店，产品

STORE customer_store_product INTO '\your\output\path' USING PigStorage(',');

同样，您可以加入list1架构并使用相同的逻辑生成列和存储数据。希望这有帮助

如何使用PIG脚本验证列表

1 个答案: