Question

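I'm using Pig to load files/folders given as a comma-separated list with Hadoop globbing (see this question on how to load multiple files in Pig).

The problem is that each folder has a different schema file (located inside the folder). Is it possible to supply multiple schema files at the same time?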

Answer 1

If your schema files are located outside the folders, you have to declare the schema when you do the load.

For example:

dataset_A = LOAD '/data/A' using PigStorage('\t') as (id:int, project:chararray, org:chararray); 
dataset_B = LOAD '/data/B' using PigStorage(',') as (id:int, beta:chararray, delta:chararray, echo:int);
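
If the separately loaded relations later need to be processed together, one option is to combine them by field name. The following is only a sketch: it assumes the two schemas declared above, and the alias `combined` is a hypothetical name introduced for illustration.

```pig
-- Load each folder separately with its own schema (same declarations as above).
dataset_A = LOAD '/data/A' USING PigStorage('\t') AS (id:int, project:chararray, org:chararray);
dataset_B = LOAD '/data/B' USING PigStorage(',') AS (id:int, beta:chararray, delta:chararray, echo:int);

-- UNION ONSCHEMA aligns fields by name; fields missing from one input are filled with nulls.
-- 'combined' is a hypothetical alias used only in this example.
combined = UNION ONSCHEMA dataset_A, dataset_B;
DESCRIBE combined;
```

Each LOAD still carries its own schema, so this does not supply several schema files to a single LOAD; it only merges the results afterwards.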

If a schema is declared in a .pig_schema file inside the directory, you can simply do the load without declaring the schema.

dataset_A = LOAD '/data/A' using PigStorage('\t'); 
dataset_B = LOAD '/data/B' using PigStorage(',');

Contents of /data/A/.pig_schema:

{"fields": [{"name":"id","type":10,"description":"autogenerated from Pig Field Schema","schema":null}, {"name":"project","type":55,"description":"autogenerated from Pig Field Schema","schema":null}, {"name":"org","type":55,"description":"autogenerated from Pig Field Schema","schema":null}], "version":0,"sortKeys":[],"sortKeyOrders":[]}

Contents of /data/B/.pig_schema:

{"fields": [{"name":"id","type":10,"description":"autogenerated from Pig Field Schema","schema":null}, {"name":"beta","type":55,"description":"autogenerated from Pig Field Schema","schema":null}, {"name":"delta","type":55,"description":"autogenerated from Pig Field Schema","schema":null}, {"name":"echo","type":10,"description":"autogenerated from Pig Field Schema","schema":null}], "version":0,"sortKeys":[],"sortKeyOrders":[]}
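
Schema files like the ones above are usually written by Pig rather than by hand. A minimal sketch of how one might be produced, assuming a Pig version whose PigStorage accepts the `-schema` option; the alias `raw_A` and the output path `/data/A_with_schema` are hypothetical:

```pig
-- Load once with an explicit schema (same fields as dataset_A above).
raw_A = LOAD '/data/A' USING PigStorage('\t') AS (id:int, project:chararray, org:chararray);

-- '-schema' asks PigStorage to write a hidden .pig_schema file alongside the output.
-- '/data/A_with_schema' is a hypothetical output path.
STORE raw_A INTO '/data/A_with_schema' USING PigStorage('\t', '-schema');
```

A later, separate script should then be able to load the output directory with a plain `PigStorage('\t')` and no `AS` clause, and `DESCRIBE` can be used to check which schema Pig picked up.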

Pig - loading multiple files with different schemas
