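I'm having a lot of trouble loading certain directories and processing them.
My idea is to process only the files that have not been processed yet. To do that, I store the timestamp of each run in HDFS once processing finishes. That makes it easy to tell whether a file has already been processed (by comparing the last processing timestamp with the current timestamp).
Here is my script: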
--process latest
register hdfs:/udf/myudf.jar;
define toDate tech.main.tics.convertDate();
define startTS tech.main.tics.startTS();
define endTS tech.main.tics.endTS();
raw = LOAD 'hdfs:/home/raw/report/last_process_time/part-r-00000' AS (DATE);
start_ts = foreach raw generate startTS(DATE);
end_ts = FOREACH raw GENERATE endTS(ToUnixTime(CurrentTime()));
store start_ts into '/home/raw/report/start-ts';
store end_ts into '/home/raw/report/end-ts';
run -param START=/home/raw/report/start-ts/part-m-00000 -param END=/home/raw/report/end-ts/part-r-00000 hdfs:/home/raw/pig-script/update_test.pig
Here is my update_test.pig:
register 'hdfs:/udf/elephant-bird-pig-4.10.jar';
register 'hdfs:/udf/elephant-bird-core-4.10.jar';
register 'hdfs:/udf/elephant-bird-hadoop-compat-4.10.jar';
register 'hdfs:/udf/json-simple-1.1.1.jar';
register hdfs:/udf/myudf.jar;
define toDate tech.main.tics.convertDate();
define toBag tech.main.tics.MapToBag();
last_processed = LOAD 'hdfs:/home/raw/report/last_process_time/part-r-00000' AS (DATE);
previous1 = LOAD 'hdfs:/home/raw/report/events_by_application/part-r-00000';
raw = LOAD '/home/raw/dummy-logs/{$START..$END}/*' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
scene = foreach raw generate
(float)json#'value' AS VALUE,
(long)json#'ts' AS TS,
toDate(json#'ts') AS DATE;
store scene into 'hdfs:/home/raw/report2/total-scene';
--temporarily disabled
--rmf /home/raw/report/
--fs -mv /home/raw/report2/. /home/raw/report
--rmf /home/raw/report2
Pig keeps treating my substitution parameters as literal paths instead of reading the contents of those files.
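To show what I mean, here is how I understand the LOAD statement in update_test.pig after parameter substitution. This is a sketch that assumes Pig pastes the -param values in as plain text; it is my reading of the behaviour, not output copied from a log, and the placeholders in the first comment are only illustrative.

```
-- what I intended: a range built from the timestamps stored in the two files
-- raw = LOAD '/home/raw/dummy-logs/{<start timestamp>..<end timestamp>}/*' ...

-- what Pig appears to build instead: the -param values are inserted verbatim
raw = LOAD '/home/raw/dummy-logs/{/home/raw/report/start-ts/part-m-00000../home/raw/report/end-ts/part-r-00000}/*'
    USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
```

So the glob ends up containing the two file paths themselves rather than the timestamps stored inside those files.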
What am I doing wrong?
Thanks.