I have three logs: a Squid access log, a login log, and a logout log. I need to cross-reference these logs to find out which websites each user visited. I am using Apache Pig and wrote the following script to do this:
copyFromLocal /home/marcelo/Documentos/hadoop/squid.txt /tmp/squid.txt;
copyFromLocal /home/marcelo/Documentos/hadoop/samba.log_in /tmp/login.txt;
copyFromLocal /home/marcelo/Documentos/hadoop/samba.log_out /tmp/logout.txt;
squid = LOAD '/tmp/squid.txt' USING PigStorage AS (linha: chararray);
nsquid = FOREACH squid GENERATE FLATTEN (STRSPLIT(linha,'[ ]+'));
nsquid = FOREACH nsquid GENERATE $0 AS timeStamp:chararray, $2 AS ipCliente:chararray, $5 AS request:chararray, $6 AS url:chararray;
nsquid = FOREACH nsquid GENERATE FLATTEN (STRSPLIT(timeStamp,'[.]'))AS (timeStamp:int,resto:chararray),ipCliente,request,url;
nsquid = FOREACH nsquid GENERATE (int)$0 AS timeStamp:int, $2 AS ipCliente:chararray,$3 AS request:chararray, $4 AS url:chararray;
connect = FILTER nsquid BY (request=='CONNECT');
login = LOAD '/tmp/login.txt' USING PigStorage(' ') AS (serverAL: chararray, data: chararray, hora: chararray, netlogon: chararray, on: chararray, ip: chararray);
nlogin = FOREACH login GENERATE FLATTEN(STRSPLIT(serverAL,'[\\\\]')),data, hora,FLATTEN(STRSPLIT(ip,'[\\\\]'));
nlogin = FOREACH nlogin GENERATE $1 AS al:chararray, $2 AS data:chararray, $3 AS hora:chararray, $4 AS ipCliente:chararray;
logout = LOAD '/tmp/logout.txt' USING PigStorage(' ') AS (data: chararray, hora: chararray, logout: chararray, ipAl: chararray, disconec: chararray);
nlogout = FOREACH logout GENERATE data, hora, FLATTEN(STRSPLIT(ipAl,'[\\\\]'));
nlogout = FOREACH nlogout GENERATE $0 AS data:chararray,$1 AS hora:chararray,$2 AS ipCliente:chararray, $3 AS al:chararray;
data = JOIN nlogin BY (al,ipCliente,data), nlogout BY (al,ipCliente,data);
ndata = FOREACH data GENERATE nlogin::al,ToUnixTime(ToDate(CONCAT(nlogin::data, nlogin::hora),'dd/MM/yyyyHH:mm:ss', 'GMT')) AS tslogin:int,ToUnixTime(ToDate(CONCAT(nlogout::data, nlogout::hora),'dd/MM/yyyyHH:mm:ss', 'GMT')) AS tslogout:int,nlogout::ipCliente;
BB = FOREACH ndata GENERATE $0 AS al:chararray, (int)$1 AS tslogin:int, (int)$2 AS tslogout:int, $3 AS ipCliente:chararray;
CC = JOIN BB BY ipCliente, connect BY ipCliente;
DD = FOREACH CC GENERATE BB::al AS al:chararray, (int)BB::tslogin AS tslogin:int, (int)BB::tslogout AS tslogout:int,(int)connect::timeStamp AS timeStamp:int, connect::ipCliente AS ipCliente:chararray, connect::url AS url:chararray;
EE = FILTER DD BY (tslogin<=timeStamp) AND (timeStamp<=tslogout);
STORE EE INTO 'EEs';
But it returns the following error:
2015-10-16 21:58:10,600 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2015-10-16 21:58:10,600 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201510162141_0008 has failed! Stop running all dependent jobs
2015-10-16 21:58:10,600 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2015-10-16 21:58:10,667 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 0: Error while executing ForEach at [DD[93,5]]
2015-10-16 21:58:10,667 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2015-10-16 21:58:10,667 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.2.1 0.12.1 root 2015-10-16 21:56:48 2015-10-16 21:58:10 HASH_JOIN,FILTER
Some jobs have failed! Stop running all dependent jobs
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_201510162141_0007 2 1 4 3 4 4 9 9 9 9 BB,data,login,logout,ndata,nlogin,nlogout HASH_JOIN
Failed Jobs:
JobId Alias Feature Message Outputs
job_201510162141_0008 CC,DD,EE,connect,nsquid,squid HASH_JOIN Message: Job failed! Error - # of failed Reduce Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201510162141_0008_r_000000 hdfs://localhost:9000/user/root/EEb,
Input(s):
Successfully read 7367 records from: "/tmp/login.txt"
Successfully read 7374 records from: "/tmp/logout.txt"
Failed to read data from "/tmp/squid.txt"
Output(s):
Failed to produce result in "hdfs://localhost:9000/user/root/EEb"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201510162141_0007 -> job_201510162141_0008,
job_201510162141_0008
2015-10-16 21:58:10,674 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 11 time(s).
2015-10-16 21:58:10,674 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Some jobs have failed! Stop running all dependent jobs
I created a working alternative that simply replaces the second-to-last line with:
STORE DD INTO 'DD';
newDD = LOAD 'hdfs://localhost:9000/user/root/DD' USING PigStorage AS (al:chararray, tslogin:int, tslogout:int, timeStamp:int, ipCliente:chararray, url:chararray);
EE = FILTER newDD BY (tslogin<=timeStamp) AND (timeStamp<=tslogout);
Does anyone know how to fix it without the STORE?
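One variation I have not fully verified (a sketch, assuming the failure comes from the typed FLATTEN ... AS clause: STRSPLIT produces chararrays, and AS (timeStamp:int, ...) only declares the schema as int without converting the underlying value, so the later (int) casts become no-ops and the ForEach in DD fails at runtime, while the STORE/LOAD round trip works because PigStorage rewrites the data as text and the LOAD schema forces a real conversion) would be to keep the split result as chararray and cast it explicitly. Here, ts is just a renamed intermediate field:

-- declare the split output as chararray (what STRSPLIT actually produces),
-- then cast explicitly so Pig performs a real string-to-int conversion
nsquid = FOREACH nsquid GENERATE FLATTEN(STRSPLIT(timeStamp,'[.]')) AS (ts:chararray, resto:chararray), ipCliente, request, url;
nsquid = FOREACH nsquid GENERATE (int)ts AS timeStamp, ipCliente, request, url;

But I am not sure this addresses the root cause.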