I want to parse an Apache log file, for example:
1.1.1.1 - - [12/Dec/2019:18:25:11 +0100] "GET /endpoint1/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
1.1.1.1 - - [13/Dec/2019:18:25:11 +0100] "GET /endpoint1/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
2.2.2.2 - - [13/Dec/2019:18:27:11 +0100] "GET /endpoint1/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
2.2.2.2 - - [13/Jan/2020:17:15:13 +0100] "GET /endpoint2/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
3.3.3.3 - - [13/Jan/2020:17:15:13 +0100] "GET /endpoint2/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
1.1.1.1 - - [13/Feb/2020:17:15:13 +0100] "GET /endpoint2/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
4.4.4.4 - - [13/Feb/2020:17:15:13 +0100] "GET /endpoint2/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
4.4.4.4 - - [13/Feb/2020:17:15:13 +0100] "GET /endpoint2/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
4.4.4.4 - - [13/Feb/2020:17:15:13 +0100] "GET /endpoint2/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
I need to get the list of client IPs that visited in each month. I have something like this:
awk '{print $1,$4}' access.log | grep Dec | cut -d" " -f1 | uniq -c
but it is wrong, because it counts the visiting IPs per day.
The expected result should look like this (indentation does not matter):
Dec 2019
1.1.1.1 2
2.2.2.2 1
Jan 2020
2.2.2.2 1
3.3.3.3 1
Feb 2020
4.4.4.4 3
1.1.1.1 1
where 2 is the total number of visits from IP 1.1.1.1 in Dec 2019.
Could you suggest an approach?
Answer 0 (score: 2)
A GNU awk solution that outputs in the order of the input data (i.e., in chronological order for chronological data such as log records):
$ gawk '                   # using GNU awk
BEGIN {
    a[""][""]              # initialize a 2D array
}
{
    split($4,t,/[/:]/)     # split datetime
    my=t[2] OFS t[3]       # my=month year
    if(!(my in mye)) {     # if current my unseen
        mye[my]=++myi      # update month-year-exists array with a new index
        mya[myi]=my        # chronology is made
    }
    a[mye[my]][$1]++       # update record to a hash
}
END {                      # in the end
    # PROCINFO["sorted_in"]="@val_num_desc" # this may work for ordering visits
    for(i=1;i<=myi;i++) {  # in fed order
        print mya[i]       # print month year
        for(j in a[i])     # then related IPs in no particular order
            print j,a[i][j]  # output IP and count
    }
}' file
Output:
Dec 2019
1.1.1.1 2
2.2.2.2 1
Jan 2020
2.2.2.2 1
3.3.3.3 1
Feb 2020
1.1.1.1 1
4.4.4.4 3
Answer 1 (score: 1)
Although your expected output does not seem to match the samples you showed, based on your shown sample output and description you could try the following. Also, since this is a log file, I would go with awk's field-separator approach, because logs follow a fixed pattern.
awk -F':| |-|/+|]' '
{
    ind[$7 OFS $8 OFS $1]++
    value[$7 OFS $8 OFS $1]=$1
}
END{
    for(i in value){
        split(i,arr," ")
        print arr[1],arr[2] ORS value[i],ind[i]
    }
}' Input_file
Explanation: adding a detailed explanation of the above.
awk -F':| |-|/+|]' '  ##Starting awk program from here and setting field separators as : space - / ] here.
{
    ind[$7 OFS $8 OFS $1]++     ##Creating ind array whose index is the 7th, 8th and 1st fields and increasing its value by 1 here.
    value[$7 OFS $8 OFS $1]=$1  ##Creating value array with index of the 7th, 8th and 1st fields and whose value is the 1st field.
}
END{                            ##Starting END block of this program from here.
    for(i in value){            ##Traversing through value elements here.
        split(i,arr," ")        ##Splitting i into array arr with delimiter as space here.
        print arr[1],arr[2] ORS value[i],ind[i]  ##Printing the 1st and 2nd elements of arr with ORS(new line), then the value and ind entries here.
    }
}' Input_file                   ##Mentioning Input_file name here.
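To see why $7 and $8 land on the month and year with that field separator, it can help to print the split fields for one sample line (a quick sanity check, not part of the answer itself):

```shell
# With -F':| |-|/+|]' the leading "- -" produces empty fields $2..$5,
# so "[12" becomes $6 and the month and year land in $7 and $8.
echo '1.1.1.1 - - [12/Dec/2019:18:25:11 +0100] "GET / HTTP/1.1" 200 1' |
awk -F':| |-|/+|]' '{print "$1=" $1, "$7=" $7, "$8=" $8}'
# → $1=1.1.1.1 $7=Dec $8=2019
```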
Answer 2 (score: 0)
Give this a try.
Shell script:
#!/usr/bin/env bash
LOG_FILE=$1

# regex to find mmm/yyyy
dateUniq=$(grep -oP '(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\/\d{4}' "$LOG_FILE" | sort | uniq)

for i in $dateUniq
do
    # output mmm yyyy
    echo "$i" | sed 's/\// /g'
    # regex to find ip
    ipUniq=$(grep "$i" "$LOG_FILE" | grep -oP '(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)' | sort | uniq)
    for x in $ipUniq
    do
        count=$(grep "$i" "$LOG_FILE" | grep -c "$x")
        # output count ip
        echo "$count" "$x"
    done
    echo
done
Output:
Dec 2019
2 1.1.1.1
1 2.2.2.2
Feb 2020
1 1.1.1.1
3 4.4.4.4
Jan 2020
1 2.2.2.2
1 3.3.3.3
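One caveat: because dateUniq comes from a plain `sort`, the months print alphabetically (Dec, Feb, Jan above) rather than chronologically. If GNU sort is available, its `-M` (month-name) key can restore log order; a small sketch of just that sorting step:

```shell
# Sort mmm/yyyy tokens by year (numeric), then by month name
# (requires GNU sort for the -M month-name key).
printf '%s\n' 'Feb/2020' 'Dec/2019' 'Jan/2020' |
sort -t/ -k2,2n -k1,1M
# → Dec/2019
# → Jan/2020
# → Feb/2020
```

In the script above, this could replace the plain `sort | uniq` when building dateUniq.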