How do I sort a 3GB access log file?

Asked: 2016-04-25 08:03:45

Tags: shell sorting

Hi everyone: I have a 3GB Tomcat access log named urls, with one URL per line. I want to count the occurrences of each URL and sort the URLs by their counts. This is what I did:

awk '{print $0}' urls | sort | uniq -c | sort -nr >> output

But it is taking a very long time to finish; it has already been running for 30 minutes and is still going. The log file looks like this:

/open_api/borrow_business/get_apply_by_user
/open_api/borrow_business/get_apply_by_user
/open_api/borrow_business/get_apply_by_user
/open_api/borrow_business/get_apply_by_user
/loan/recent_apply_info?passportId=Y20151206000011745
/loan/recent_apply_info?passportId=Y20160331000000423
/open_api/borrow_business/get_apply_by_user
...

Is there another way to process and sort a 3GB file? Thanks in advance!

2 answers:

Answer 0 (score: 0)

I'm not sure why you're using awk right now - it isn't doing anything useful there.

I would suggest using something like this:

awk '{ ++urls[$0] } END { for (i in urls) print urls[i], i }' urls | sort -nr

This builds up a count of each URL in a single awk pass and then sorts only the (much smaller) count output, instead of sorting all 3GB of raw lines first.
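Run against the seven sample lines shown in the question (saved to a file named `urls`; that filename is just the question's), the command produces count-prefixed lines with the most frequent URL first:

```shell
awk '{ ++urls[$0] } END { for (i in urls) print urls[i], i }' urls | sort -nr
# The get_apply_by_user URL appears five times in the sample,
# so it sorts to the top, ahead of the two single-occurrence URLs.
```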

Answer 1 (score: 0)

I generated a sample file of 3,200,000 lines, amounting to 3GB, using Perl like this:

perl -e 'for($i=0;$i<3200000;$i++){printf "%d, %s\n",int rand 1000, "0"x1000}' > BigBoy

I then tried sorting it in one step; then splitting it into 2 halves, sorting the halves separately, and merging the results; then the same with 4 parts; then with 8.

This resulted, on my machine at least, in a very significant speedup.
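A related option, if you would rather not split by hand: GNU coreutils sort can parallelize and tune its memory use internally. This is a sketch assuming a GNU sort (the `--parallel` and `-S` flags are GNU extensions, not POSIX):

```shell
# Let GNU sort use 8 threads and a 2 GiB in-memory buffer,
# which cuts down on temporary files spilled to disk.
sort --parallel=8 -S 2G -n BigBoy > sorted.gnu
```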


Here is the script. The filename is hard-coded as BigBoy but could easily be changed, and the number of parts to split the file into must be supplied as a parameter.

#!/bin/bash -xv
################################################################################
# Sort large file by parts and merge result
#
# Generate sample large (3GB with 3,200,000 lines) file with:
# perl -e 'for($i=0;$i<3200000;$i++){printf "%d, %s\n",int rand 1000, "0"x1000}' > BigBoy
################################################################################
file=BigBoy
N=${1:-1}                       # number of parts to split into (default 1)
echo $N
if [ "$N" -eq 1 ]; then
   # Straightforward single-pass sort
   sort -n "$file" > "sorted.$N"
else
   # Clean up leftovers from any previous run
   rm sortedparts-* parts-* 2> /dev/null
   tlines=$(wc -l < "$file")
   echo "$tlines"
   ((plines=tlines/N))          # lines per part
   echo "$plines"
   split -l "$plines" "$file" parts-
   # Sort all parts in parallel as background jobs...
   for f in parts-*; do
      sort -n "$f" > "sortedparts-$f" &
   done
   wait
   # ...then merge the already-sorted parts (cheap: -m only merges)
   sort -n -m sortedparts-* > "sorted.$N"
fi

Needless to say, the resulting sorted files are identical :-)
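You can confirm that claim yourself with a byte-for-byte comparison; this assumes `sorted.1` and `sorted.4` are left over from runs of the script with N=1 and N=4:

```shell
# cmp exits 0 only when the files are byte-identical
cmp sorted.1 sorted.4 && echo "identical"
```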