Question

I'm trying to split a large log file that contains several months of log entries into one log file per date. It has thousands of lines like the following:

Sep 4 11:45 kernel: Entry
Sep 5 08:44 syslog: Entry

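I want to split it so that the files logfile.20090904 and logfile.20090905 contain the matching entries.

I have written a program that reads each line and sends it to the appropriate file, but it runs slowly (especially because I have to convert the month name to a number). I've also considered running a grep for each day, which would require finding the first date in the file, but that seems slow too.

Is there a more efficient solution? Maybe I'm missing a better command-line program.

Here is my current solution: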

#!/bin/bash
while IFS= read -r line; do
  dts="${line:0:6}"
  dt="$(date -d "$dts" +'%Y%m%d')"
  # Note that I could do some caching here of the date, assuming
  # that dates are together.
  printf '%s\n' "$line" >> "$FILE.$dt" 2> /dev/null
done < "$FILE"

Answer 1

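@OP, try not to use a bash while-read loop to iterate over large files. It has been tried and shown to be slow, and on top of that you are calling the external date command for every line of the file you read. Here is a more efficient way, using only gawk: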

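gawk 'BEGIN {
    # build a lookup table once: mon["Jan"] = 1 ... mon["Dec"] = 12
    m = split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec", mth, "|")
    for (i = 1; i <= m; i++) mon[mth[i]] = i
}
{
    # the log lines carry no year, so 2009 is hard-coded here
    tt = "2009 " mon[$1] " " $2 " 00 00 00"
    date = strftime("%Y%m%d", mktime(tt))
    # parenthesize the target so the redirection is unambiguous
    print $0 > (FILENAME "." date)
}
' logfile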

Output

$ more logfile
Sep 4 11:45 kernel: Entry
Sep 5 08:44 syslog: Entry

$ ./shell.sh

$ ls -1 logfile.*
logfile.20090904
logfile.20090905

$ more logfile.20090904
Sep 4 11:45 kernel: Entry

$ more logfile.20090905
Sep 5 08:44 syslog: Entry
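
A variation on the same idea, offered only as a sketch: the date suffix can be built with sprintf instead of mktime/strftime, so it should not depend on gawk's time functions. The function name split_by_day_awk and the year argument are assumptions made for illustration; the log lines carry no year, so one has to be supplied.

```shell
# Hypothetical helper: split_by_day_awk LOGFILE [YEAR]
split_by_day_awk() {
    awk -v year="${2:-2009}" '
    BEGIN {
        # mon["Jan"] = 1 ... mon["Dec"] = 12
        n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
        for (i = 1; i <= n; i++) mon[names[i]] = i
    }
    {
        # e.g. logfile.20090904 for a line starting with "Sep 4"
        out = sprintf("%s.%04d%02d%02d", FILENAME, year, mon[$1], $2)
        if (out != prev) {
            # the date changed: close the previous file to limit open descriptors
            if (prev != "") close(prev)
            prev = out
        }
        # append, so a date that shows up again later is not truncated
        print $0 >> out
    }' "$1"
}
```

Because the output is opened in append mode, running it twice on the same log duplicates the entries; remove the old logfile.* files first.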

Answer 2

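Given what you've already done, the simplest approach is to just name the files "Sep 4" and so on, and then rename them all at the end. That way you only need to read a fixed number of characters per line, with no extra processing.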
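
A minimal sketch of that rename-at-the-end idea, assuming bash and GNU date (the question already relies on date -d). The function name split_then_rename, the ".tmp." naming scheme, and the year argument are illustrative assumptions, not part of the original answer.

```shell
# Hypothetical helper: split_then_rename LOGFILE [YEAR]
split_then_rename() {
    local file=$1 year=${2:-2009} line mon day tmp key
    # Pass 1: file each line under its raw "Mon D" prefix, with no date call per line.
    while IFS= read -r line; do
        read -r mon day _ <<< "$line"
        [ -n "$day" ] || continue            # skip blank or malformed lines
        printf '%s\n' "$line" >> "$file.tmp.$mon.$day"
    done < "$file"
    # Pass 2: one date call per output file instead of one per log line.
    for tmp in "$file".tmp.*; do
        [ -e "$tmp" ] || continue            # the glob matched nothing
        key=${tmp#"$file.tmp."}              # e.g. "Sep.4"
        mon=${key%%.*}
        day=${key#*.}
        mv -- "$tmp" "$file.$(date -d "$mon $day $year" +%Y%m%d)"
    done
}
```

Called as split_then_rename logfile 2009, this should leave logfile.20090904, logfile.20090905, and so on. Note that mv overwrites any dated file that already exists.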

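If for some reason you don't want to do that, but you know the dates are in order, you can cache the previous date in both forms and do a string comparison to decide whether you need to run date again or can just reuse the cached date.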
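
The caching variant could look roughly like this. It is a sketch that assumes bash, GNU date, and a six-character "Mon D" prefix as in the question's code; split_with_cache and the year argument are hypothetical names introduced here.

```shell
# Hypothetical helper: split_with_cache LOGFILE [YEAR]
split_with_cache() {
    local file=$1 year=${2:-2009} line key prev_key='' dt=''
    while IFS= read -r line; do
        key=${line:0:6}                      # raw form, e.g. "Sep 4 "
        if [[ $key != "$prev_key" ]]; then
            # Only run date when the prefix differs from the cached one.
            dt=$(date -d "$key $year" +%Y%m%d)
            prev_key=$key
        fi
        printf '%s\n' "$line" >> "$file.$dt"
    done < "$file"
}
```

With entries grouped by day, date runs once per day rather than once per line. If the input is not ordered the result is still correct, just with more date calls.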

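Finally, if speed really is a problem, you could try perl or python instead of bash. You aren't doing anything too crazy here (apart from starting a subshell and a date process for every line, which we have already worked out how to avoid), so I don't know how much it would help.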

Answer 3

A skeleton of the script:

BIG_FILE=big.txt

# remove $BIG_FILE when the script exits
trap 'rm -f "$BIG_FILE"' EXIT

# $FILES is the list of input log files (left unquoted so it splits into words)
cat $FILES > "$BIG_FILE" || { echo "cat failed"; exit 1; }

# sort file by date in place: month name first, then day of month
sort -k1,1M -k2,2n -o "$BIG_FILE" "$BIG_FILE" || { echo "sort failed"; exit 1; }

PREV_DATE_STR=
while IFS= read -r line; do
   # extract date part from line ("Sep 4 " in the sample log)
   DATE_STR=${line:0:6}

   # a new date - create a new file
   if [[ "$DATE_STR" != "$PREV_DATE_STR" ]]; then
       # close file descriptor of "dated" file
       exec 5>&-
       PREV_DATE_STR=$DATE_STR

       # open file of a "dated" file for write
       FILE_NAME= ... set to file name ...
       exec 5>"$FILE_NAME" || { echo "exec failed"; exit 1; }
   fi

   printf '%s\n' "$line" >&5 || { echo "print failed"; exit 1; }
done < "$BIG_FILE"

Answer 4

This script runs its inner loop 365 or 366 times, once for each day of the year, instead of iterating over every line of the log file:

#!/bin/bash
month=0
months=(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)
for eom in 31 29 31 30 31 30 31 31 30 31 30 31
do
    (( month++ ))
    echo "Month $month"
    if (( month == 2 ))    # see what day February ends on
    then
        eom=$(date -d "3/1 - 1 day" +%-d)
    fi
    for (( day=1; day<=eom; day++ ))
    do
        grep "^${months[$month - 1]} $day " dates.log > temp.out
        if [[ -s temp.out ]]
        then
            mv temp.out file.$(date -d $month/$day +"%Y%m%d")
        else
            rm temp.out
        fi
        # instead of creating a temp file and renaming or removing it,
        # you could go ahead and let grep create empty files and let find
        # delete them at the end, so instead of the grep and if/then/else
        # immediately above, do this:
        # grep --color=never "^${months[$month - 1]} $day " dates.log > file.$(date -d $month/$day +"%Y%m%d")
    done
done
# if you let grep create empty files, then do this:
# find -type f -name "file.2009*" -empty -delete

Arranging log entries into dated files
