Unix: parse a text file and split it into multiple files based on a pattern

Time: 2018-01-06 17:58:17

Tags: unix awk split grep

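I have a file like the one shown further down, and I want to split it into multiple files based on a pattern. Each block contains some job information (Job Number =), and the first line carries the parent information, e.g. %HOSTNAME#PARENT_UNIQUE_ID_xxxxxx.JOB_NAME

I want to extract the lines between one %HOSTNAME#PARENT_UNIQUE_ID_xxxxxx.JOB_NAME line and the next, including the %HOSTNAME#PARENT_UNIQUE_ID_xxxxxx.JOB_NAME line itself.

Here is what I am doing at the moment. It splits the file as needed, producing files named like this: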

HOSTNAME#PARENT_UNIQUE_ID_000001.JOB_NAME_jobProperties.txt
HOSTNAME#PARENT_UNIQUE_ID_000002.JOB_NAME_jobProperties.txt

while IFS= read -r line; do
        if [[ $line =~ "%sj" ]]; then
                # Everything after the first space is the parent object name
                job_prop_objct_name=$(echo "$line" | grep -o -P '(?<= ).*')
                echo "$line" > "${job_prop_objct_name}_jobProperties.txt"
        else
                echo "$line" >> "${job_prop_objct_name}_jobProperties.txt"
        fi
done < "$1"

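The problem is that the text file sometimes contains more than one job (Job Number =) under a single parent, as in the last two blocks of the sample below, and my code merges those jobs into one file.

I want to split these blocks into separate files, perhaps by adding the job number to the file name.

Text file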

%sj HOSTNAME#PARENT_UNIQUE_ID_000001.JOB_NAME
General Information
Job = JOB_NAME
Workstation = HOSTNAME
Scheduled Time = 01/06/2018 06:00 TZ CST
Runtime Information
Status = Successful
Job Number = 12345
Time Information
Maximum Duration =
Extra Information
-
%sj HOSTNAME#PARENT_UNIQUE_ID_000002.JOB_NAME
General Information
Job = JOB_NAME
Workstation = HOSTNAME
Scheduled Time = 01/06/2018 06:00 TZ CST
Runtime Information
Status = Successful
Job Number = 12346
Time Information
Maximum Duration =
Extra Information
-
%sj HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME
General Information
Job = JOB_NAME
Workstation = HOSTNAME
Scheduled Time = 01/06/2018 06:00 TZ CST
Runtime Information
Status = Successful
Job Number = 12347
Time Information
Maximum Duration =
Extra Information
-
General Information
Job = JOB_NAME
Workstation = HOSTNAME
Scheduled Time = 01/06/2018 06:00 TZ CST
Runtime Information
Status = Successful
Job Number = 12348
Time Information
Maximum Duration =
Extra Information
-

The resulting files currently look like this:

HOSTNAME#PARENT_UNIQUE_ID_000001.JOB_NAME.txt

%sj HOSTNAME#PARENT_UNIQUE_ID_000001.JOB_NAME
General Information
Job = JOB_NAME
Workstation = HOSTNAME
Scheduled Time = 01/06/2018 06:00 TZ CST
Runtime Information
Status = Successful
Job Number = 12345
Time Information
Maximum Duration =
Extra Information
-

HOSTNAME#PARENT_UNIQUE_ID_000002.JOB_NAME.txt

%sj HOSTNAME#PARENT_UNIQUE_ID_000002.JOB_NAME
General Information
Job = JOB_NAME
Workstation = HOSTNAME
Scheduled Time = 01/06/2018 06:00 TZ CST
Runtime Information
Status = Successful
Job Number = 12346
Time Information
Maximum Duration =
Extra Information
-

HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME.txt

%sj HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME
General Information
Job = JOB_NAME
Workstation = HOSTNAME
Scheduled Time = 01/06/2018 06:00 TZ CST
Runtime Information
Status = Successful
Job Number = 12347
Time Information
Maximum Duration =
Extra Information
-
General Information
Job = JOB_NAME
Workstation = HOSTNAME
Scheduled Time = 01/06/2018 06:00 TZ CST
Runtime Information
Status = Successful
Job Number = 12348
Time Information
Maximum Duration =
Extra Information
-

I would like the file HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME.txt to be split into multiple files, one per job number. For this example:

HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME_12347.txt

%sj HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME
General Information
Job = JOB_NAME
Workstation = HOSTNAME
Scheduled Time = 01/06/2018 06:00 TZ CST
Runtime Information
Status = Successful
Job Number = 12347
Time Information
Maximum Duration =
Extra Information
-

HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME_12348.txt

%sj HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME
General Information
Job = JOB_NAME
Workstation = HOSTNAME
Scheduled Time = 01/06/2018 06:00 TZ CST
Runtime Information
Status = Successful
Job Number = 12348
Time Information
Maximum Duration =
Extra Information
-

Update: a workaround, but not a complete solution.
This is the closest I could get: a workaround with one caveat, and I'm sure it is an ugly way to do it.

split_JobPropsFile () {
        counter=1
        while IFS= read -r line; do
                if [[ $line =~ "%sj" ]]; then
                        job_prop_objct_name=$(echo "$line" | grep -o -P '(?<= ).*')
                        echo "$line" > "${job_prop_objct_name}_${counter}_jobProperties.txt"
                else
                        echo "$line" >> "${job_prop_objct_name}_${counter}_jobProperties.txt"
                        if [[ $line =~ "-" ]]; then
                                ((counter++))
                                #echo "End of Block"
                                echo "%sj" "$job_prop_objct_name" >> "${job_prop_objct_name}_${counter}_jobProperties.txt"
                        fi
                fi
        done < "$1"
}

The code above does what I expect, except that it creates an extra file at the end of the loop that contains only the "%sj" line.
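A likely reason for the extra file: the workaround writes the "%sj" header into the next numbered file as soon as it sees a "-" line, before it knows whether another job block follows. The sketch below takes a different approach and only remembers the header, writing it when the first line of a new block actually arrives. The function name split_job_props, the need_new flag and the sample file are hypothetical, and the sketch assumes every block ends with a line that starts with "-".

```bash
# Hypothetical helper: one file per job block, header written lazily.
split_job_props() {
    header="" name="" counter=0 need_new=1
    while IFS= read -r line; do
        case $line in
            '%sj '*)
                header=$line
                name=${line#* }          # text after the first space
                need_new=1
                continue
                ;;
        esac
        [ -n "$name" ] || continue       # skip anything before the first %sj line
        if [ "$need_new" -eq 1 ]; then
            # A new block really started, so open its file now
            counter=$((counter + 1))
            out="${name}_${counter}_jobProperties.txt"
            printf '%s\n' "$header" > "$out"
            need_new=0
        fi
        printf '%s\n' "$line" >> "$out"
        case $line in
            -*) need_new=1 ;;            # assumed end-of-block marker
        esac
    done < "$1"
}

# Small demo in a scratch directory (sample data shortened for illustration)
workdir=$(mktemp -d) && cd "$workdir" || exit 1
cat > sample.txt <<'EOF'
%sj HOSTNAME#PARENT_UNIQUE_ID_000002.JOB_NAME
General Information
Job Number = 12346
-
%sj HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME
General Information
Job Number = 12347
-
General Information
Job Number = 12348
-
EOF
split_job_props sample.txt
ls
```

With this shortened sample the function should leave three *_jobProperties.txt files, one per job block, and no trailing file that holds only the header.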

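Of course, my workaround may not be a smart way to achieve this. It is also slow when the input file is large, and there may be other issues I am not aware of, such as too many open files...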

Is it possible to use awk to get rid of the caveat of the extra file that this workaround creates?

1 Answer:

Answer 0: (score: 1)

I think you are looking for:

awk '/^%sj/   { prefix  = $2; content = "" } 
              { content = content "\n" $0        }
     /^Job N/ { close(fname); fname = prefix "_" $4 ".txt"   }
     /^-/     { print substr(content,2) > fname }
    ' MyTextFile
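
One thing worth checking in the snippet above: content is only reset on a "%sj" line, so when two jobs share one parent, the second job's file would also receive the first job's lines. A minimal variant that resets the buffer to the saved header after each block might look like the sketch below. It assumes every block ends with a line starting with "-" and contains exactly one "Job Number =" line. The header variable and the shortened sample file are illustrative additions.

```bash
# Scratch directory and shortened sample input (illustrative only)
workdir=$(mktemp -d) && cd "$workdir" || exit 1
cat > MyTextFile <<'EOF'
%sj HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME
General Information
Job = JOB_NAME
Job Number = 12347
Extra Information
-
General Information
Job = JOB_NAME
Job Number = 12348
Extra Information
-
EOF

awk '
    # Parent line: remember it and start a fresh buffer
    /^%sj/          { header = $0; prefix = $2; content = header; next }
                    { content = content "\n" $0 }
    # The job number decides the output file name
    /^Job Number =/ { fname = prefix "_" $4 ".txt" }
    # End of block: flush, then start the next block from the header again
    /^-/ {
        if (fname != "") { print content > fname; close(fname) }
        content = header
        fname = ""
    }
' MyTextFile
```

Closing each file right after it is written keeps the number of open files small on large inputs, and resetting content to header is what gives every job file its own copy of the parent line.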