Problem with a Bash loop that processes a large delimited file

Time: 2017-10-27 02:27:27

Tags: bash grep

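I have to find out whether each job in production has a backup job. A job's area is indicated by a suffix: PS for production, PP for backup. In addition, I need to make sure that not only are the names the same (apart from the last two characters), but that the scripts they reference are the same as well.

I used a double loop. I echoed the contents and every line of data, including the captured grep output that is echoed into the inner while loop. The data looks fine until I reach the if statements, where I extract the script names and compare them with each other. When I run this I can see which jobs do not line up, but I need the if statements to do that work for me. There are more than 24,000 jobs in Autosys. The discrepancy between production and backup is small, but even a small one matters, and it is far too much to check by hand in a spreadsheet.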

#!/bin/bash

IFS=,

file="/tmp/casper_test.txt"

while read -r area job machine script
do
    prod_line=$(grep  ${job%??} $file)
    echo "$prod_line" | while IFS=, read -r area job machine script
    do
        if [ "$area" == "PROD" ] ; then
            prod_script="$script"
        elif [ "$area" == "BACKUP" ] ; then
            backup_script="$script"

        elif [ "$prod_script" == "$backup_script" ] ; then
            echo "MATCH,$area,$job,$machine,$script "
        else
            echo "NO MATCH,$area,$job, $machine, $script "
        fi
    done
done < $file

Input file /tmp/casper_test.txt:

BACKUP, CAPSER_JOB_01_PP, usa-penguin.com, /bin/bash -lc '/usr/bin/run.sh'
PROD, CAPSER_JOB_01_PS, usa-penguin.com, /bin/bash -lc '/usr/bin/run.sh'
BACKUP, CAPSER_JOB_02_PP, usa-penguin.com, /bin/bash -lc '$HOME/run/script02'
PROD, CAPSER_JOB_02_PS, usa-penguin.com, /bin/bash -lc '$HOME/run/comeAndPlay'
BACKUP, CAPSER_03_PP, usa-penguin.com, /bin/bash -lc '$HOME/run/script03'
PROD, CAPSER_JOB_03_PS, usa-penguin.com, /bin/bash -lc '$HOME/run/script03'
BACKUP, CAPSER_JOB_04_PP, usa-penguin.com, /bin/bash -lc '$HOME/run/script04'
PROD, CAPSER_JOB_04_PS, usa-penguin.com, /bin/bash -lc '$HOME/run/withUsDanny'
PROD, CAPSER_JOB_05_PS, usa-penguin.com, /bin/bash -lc '$HOME/run/script05'
PROD, CAPSER_JOB_06_PS, usa-penguin.com, /bin/bash -lc '$HOME/run/script06'
BACKUP, CAPSER_JOB_07_PP, usa-penguin.com, /bin/bash -lc '$HOME/run/script07'
PROD, CAPSER_JOB_07_PS, usa-penguin.com, /bin/bash -lc '$HOME/run/script07'

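3 answers:

Answer 0: (score: 2)

Since what you really need is a list of the production job names that have no matching backup job, here is an awk script that prints it: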

awk -F ', *' '{gsub("_..$", "", $2)} /BACKUP/{b[$2] = $NF} /PROD/{p[$2] = $NF} END {for (i in p) if (p[i] != b[i]) print i}' /tmp/casper_test.txt
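  • -F ', *' - use a comma followed by any number of spaces as the field separator
  • {gsub("_..$", "", $2)} - strip the suffix from the job name, which is the second field
  • /BACKUP/{b[$2] = $NF} /PROD/{p[$2]=$NF} - store the backup scripts in one array and the prod scripts in another
  • END {for (i in p) if (p[i] != b[i]) print i} - after all lines have been read, loop over the prod scripts and print every job that has no matching script among the backups

Sample output (awk does not guarantee the order of a for-in loop, so the lines may come out in a different order):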

CAPSER_JOB_02
CAPSER_JOB_03
CAPSER_JOB_04
CAPSER_JOB_05
CAPSER_JOB_06

The jobs with these IDs have no match; the rest do.

As for the shell script, look at what happens in the inner while loop:
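The one-liner does not say why a job was printed: a missing backup job and a backup job with a different script look the same. Below is a minimal sketch of the same idea written out with comments and that distinction added. The function name find_unmatched and the two message texts are assumptions made for illustration, not part of the original answer, and the output is piped through sort only to make the order predictable.

```shell
#!/bin/bash
# Sketch: list prod jobs whose backup job is missing or runs another script.
# find_unmatched is a hypothetical helper name; pass the input file as $1.
find_unmatched() {
    awk -F ', *' '
        { sub(/_..$/, "", $2) }           # drop the _PS or _PP suffix
        $1 == "BACKUP" { b[$2] = $NF }    # backup script, keyed by base job name
        $1 == "PROD"   { p[$2] = $NF }    # prod script, keyed by base job name
        END {
            for (i in p) {
                if (!(i in b)) {
                    print i ": no backup job"
                } else if (p[i] != b[i]) {
                    print i ": script differs"
                }
            }
        }
    ' "$1" | sort
}
```

Called as find_unmatched /tmp/casper_test.txt on the sample input, this reports CAPSER_JOB_02 and CAPSER_JOB_04 as having a different script, and CAPSER_JOB_03, CAPSER_JOB_05 and CAPSER_JOB_06 as having no backup job.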

echo "$prod_line" | while IFS=, read -r area job machine script
do
    if [ "$area" == "PROD" ] ; then
        prod_script="$script"
    elif [ "$area" == "BACKUP" ] ; then
        backup_script="$script"

    elif [ "$prod_script" == "$backup_script" ] ; then
        echo "MATCH,$area,$job,$machine,$script "
    else
        echo "NO MATCH,$area,$job, $machine, $script "
    fi
done

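The grep output will contain no more than two lines, one with BACKUP and one with PROD. As a result, your third elif and the else are never reached. Those tests should be moved outside the inner loop, so that the comparison happens once both lines have been read. Since some backup jobs are missing, you will probably want to clear the values before reading, so that stale values from the previous job are not reused.

Answer 1: (score: 2)

You can do this in pure Bash, using hashes (associative arrays) and a single read of the input file. With 24K lines in the input file, this is far more efficient than a solution that reads the file n + 1 times, which for a 24K-line file means 24,001 times! I have also added some basic error handling.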
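The restructuring described above could look roughly like the sketch below. It assumes bash, and it goes one step beyond the advice: the inner loop reads from a process substitution instead of a pipe, because a loop on the right-hand side of a pipe runs in a subshell and its variables are gone once it ends. The function name compare_jobs and the short inner variable names are made up for illustration. Only PROD lines are reported, so each job is printed once.

```shell
#!/bin/bash
# Sketch of the restructured loop; compare_jobs is a hypothetical name.
compare_jobs() {
    local file=$1 area job machine script a j m s base prod_script backup_script
    while IFS=, read -r area job machine script; do
        [ "$area" = "PROD" ] || continue        # report once per prod job
        job=${job# } machine=${machine# } script=${script# }  # drop the space after each comma
        base=${job%??}                          # job name without PS / PP
        prod_script="" backup_script=""         # clear values left over from the last job
        while IFS=, read -r a j m s; do
            j=${j# }
            [ "${j%??}" = "$base" ] || continue # ignore partial-name hits from grep
            if [ "$a" = "PROD" ]; then
                prod_script=${s# }
            elif [ "$a" = "BACKUP" ]; then
                backup_script=${s# }
            fi
        done < <(grep -F -- "$base" "$file")    # not a pipe, so the variables survive
        # The comparison runs after both lines have been read.
        if [ "$prod_script" = "$backup_script" ]; then
            echo "MATCH,$area,$job,$machine,$script"
        else
            echo "NO MATCH,$area,$job,$machine,$script"
        fi
    done < "$file"
}
```

This still runs grep once per production job, so it only fixes the logic; it does not make the approach any faster on a 24,000-line file.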

#!/bin/bash
line=0
declare -A prod_jobs_job prod_jobs_scripts prod_jobs_machines backup_jobs_scripts
while IFS=, read -r area job machine script; do
    ((line++))
    j="${job%??}"
    if [[ $area == "PROD" ]]; then
      prod_jobs_job[$j]="$job"           # this hash holds the original job name
      prod_jobs_scripts[$j]="$script"    # holds the prod script
      prod_jobs_machines[$j]="$machine"  # holds the prod machine, used for printing only
    elif [[ $area == "BACKUP" ]]; then
      backup_jobs_scripts[$j]="$script"  # holds the backup script, used for comparison
    else
      printf '%s\n' "Unknown area '$area' at line number $line" >&2
    fi
done < <(sed 's/, */,/g' t1) # make sure to strip out the spaces after commas

# traverse the prod jobs hash and compare with backup
# if there is no match in backup hash, treat it as an error
for j in "${!prod_jobs_scripts[@]}"; do
    prod_script="${prod_jobs_scripts[$j]}"
    job="${prod_jobs_job[$j]}"
    backup_script="${backup_jobs_scripts[$j]}"
    [[ ! $backup_script ]] && { printf '%s\n' "No backup job for '$job'" >&1; continue; }
    prod_machine="${prod_jobs_machines[$j]}"
    if [[ $prod_script == $backup_script ]]; then
      printf '%s\n' "MATCH:PROD,$job,$prod_machine,$prod_script"
    else
      printf '%s\n' "NO MATCH:PROD,$job,$prod_machine,$prod_script"
    fi
done

For your input file, we get this output:

MATCH:PROD,CAPSER_JOB_07_PS,usa-penguin.com,/bin/bash -lc '$HOME/run/script07'
No backup job for 'CAPSER_JOB_06_PS'
MATCH:PROD,CAPSER_JOB_01_PS,usa-penguin.com,/bin/bash -lc '/usr/bin/run.sh'
NO MATCH:PROD,CAPSER_JOB_02_PS,usa-penguin.com,/bin/bash -lc '$HOME/run/comeAndPlay'
No backup job for 'CAPSER_JOB_03_PS'
NO MATCH:PROD,CAPSER_JOB_04_PS,usa-penguin.com,/bin/bash -lc '$HOME/run/withUsDanny'
No backup job for 'CAPSER_JOB_05_PS'

Answer 2: (score: 0)

Update

Try this alternative:

grep PROD /tmp/casper.txt > PROD.txt
grep BACKUP /tmp/casper.txt > BACKUP.txt

awk 'FNR==NR{a[$6];b[substr($2,1,13)];next}($6 in a && substr($2,1,13) in b){print}' BACKUP.txt PROD.txt

This produces the following, and it holds up for an input file with a large number of lines:

 PROD, CAPSER_JOB_01_PS, usa-penguin.com, /bin/bash -lc '/usr/bin/run.sh'
 PROD, CAPSER_JOB_07_PS, usa-penguin.com, /bin/bash -lc '$HOME/run/script07'

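The code below, by contrast, does not scale to larger input files.

You have made the while loop too complicated, and it is somewhat buggy because both loops use the same variable names. See whether the following works for you.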

#!/bin/bash

IFS=,
file="casper.txt"
while read -r area job machine script
do
    if [ "$area" == "PROD" ] ; then
        prod_script="$script"
        jobname=${job%??}
        IFS=,
        while read -r area1 job1 machine1 script1
        do
            if [ "$area1" == "BACKUP" ]; then
            jobname1=${job1%??}
                if [ "$jobname" == "$jobname1" ]; then
                    if [ "$prod_script" == "$script1" ] ; then
                        echo "MATCH: $area,$job,$machine,$script"
                        break;
                    fi
                fi
            fi
        done < "$file"
    fi
done < "$file"

For your input file, this produces:

]# ./casper
MATCH: PROD, CAPSER_JOB_01_PS, usa-penguin.com, /bin/bash -lc '/usr/bin/run.sh'
MATCH: PROD, CAPSER_JOB_07_PS, usa-penguin.com, /bin/bash -lc '$HOME/run/script07'