bash: identify and validate a file header

Date: 2017-04-17 17:09:07

Tags: bash

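Using the tab-delimited file below, I am trying to validate the header (line 1) and store its number of fields in the variable $header, for use in a couple of if statements. If $header equals 10, the output should be "file has expected number of fields"; if $header is less than 10, it should be "file is missing header for:", with the missing header fields printed underneath. The bash seems close, and the awk seems to work perfectly when I run it on its own, but I cannot seem to use it inside the if. Thank you :).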

file.txt

Index   Chr Start   End Ref Alt Freq    Qual    Score   Input
1    1    1    100    C    -    1    GOOD    10    .
2    2    20    200    A    C    .002    STRAND BIAS    2    .
3    2    270    400    -    GG    .036    GOOD    6    .

file2.txt

Index   Chr Start   End Ref Alt Freq    Qual    Score
1    1    1    100    C    -    1    GOOD    10
2    2    20    200    A    C    .002    STRAND BIAS    2
3    2    270    400    -    GG    .036    GOOD    6

bash

for f in /home/cmccabe/Desktop/validate/*.txt; do
   bname=`basename $f`
   pref=${bname%%.txt}
   header=$(awk -F'\t' '{print NF, "fields detected in file and they are:" ORS $0; exit}') $f >> ${pref}_output  # detect header row in file and store in header and write to output
       if [[ $header == "10" ]]; then   # display results
          echo "file has expected number of fields"   # file is validated for headers
      else
          echo "file is missing header for:"  # missing header field ...in file not-validated
          echo "$header"
      fi  # close if.... else    
done >> ${pref}_output

Desired output for file.txt:
file has expected number of fields

Desired output for file2.txt:
file is missing header for:
Input

3 answers:

Answer 0 (score: 2)

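You can use awk if you like, but bash is more than capable of handling the first-line field comparison on its own. If you maintain an array of the expected field names, you can easily split the first line into fields and compare the count against the expected number of fields. If fewer fields than expected are read from any given file, you can then report which fields that file is missing.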

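Below is a short example that takes the filenames as arguments (for a large number of files you would read the filenames from stdin, or use xargs as needed). The script reads only the first line of each file, separates the line into fields, checks the field count, and reports any missing fields in a short error message: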

#!/bin/bash

declare -i header=10    ## header has 10 fields
## array of field names (can be read from 1st file)
fields=( "Index"
         "Chr"
         "Start"
         "End"
         "Ref"
         "Alt"
         "Freq"
         "Qual"
         "Score"
         "Input" )

for i in "$@"; do           ## for each file given as argument
    read -r line < "$i"     ## read first line from file into 'line'

    oldIFS="$IFS"           ## save current Internal Field Separator (IFS)
    IFS=$'\t'               ## set IFS to word-split on '\t'

    fldarray=( $line );     ## fill 'fldarray' with fields in line

    IFS="$oldIFS"           ## restore original IFS

    nfields=${#fldarray[@]} ## get number of fields in 'line'

    if (( nfields < header ))   ## test against header
    then
        printf "error: only '%d' fields in file '%s'\nmissing:" "$nfields" "$i"
        for j in "${fields[@]}" ## for each expected field
        do  ## check against those in line, if not present print
            [[ $line =~ $j ]] || printf " %s" "$j"
        done
        printf "\n\n"   ## tidy up with newlines
    fi
done

Example input

$ cat dat/hdr.txt
Index   Chr     Start   End     Ref     Alt     Freq    Qual    Score   Input
1       1       1       100     C       -       1       GOOD    10      .
2       2       20      200     A       C       .002    STRAND BIAS     2       .
3       2       270     400     -       GG      .036    GOOD    6       .

$ cat dat/hdr2.txt
Index   Chr     Start   End     Ref     Alt     Freq    Qual    Score
1       1       1       100     C       -       1       GOOD    10
2       2       20      200     A       C       .002    STRAND BIAS     2
3       2       270     400     -       GG      .036    GOOD    6

$ cat dat/hdr3.txt
Index   Chr     Start   End     Alt     Freq    Qual    Score   Input
1       1       1       100     -       1       GOOD    10      .
2       2       20      200     C       .002    STRAND BIAS     2       .
3       2       270     400     GG      .036    GOOD    6       .

Example use/output

$ bash hdrfields.sh dat/hdr.txt dat/hdr2.txt dat/hdr3.txt
error: only '9' fields in file 'dat/hdr2.txt'
missing: Input

error: only '9' fields in file 'dat/hdr3.txt'
missing: Ref

Look it over. While awk can do many things that bash cannot do on its own, bash is more than capable of parsing text.
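
One caveat with the script above: [[ $line =~ $j ]] is a substring match, so an expected name such as Ref would count as present if the header contained a longer name like Reference. A minimal sketch of an exact-name comparison follows. It assumes bash 4 or later (for associative arrays) and a tab-delimited header; the names expected and missing_fields are illustrative and not part of the original answer.

```shell
#!/bin/bash
# Sketch: print the expected header names that are absent from a file's first line.
# 'expected' and 'missing_fields' are hypothetical names used for illustration.
expected=(Index Chr Start End Ref Alt Freq Qual Score Input)

missing_fields() {
    local file=$1 line name
    local -a found
    local -A seen=()                           # associative array, needs bash 4+
    # read the first line; tolerate a final line without a trailing newline
    IFS= read -r line < "$file" || [[ -n $line ]] || return 1
    IFS=$'\t' read -r -a found <<< "$line"     # split the header on tabs only
    for name in "${found[@]}"; do
        seen[$name]=1                          # remember every name actually present
    done
    for name in "${expected[@]}"; do
        # exact lookup, so "Reference" does not satisfy "Ref"
        [[ -n ${seen[$name]:-} ]] || printf '%s\n' "$name"
    done
}
```

Called as missing_fields dat/hdr2.txt on a tab-delimited file with the header shown above, this should print Input; a file with all ten names prints nothing.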

Answer 1 (score: 1)

This code will do exactly what you asked for. Please let me know whether it works for you.

 for f in ./*.txt; do
     if [[ $(head -n 1 "$f" | awk '{print NF}') -eq 10 ]]; then
         echo "File $f has all the fields on its header"
     else   # names seen only once across the expected + actual headers are the missing ones
         echo "File $f is missing " $(echo "Index Chr Start End Ref Alt Freq Qual Score Input $(head -n 1 "$f")" | tr -s '[:space:]' '\n' | sort | uniq -u)
     fi
 done

Output:

File ./file2.txt is missing  Input
File ./file.txt has all the fields on its header
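
The question asked specifically for the field count to be kept in $header and reused in if statements. A minimal sketch of that pattern is below, assuming tab-delimited input and an expected count of 10; the helper name count_header_fields is illustrative. The key difference from the attempt in the question is that the file name sits inside the command substitution, so awk reads the file and only the number is captured.

```shell
#!/bin/bash
# Sketch: keep the header's field count in a variable, then branch on it.
# 'count_header_fields' is a hypothetical helper name.
count_header_fields() {
    # the file name goes inside the function call so awk actually reads the file;
    # only the number of fields on line 1 is printed
    awk -F'\t' 'NR == 1 { print NF; exit }' "$1"
}

for f in "$@"; do
    header=$(count_header_fields "$f")
    if [[ $header -eq 10 ]]; then
        echo "$f: file has expected number of fields"
    else
        echo "$f: expected 10 fields, found ${header:-0}"
    fi
done
```

Because $header holds only the number, it can be tested again later in the same loop iteration without rerunning awk.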

Answer 2 (score: 1)

Here is one in GNU awk (it uses nextfile):

$ awk '
FNR==NR {
    for(n=1;n<=NF;n++)
        a[$n]
    nextfile
}
NF==(n-1) {
    print FILENAME " file has expected number of fields"
    nextfile
}
{
    delete b                    # forget the previous file's header names
    for(i=1;i<=NF;i++)
        b[$i]
    print FILENAME " is missing header for: " 
    for(i in a)
        if(!(i in b))           # expected name not present in this file
            print i
    nextfile
}' file1 file1 file2
file1 file has expected number of fields
file2 is missing header for: 
Input

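The first file the script processes defines the headers (stored in a) that the following files should have; each following file's headers (stored in b) are then compared against them.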
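
If no reference file is at hand, the same comparison can be driven by a fixed list of names. The following is a sketch under that assumption, wrapped in a shell function; check_headers and the awk variable want are illustrative names. It avoids nextfile and so should also run on awks other than GNU awk, at the cost of reading each file to the end. It prints the "expected" message whenever no expected name is missing, even if the file has extra columns.

```shell
#!/bin/bash
# Sketch: check each file's header against a fixed list instead of a reference file.
# 'check_headers' and the awk variable 'want' are hypothetical names.
check_headers() {
    awk -F'\t' -v want='Index Chr Start End Ref Alt Freq Qual Score Input' '
    BEGIN { n = split(want, names, " ") }    # names[1..n] holds the expected fields
    FNR == 1 {                               # first line of every input file
        split("", have)                      # portable way to empty an array
        for (i = 1; i <= NF; i++)
            have[$i]                         # record each name found in the header
        missing = ""
        for (i = 1; i <= n; i++)
            if (!(names[i] in have))
                missing = missing "\n" names[i]
        if (missing == "")
            print FILENAME " file has expected number of fields"
        else
            print FILENAME " is missing header for:" missing
    }' "$@"
}
```

It would be called as check_headers file1 file2, with each file listed only once.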