Using awk to parse sections of a text file

Time: 2015-04-28 18:37:37

Tags: arrays bash awk sed grep

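I have a script with two problems:

  1. Passing the correct variables to awk.
  2. awk does not like the particular command I use to specify the beginning and ending values when printing between specified patterns.

Here are the contents of states.txt: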

    Alabama
    
    Area: 52,423 sq.mi (135,775 sq.km.), 30th
    Land: 50,750 sq.mi. (131,442 sq.km.), 28th
    Water: 1,673 sq.mi. (4,333 sq.km.), 23rd
    Coastline: 53 mi. (85 km.), 17th
    Shoreline: 607 mi. (977 km.), 19th
    
    Alaska
    
    Area: 656,425 sq.mi (1,700,134 sq.km.), 1st
    Land: 570,374 sq.mi. (1,477,263 sq.km.), 1st
    Water: 86,051 sq.mi. (222,871 sq.km.), 1st
    Coastline: 6,640 mi. (10,686 km.), 1st
    Shoreline: 33,904 mi. (54,563 km.), 1st
    
    Arizona
    
    Area: 114,006 sq.mi (295,274 sq.km.), 6th
    Land: 113,642 sq.mi. (294,332 sq.km.), 6th
    Water: 364 sq.mi. (943 sq.km.), 48th
    
    Arkansas
    
    Area: 53,182 sq.mi (137,741 sq.km.), 29th
    Land: 52,075 sq.mi. (134,874 sq.km.), 27th
    Water: 1,107 sq.mi. (2,867 sq.km.), 31st
    
    California
    
    Area: 163,707 sq.mi (423,999 sq.km.), 3rd
    Land: 155,973 sq.mi. (403,969 sq.km.), 3rd
    Water: 7,734 sq.mi. (20,031 sq.km.), 6th
    Coastline: 840 mi. (1,352 km.), 3rd
    Shoreline: 3,427 mi. (5,515 km.), 5th
    
    Colorado
    
    Area: 104,100 sq.mi (269,618 sq.km.), 8th
    Land: 103,730 sq.mi. (268,660 sq.km.), 8th
    Water: 371 sq.mi. (961 sq.km.), 46th'
    

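and so on.

What I want to do is develop a script that extracts the information for each state separately as it parses the file.

So the script looks like this: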

    for state in $(cat states.txt | egrep -v 'Area|Land|Water' | grep [A-Z]) ; do 
    
    echo $state >> ./statelist.txt ; 
    
    done ;
    
    for statesnip in $(cat ./statelist.txt | awk 'NR>1{print p "_" $0 ORS} {p=$0}' | grep [A-Z]) ; do 
    
        state1=$(echo $statesnip | awk -F _ '{print $1}') ; 
        state2=$(echo $statesnip | awk -F _ '{print $2}') ; 
    
        cat ./states.txt | awk '/$state1/{f=1}; /$state2/{f=0}' >> $state1.tmp.txt ; 
    
    done;
    
    rm -f ./statelist.txt
    

So here is the breakdown.

The first problem is passing the variables into awk, for example:

    awk -v state1=$state1 -v state2=$state2 '/state1/{f=1} f; /state2/{f=0}';
    

    awk -v state1=${state1} state2=${state2} '/state1/{f=1} f; /state2/{f=0}';
    

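I get errors.

The second problem is that when I switch the variables to the -v format, awk does not like it (it just cats the whole file, many times over).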

     awk -v state1=${state1} -v state2=${state2} 'state1{f=1} f; state2{f=0}'
    

I just get the full contents of the whole file printed over and over.

The expected output should look like this:

    cat ./statelist.txt
    
    Alabama
    Alaska
    Arizona
    Arkansas
    California
    Colorado
    
    cat ./statelist.txt | awk 'NR>1{print p "_" $0 ORS} {p=$0}' | grep [A-Z]
    
    Alabama_Alaska
    Alaska_Arizona
    Arizona_Arkansas
    Arkansas_California
    California_Colorado
    
    cat ./Alabama.txt:
    
    Alabama
    
    Area: 52,423 sq.mi (135,775 sq.km.), 30th
    Land: 50,750 sq.mi. (131,442 sq.km.), 28th
    Water: 1,673 sq.mi. (4,333 sq.km.), 23rd
    Coastline: 53 mi. (85 km.), 17th
    Shoreline: 607 mi. (977 km.), 19th
    
    cat ./Alaska.txt
    
    Alaska
    
    Area: 656,425 sq.mi (1,700,134 sq.km.), 1st
    Land: 570,374 sq.mi. (1,477,263 sq.km.), 1st
    Water: 86,051 sq.mi. (222,871 sq.km.), 1st
    Coastline: 6,640 mi. (10,686 km.), 1st
    Shoreline: 33,904 mi. (54,563 km.), 1st
    
    cat ./Arizona.txt
    
    Arizona
    
    Area: 114,006 sq.mi (295,274 sq.km.), 6th
    Land: 113,642 sq.mi. (294,332 sq.km.), 6th
    Water: 364 sq.mi. (943 sq.km.), 48th
    
    cat ./Arkansas.txt
    
    Arkansas
    
    Area: 53,182 sq.mi (137,741 sq.km.), 29th
    Land: 52,075 sq.mi. (134,874 sq.km.), 27th
    Water: 1,107 sq.mi. (2,867 sq.km.), 31st
    
    cat ./California.txt
    
    California
    
    Area: 163,707 sq.mi (423,999 sq.km.), 3rd
    Land: 155,973 sq.mi. (403,969 sq.km.), 3rd
    Water: 7,734 sq.mi. (20,031 sq.km.), 6th
    Coastline: 840 mi. (1,352 km.), 3rd
    Shoreline: 3,427 mi. (5,515 km.), 5th
    
    cat ./Colorado.txt
    
    Colorado
    
    Area: 104,100 sq.mi (269,618 sq.km.), 8th
    Land: 103,730 sq.mi. (268,660 sq.km.), 8th
    Water: 371 sq.mi. (961 sq.km.), 46th'
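
A note on the awk attempts above, offered as a sketch rather than a definitive fix. Each variable needs its own `-v`, so the attempt with a bare `state2=...` would be expected to fail. Inside `/.../` awk matches the literal text `state1`, not the variable's value. A bare non-empty variable used as a pattern is always true, which would explain the whole file being printed. One way to match a line against a variable is `$0 ~ state1`. The shortened `states.txt` below is made-up sample data.

```shell
# Work in a scratch directory with hypothetical, shortened sample data.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf '%s\n' 'Alabama' 'Area: 52,423 sq.mi' \
              'Alaska'  'Area: 656,425 sq.mi' \
              'Arizona' 'Area: 114,006 sq.mi' > states.txt

state1=Alaska
state2=Arizona

# $0 ~ var uses the variable's value as a dynamic regex.
# The end test runs first so the line matching state2 is not printed.
awk -v state1="$state1" -v state2="$state2" '
    $0 ~ state2 { f = 0 }
    $0 ~ state1 { f = 1 }
    f
' states.txt > "$state1.tmp.txt"

cat "$state1.tmp.txt"
```

Because `$0 ~ state1` is a regex match, a name such as Virginia would also match a West Virginia line; `$0 == state1` compares the whole line instead.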
    

2 answers:

Answer 0: (score: 4)

Any time you write a loop in shell just to manipulate text, you have the wrong approach.

In this case, it looks like all you really need is:

    awk 'NF==1{out=$1".txt"} {print > out}' states.txt

If not, please clarify. Oh, and with a non-gawk awk you may need to add close(out) before the out=... assignment.
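
A minimal sketch of that `close(out)` variant, assuming single-word header lines as in the sample. The shortened `states.txt` is made-up data, and the `out != ""` guard is an addition here so that nothing is closed before the first header is seen.

```shell
# Work in a scratch directory with hypothetical, shortened sample data.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf '%s\n' 'Alabama' '' 'Area: 52,423 sq.mi' '' \
              'Alaska'  '' 'Area: 656,425 sq.mi' > states.txt

# On each one-field line, close the previous output file (if any),
# then switch to "<State>.txt". Every line goes to the current file.
awk 'NF==1 { if (out != "") close(out); out = $1 ".txt" }
     { print > out }' states.txt

ls
```

Closing each file as soon as its section ends keeps the number of simultaneously open files at one, which matters for awks that limit open output files.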

Answer 1: (score: 2)

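Although the question implies that awk is being used to parse the file, the script given uses other commands more than it uses awk. Awk can do the whole job.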

    awk \
      ' \
        BEGIN \
        { FS = ":" }
        NF == 1 && /^[A-Z]/ \
        { FILE = $0 ".txt"; printf "\n%s\n\n", $0 >FILE }
        NF > 1 \
        { print >FILE }
      ' states.txt

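Although a smaller script could do the job, this one does a little extra. Using a colon as the field separator quickly distinguishes data lines from header lines. Blank lines are ignored, and printf() is used to generate the header line in the output file. This means the input file needs no particular spacing, and extra or missing blank lines will not mess up the output. That may or may not be what you want.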
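
To illustrate the point about spacing, here is a small sketch that runs the same rules on made-up input with irregular blank lines. The sample data is hypothetical, and the program is laid out without the line continuations.

```shell
# Work in a scratch directory; the input deliberately has no blank line
# after the first header and doubled blank lines elsewhere.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf '%s\n' 'Alabama' 'Area: 52,423 sq.mi' '' '' \
              'Alaska' '' '' 'Area: 656,425 sq.mi' > states.txt

awk '
    BEGIN { FS = ":" }
    # Header line: a single field that starts with a capital letter.
    NF == 1 && /^[A-Z]/ { FILE = $0 ".txt"; printf "\n%s\n\n", $0 > FILE }
    # Data line: contains a colon, so it splits into more than one field.
    NF > 1 { print > FILE }
' states.txt

cat Alabama.txt Alaska.txt
```

Both output files come out with the same layout regardless of the input spacing. Note that the printf format starts each file with a blank line before the state name.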