Question

I want to print the strings that match a pattern like value="any string", but not value="#{any string}", from all files in a directory and its subdirectories.

dir1
     file1
           ( content like ..... value="GOD Grace" .....
                ....................................value="#{blog}"......
                ... value="Greek" ...)
     file2
           ( content like ..... value="Sounder rajan" .....
             ....................................value="#{feek}".....
             ....................................value="patient"....)


             subdir1
                       file3
                            ( content like ..... value="Guice" .....
                            ....................................value="#{slog}"......
                            ... value="guide" ...)

I want output like this:

         filename  filewordno   wordsExtract   uniqno
         file1        1                GOD Grace      1
         file1        2                Greek              2
         file2        1                Sounder rajan   3
         file2        2                patient             4
         file3        1                Guice              5
         file3        2                guide               6

My attempt:

no=0;

for SourceFile in *.xhtml
do
    pagename=$(basename $filename .xhtml)
    cat $SourceFile | gawk 'BEGIN {FS="[ \"]"}
     wno=0;
/value=/ && !/value=\"#/ && !/pages/ && !/value=\"[0-9]\"/  {
for (i=1; i<NF; i++) {

    if (( !/#/ && /value=/ ) && $i == "value=" && $(i+1)!=""  && $(i+1)!=":"  && $(i+1)!="*" ){
        print SourceFile,++wno,$(i+1),++no;

    }

}
 }'
done >>  path/Outputfilename

My output:

filename  filewordno   wordsExtract   uniqno
             -        1                Grace              1
             -        1                Greek              2
             -        1                Sounder           1
             -        1                patient             2

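My 3 problems:

The strings are split on spaces. I want the whole string including spaces, e.g. 'GOD Grace', not just 'Grace'.
I also want the files in subdirectories, but my script only prints files from the main directory.
I want a unique serial number across all extracted strings.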

I have been learning and working on this for a week. If you have time, your help would be very valuable to me.

Thanks

Answer 1

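I think a simple grep command in a loop can do what you want. If you can accept a solution that does not parse with awk, check the script below and its output. I used the same file contents you gave in your question.

Script (extract_values.sh)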

#!/bin/bash
# Loop over all .xhtml files under the current directory, recursively.
# Note: this loop splits on whitespace, so it assumes file paths without spaces.
for file in $(find . -type f -name "*.xhtml" -print)
do
  v_currdir=$(pwd)                 # store the current working directory
  v_file_path=$(dirname "$file")   # extract the directory part separately
  v_file_name=$(basename "$file")  # extract the file name separately
  cd "$v_file_path"                # change to that directory so grep -H prints the bare file name

  # Extract every value="...", drop the ones starting with '#', strip the "value=" prefix,
  # then number the remaining lines per file (grep -nv with a pattern that never matches)
  grep -o -H 'value="[^"]*"' "$v_file_name" | grep -v 'value="#.*"' | sed 's/value=//g' | grep -nv 'StringNotToBeFound'

  # Change back to the original working directory for the next iteration
  cd "$v_currdir"
done
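
The pipeline inside the loop is dense, so here is a rough sketch of the same idea for a single file, with one stage per line. The function name number_values is made up for illustration, and I use `grep -n ''` (an empty pattern matches every line) in place of the never-matching-pattern trick; treat it as an illustration rather than a replacement for the script.

```shell
# Hypothetical helper: number the plain value="..." strings of one file.
number_values() {
  grep -o 'value="[^"]*"' "$1" |   # keep only the value="..." matches, one per line
    grep -v '^value="#' |          # drop placeholders such as value="#{blog}"
    sed 's/^value=//' |            # strip the attribute name, keep the quoted string
    grep -n ''                     # prefix each remaining line with its line number
}
```

Running `number_values file1.xhtml` prints one numbered, quoted string per line, which is the per-file word number the loop relies on.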

Sample run of the script

$ ls
dir1  extract_values.sh

$ find . -print
.
./dir1
./dir1/file1.xhtml
./dir1/file2.xhtml
./dir1/subdir1
./dir1/subdir1/file3.xhtml
./extract_values.sh

$ # this is the command using the above script to add header 
$ # And change the delimiter to tab from colon using tr command
$ (echo "UniqNO:FWordNo:FileName:WordsExtract"; extract_values.sh | nl -s: ) | tr ':' "\t"
UniqNO  FWordNo FileName        WordsExtract
     1  1       file1.xhtml     "GOD Grace"
     2  2       file1.xhtml     "Greek"
     3  1       file2.xhtml     "Sounder rajan"
     4  2       file2.xhtml     "patient"
     5  1       file3.xhtml     "Guice"
     6  2       file3.xhtml     "guide"
$
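
If you would rather stay with awk, as in your attempt, something along these lines might work. It is only a sketch: extract_values is a hypothetical function name, it assumes GNU-style `find -print0`, `sort -z` and `xargs -0`, and it treats any value="..." whose content does not start with # as a string to print. Using match() on the whole line instead of splitting fields on spaces keeps strings such as GOD Grace intact.

```shell
#!/bin/bash
# Hypothetical helper: print filename, per-file word number, extracted string
# and a running unique number for every plain value="..." under directory $1.
extract_values() {
  find "$1" -type f -name '*.xhtml' -print0 | LC_ALL=C sort -z |
  xargs -0 awk '
    BEGIN { OFS = "\t"; print "filename", "filewordno", "wordsExtract", "uniqno" }
    # New file: reset the per-file counter and strip the directory from the name
    FNR == 1 { wno = 0; name = FILENAME; sub(/.*\//, "", name) }
    {
      line = $0
      # Walk through every value="..." occurrence on the current line
      while (match(line, /value="[^"]*"/)) {
        word = substr(line, RSTART + 7, RLENGTH - 8)  # text between the quotes
        line = substr(line, RSTART + RLENGTH)         # continue after this match
        if (word != "" && word !~ /^#/)
          print name, ++wno, word, ++uniq
      }
    }'
}
```

Called as `extract_values dir1`, it prints a tab-separated table with a header line. The unique number keeps counting across files because a single awk process reads all of them; with a very large number of files xargs may start awk more than once, which would restart the numbering.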

Extract and print specific pattern strings from files in a directory and its subdirectories using awk or gawk
