How to extract multiple patterns and create a csv file?

Time: 2017-11-07 11:48:06

Tags: linux bash awk

I have this awk command:

while read -r fname
do
  part1="$(awk '/===/g {p=1; next}/***/ {exit} /^$/ {next} p==1 {gsub(/^[[:space:]]+|[[:space:]]+$/,"",$0);gsub("\"","\"\"",$0); print}' $fname)"
  part2="$(awk '/***/{p=1; next}/###/ {exit} /^$/ {next} p==1 {gsub(/^[[:space:]]+|[[:space:]]+$/,"",$0);gsub("\"","\"\"",$0); print}' $fname)"

  if [[ $part1 = *[!\ ]* ]] && [[ $part2 = *[!\ ]* ]]; then
    echo "$fname,\"$part1\",\"$part2\"" >> extracted_text.csv
  fi
done < Flist.txt

where Flist.txt contains a list of file names, for example:

$ cat Flist.txt
file1.txt

file1.txt has the following content:

This is section 1 
===
This is section 2
***
This is section 3
###
This is section 4
This is section 5
===
This is section 6
***
This is section 7
###
This is section 8

I am trying to extract the text between === and ###, and then split the extracted text on ***. The idea is to get a csv file of the matched patterns, like:

file1.txt,This is section 2,This is section 3
file1.txt,This is section 6,This is section 7

Instead, I get only one match:

file1.txt,This is section 2,This is section 3

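Note: I am new to awk and sed. I tried sed earlier but eventually settled on awk, and I am stuck on getting the desired output. Thanks for your help.

4 answers:

Answer 0 (score: 0)

A pure bash solution (sed is used only for the final substitution):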

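while read -r fname; do
  track=0   # reset state for each file
  t=''
  while read -r line
  do
    if [[ "$line" = "===" ]]; then
      track=1                                  # start marker: begin collecting
      continue
    elif [[ "$line" = "###" ]]; then
      track=0                                  # end marker: stop collecting
      echo "$fname,$t" | sed 's/\*\*\*/,/g'    # split on *** and print the row
      t=''
    fi

    if [[ "$track" = 1 ]]; then
      t="$t$line"                              # append the line (including ***)
    fi
  done < "$fname"
done < Flist.txt > output.csv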

Output (written to output.csv):

file1.txt,This is section 2,This is section 3
file1.txt,This is section 6,This is section 7

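All we do is look for the start and end markers while reading the file and set a flag when we hit them. After ===, we append everything to a variable until we reach ###. On ###, we print the variable, replacing *** with a comma to split the fields.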
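The sed call can be avoided with bash's pattern substitution, which keeps the loop entirely in bash. A minimal sketch, assuming each marker sits alone on its line; the function name extract_sections is made up for illustration and is not part of the answer above:

```shell
#!/usr/bin/env bash
# extract_sections is a hypothetical helper name used only for this sketch.
extract_sections() {
  local fname=$1 line track=0 t=''
  while read -r line; do
    if [[ $line = '===' ]]; then
      track=1; t=''                      # start marker: begin collecting
    elif [[ $line = '###' ]]; then
      if [[ $track = 1 ]]; then
        # ${t//\*\*\*/,} replaces every literal *** with a comma
        printf '%s,%s\n' "$fname" "${t//\*\*\*/,}"
      fi
      track=0; t=''                      # end marker: stop collecting
    elif [[ $track = 1 ]]; then
      t+=$line                           # collect the line (including ***)
    fi
  done < "$fname"
  return 0
}
```

It could then be driven by the same outer loop, for example `while read -r f; do extract_sections "$f"; done < Flist.txt > output.csv`.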

Answer 1 (score: 0)

#!/usr/bin/env bash

while read -r filename; do
  awk -v OFS=',' '/^[=]+/{start=1; printf "%s", FILENAME; next}/^[*]+/{next}/^[#]+/{start=0;print ""}start{printf "%s%s", OFS, $0}' "$filename"
done < "Flist.txt" > outfile.csv

Explanation

awk -v OFS=',' '/^[=]+/{                       # line starts with =
                      start=1;                 # set variable start = 1
                      printf "%s", FILENAME;   # print the file name, no newline
                      next                     # go to next line
                   }
                /^[*]+/{                       # line starts with *
                      next                     # skip it, go to next line
                   }
                /^[#]+/{                       # line starts with #
                      start=0;                 # end of section, make start=0
                      print ""                 # print a newline
                   }
                start{                         # as long as start is non-zero
                      printf "%s%s", OFS, $0   # print output field separator
                                               # and current line/record/row
                   }
               ' "$filename"
  

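/^[#]+/

  • ^ asserts the position at the start of the string
  • [#] matches a single character from the list (here only #)
  • + quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)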

/^[*]+/ and /^[=]+/ work the same way as the regexp above.
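To see what the anchored pattern accepts, a quick check can help. A small sketch with invented sample lines (they are not from the question):

```shell
# Sample lines are made up for illustration.
# Only lines that START with at least one # are reported as a match.
result=$(printf '%s\n' '###' '#' 'text ###' '  ###' '####### end' |
  awk '/^[#]+/ { print "match: " $0; next } { print "no match: " $0 }')
printf '%s\n' "$result"
```

Because the pattern is anchored only at the start, any line beginning with a single # counts as an end marker. If the markers always stand alone on a line, a stricter pattern such as /^###$/ may be a safer choice.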

Test results:

$ cat infile.txt
This is section 1 
===
This is section 2
***
This is section 3
###
This is section 4
This is section 5
===
This is section 6
***
This is section 7
###
This is section 8


$ awk -v OFS=',' '/^[=]+/{start=1; printf "%s", FILENAME; next}/^[*]+/{next}/^[#]+/{start=0;print ""}start{printf "%s%s", OFS, $0}' infile.txt
infile.txt,This is section 2,This is section 3
infile.txt,This is section 6,This is section 7

Answer 2 (score: 0)

The following awk may also help you.

awk '/===/{flag=1;next} /###/{flag="";print FILENAME","val;val="";next}  flag && !/^\*\*\*/{val=val?val "," $0:$0}'  Input_file

Here is the same solution in non-one-liner form.

awk '
/===/{                  ## If the line contains ===, do the following.
  flag=1;               ## Set the variable flag to 1.
  next                  ## Skip all further statements for this line.
}
/###/{                  ## If the line contains ###, do the following.
  flag="";              ## Reset flag to the empty string.
  print FILENAME","val; ## Print the file name (FILENAME), a comma, and the value of val.
  val="";               ## Clear val.
  next                  ## Skip all further statements for this line.
}
flag && !/^\*\*\*/{     ## If flag is set and the line does not start with ***, do the following.
  val=val?val "," $0:$0 ## Append the line to val after a comma, or set val to the line if val is empty.
}
' Input_file            ## The input file.
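The command in the question also trimmed whitespace, doubled embedded double quotes, and wrapped each section in quotes. A sketch of how that could be folded into this flag-based approach, assuming the markers sit alone on their lines; the names to_csv and csvq are hypothetical and exist only for this illustration:

```shell
# to_csv and csvq are made-up names for this sketch.
to_csv() {
  awk '
  function csvq(s) {                               # quote one CSV field
    gsub(/^[[:space:]]+|[[:space:]]+$/, "", s)     # trim surrounding whitespace
    gsub(/"/, "\"\"", s)                           # double embedded quotes
    return "\"" s "\""
  }
  /^===$/ { flag = 1; row = ""; next }             # start marker: begin a new row
  /^###$/ { if (flag) print FILENAME row; flag = 0; next }
  flag && !/^\*\*\*$/ { row = row "," csvq($0) }   # one quoted field per line
  ' "$@"
}
```

Calling `to_csv file1.txt` would emit one row per === ... ### block. Note that each collected line becomes its own field here, whereas the question's version kept a multi-line section inside a single quoted field.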

Answer 3 (score: 0)

Another awk.

while read -r fname; do
    # [*][*][*] matches a literal ***; the trailing backslash continues the command
    awk -F'===|[*][*][*]' -v RS='###' -v OFS=',' \
    '{gsub(/\n/,"",$0)} (NF==3){$1=FILENAME;print}' "$fname"
done < Flist.txt > output.csv

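We set the first and second patterns as the field separator, the third pattern as the record separator, and a comma as the output field separator. gsub removes the newlines, and after the first field is replaced with the file name, the record is printed. NF==3 is there only to avoid printing a trailing incomplete record, such as the last one in the sample input. Note that a multi-character RS is an extension supported by gawk and mawk; POSIX leaves its behavior unspecified.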
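To see how these separators carve up the input, it can help to print each record number with its field count. A sketch assuming gawk or mawk and a throwaway sample file that stands in for file1.txt:

```shell
sample=$(mktemp)   # throwaway file standing in for file1.txt
printf '%s\n' 'This is section 1 ' '===' 'This is section 2' '***' \
  'This is section 3' '###' 'This is section 8' > "$sample"

# Each ###-terminated chunk is one record; === and *** split it into fields.
counts=$(awk -F'===|[*][*][*]' -v RS='###' '
  { gsub(/\n/, "", $0); print NR ": NF=" NF }' "$sample")
printf '%s\n' "$counts"
rm -f "$sample"
```

With gawk this should report `1: NF=3` for the complete block and `2: NF=1` for the trailing text, which is why the NF==3 condition filters out the incomplete record.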