I have this awk command:
while read -r fname
do
part1="$(awk '/===/g {p=1; next}/***/ {exit} /^$/ {next} p==1 {gsub(/^[[:space:]]+|[[:space:]]+$/,"",$0);gsub("\"","\"\"",$0); print}' $fname)"
part2="$(awk '/***/{p=1; next}/###/ {exit} /^$/ {next} p==1 {gsub(/^[[:space:]]+|[[:space:]]+$/,"",$0);gsub("\"","\"\"",$0); print}' $fname)"
if [[ $part1 = *[!\ ]* ]] && [[ $part2 = *[!\ ]* ]]; then
echo "$fname,\"$part1\",\"$part2\"" >> extracted_text.csv
fi
done < Flist.txt
where Flist.txt contains a list of file names, for example:
$ cat Flist.txt
file1.txt
and file1.txt has the following content:
This is section 1
===
This is section 2
***
This is section 3
###
This is section 4
This is section 5
===
This is section 6
***
This is section 7
###
This is section 8
I am trying to extract the text between === and ###, and then split the extracted text on ***. The idea is to get a CSV file of the matched sections, such as:
file1.txt,This is section 2,This is section 3
file1.txt,This is section 6,This is section 7
Instead, I only get one match:
file1.txt,This is section 2,This is section 3
Note: I am new to awk and sed. I tried sed first but eventually settled on awk. However, I am stuck on getting the desired output. Thanks for your help.
Answer 0 (score: 0)
Pure bash solution (apart from one sed call for the final substitution):
while read -r fname; do
    track=0                         # reset state for each file
    t=''
    while read -r line
    do
        if [[ "$line" = "===" ]]; then
            track=1                 # start marker: begin collecting
            continue
        elif [[ "$line" = "###" ]]; then
            track=0                 # end marker: print what was collected
            echo "$fname,$t" | sed 's/\*\*\*/,/g'
            t=''
        fi
        if [[ "$track" = 1 ]]; then
            t="$t$line"
        fi
    done < "$fname"
done < Flist.txt > output.csv
Output (written to output.csv):
file1.txt,This is section 2,This is section 3
file1.txt,This is section 6,This is section 7
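All we do is look for the start and end markers while reading the file, and set a flag when we encounter them. After ===, we append everything to a variable until we reach ###. On reaching ###, we print the variable, replacing *** with , to split it.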
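The marker-tracking logic described above can also be sketched without the call to sed, using bash pattern substitution instead. This is only an illustrative variant, not the answerer's code: the function name extract_sections and the temporary sample file are assumptions made for the demo.

```shell
#!/usr/bin/env bash
# Sketch: flag-based extraction as a function (hypothetical name).
extract_sections() {
    local fname=$1 line joined track=0 t=''
    while IFS= read -r line; do
        if [[ $line == "===" ]]; then
            track=1; t=''               # start marker: begin collecting
            continue
        elif [[ $line == "###" ]]; then
            if [[ $track == 1 ]]; then
                joined=${t//"***"/,}    # turn every literal *** into a comma
                printf '%s,%s\n' "$fname" "$joined"
            fi
            track=0; t=''               # end marker: stop collecting
            continue
        fi
        if [[ $track == 1 ]]; then
            t+=$line                    # append the line to the collected text
        fi
    done < "$fname"
}

# Demo on a throwaway copy of the sample input from the question.
tmpdir=$(mktemp -d)
printf '%s\n' 'This is section 1' '===' 'This is section 2' '***' \
    'This is section 3' '###' 'This is section 4' 'This is section 5' \
    '===' 'This is section 6' '***' 'This is section 7' '###' \
    'This is section 8' > "$tmpdir/file1.txt"
result=$(cd "$tmpdir" && extract_sections file1.txt)
printf '%s\n' "$result"
```

Unlike the loop above, this variant prints only when a === was seen before the ###, so a stray ### does not produce an empty row.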
Answer 1 (score: 0)
#!/usr/bin/env bash
while read -r filename; do
awk -v OFS=',' '/^[=]+/{start=1; printf "%s", FILENAME; next}/^[*]+/{next}/^[#]+/{start=0;print ""}start{printf "%s%s", OFS, $0}' "$filename"
done < "Flist.txt" > outfile.csv
Explanation:
awk -v OFS=',' '/^[=]+/{            # line starts with =
        start=1;                    # set variable start = 1
        printf "%s", FILENAME;      # print the file name (no newline)
        next                        # go to next line
    }
    /^[*]+/{                        # line starts with *
        next                        # skip, go to next line
    }
    /^[#]+/{                        # line starts with #
        start=0;                    # end of section, reset start
        print ""                    # print a newline
    }
    start{                          # as long as start is non-zero
        printf "%s%s", OFS, $0      # print output field separator
                                    # and current line/record/row
    }
    ' "$filename"
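/^[#]+/
^ asserts the position at the start of the string.
[#]+ matches a single character from the list (here only #); the + quantifier matches between one and unlimited times, as many times as possible, giving back as needed (greedy).
/^[*]+/ and /^[=]+/ work the same way as the regexp above.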
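To see what these anchored bracket expressions accept, the sketch below feeds a few made-up lines to awk. The sample lines and the function name show_markers are assumptions for illustration. It also contrasts the loose pattern ^[#]+ with a stricter ^###$, which may be preferable if the data can contain other lines that begin with #.

```shell
#!/usr/bin/env bash
# Sketch: label lines by which end-marker pattern they match.
show_markers() {
    awk '
        /^###$/ { print "strict+loose: " $0; next }   # exactly three # characters
        /^[#]+/ { print "loose only: " $0; next }     # starts with one or more #
                { print "no match: " $0 }             # # is absent or not at line start
    '
}

markers=$(printf '%s\n' '###' '# note' 'text ###' | show_markers)
printf '%s\n' "$markers"
```

The anchor ^ is what keeps a line such as "text ###" from being treated as a marker.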
Test results:
$ cat infile.txt
This is section 1
===
This is section 2
***
This is section 3
###
This is section 4
This is section 5
===
This is section 6
***
This is section 7
###
This is section 8
$ awk -v OFS=',' '/^[=]+/{start=1; printf "%s", FILENAME; next}/^[*]+/{next}/^[#]+/{start=0;print ""}start{printf "%s%s", OFS, $0}' infile.txt
infile.txt,This is section 2,This is section 3
infile.txt,This is section 6,This is section 7
Answer 2 (score: 0)
The following awk may also help you.
awk -v OFS=',' '/===/{flag=1;next} /###/{flag="";print FILENAME","val;val="";next} flag && !/^\*\*\*/{val=val?val OFS $0:$0}' Input_file
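Here is the same solution in non-one-liner form. OFS is set to a comma because awk's default OFS is a single space.
awk -v OFS=',' '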
/===/{ ##checking if a line has === in a line, if condition satisfies then do following.
flag=1; ##Setting variable named flag to 1.
next ##using next keyword to skip all further statements.
}
/###/{ ##Checking if a line has string ### if yes then do following.
flag=""; ##Making variable flag as NULL here.
print FILENAME","val; ##Printing Input_file name here by FILENAME and then comma and then value of variable named val.
val=""; ##Nullifying the variable val here.
next ##Using next will skip all further statements here.
}
flag && !/^\*\*\*/{ ##Checking conditions here if variable flag value is NOT NULL and checking if any line is NOT starting from *** then do following.
val=val?val OFS $0:$0 ##create a variable named val whose value will be concatenating its own value(if it is NOT NULL) and will be equal to $0 if NULL.
}
' Input_file ##Mentioning Input_file here.
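The accumulation idiom in the last rule, val=val?val OFS $0:$0, can be tried in isolation. The sketch below is a minimal illustration only: the function name join_lines and the three input lines are made up, and OFS is set to a comma explicitly because awk's default OFS is a space.

```shell
#!/usr/bin/env bash
# Sketch: join all input lines with OFS, adding no separator before the first one.
join_lines() {
    awk -v OFS=',' '
        { val = (val ? val OFS $0 : $0) }   # first line: val=$0; later lines: append OFS and $0
        END { print val }
    '
}

joined=$(printf '%s\n' a b c | join_lines)
printf '%s\n' "$joined"
```

The ternary avoids a leading separator by checking whether val already holds something before appending.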
Answer 3 (score: 0)
Another awk.
while read -r fname; do
    # needs an awk that supports a multi-character RS (e.g. gawk or mawk)
    awk -F'===|[*][*][*]' -v RS='###' -v OFS=',' \
        '{gsub(/\n/,"",$0)} (NF==3){$1=FILENAME;print}' "$fname"
done < Flist.txt > output.csv
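We set the first and second patterns as field separators, the third pattern as the record separator, and a comma as the output field separator. gsub removes the newlines; after the first field is replaced with the file name, the record is printed. NF==3 is added only to avoid printing a trailing incomplete record, such as the one in the sample input.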
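To make the separator setup above concrete, the sketch below prints how many fields each ###-terminated record yields on the sample input. It assumes an awk that accepts a multi-character RS (gawk and mawk do; POSIX only specifies single-character behavior), and the function name count_fields is made up for the demo.

```shell
#!/usr/bin/env bash
# Sketch: show how RS='###' and the two field separators carve up the sample.
count_fields() {
    awk -F'===|[*][*][*]' -v RS='###' '
        { gsub(/\n/, "") }                  # drop newlines; $0 is re-split afterwards
        { print "record " NR ": NF=" NF }   # report the field count per record
    '
}

counts=$(printf '%s\n' 'This is section 1' '===' 'This is section 2' '***' \
    'This is section 3' '###' 'This is section 4' 'This is section 5' \
    '===' 'This is section 6' '***' 'This is section 7' '###' \
    'This is section 8' | count_fields)
printf '%s\n' "$counts"
```

Only records with exactly three fields are complete ===/***/### groups, which is what the NF==3 guard selects; the trailing "This is section 8" record has a single field and is skipped.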