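I need to check a txt file for multiple phrases and, if a line contains any of them, delete that line from the txt file.
Using an inverted grep with a file containing the phrases to remove works like a charm.
The problem is that I need to search only part of each line, not the whole line.
I need to check the part of the line up to the 10th comma character. If grep finds a phrase after that point, I want to keep the line; if grep matches before that point, I want to delete the line.
I can't figure out how to combine a regular expression with a phrase file. Any suggestions are welcome.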
#!/bin/bash
shopt -s globstar
for f in /uploads/txt/original/**/*.txt ; do
    grep -i -v -w -f phrase.txt "$f" > tmp
    # quote "$f" so paths containing spaces survive the move
    mv tmp "$f"
done
echo "Finished!"
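One way to limit the match to the text before the 10th comma, while still using grep with a phrase file, is to cut each line down to its first ten fields before testing it. The following is only a sketch under assumptions: the helper name `filter_first_ten` is hypothetical, the phrase file has one phrase per line with no empty lines, and fields never contain quoted commas. It starts one `cut` and one `grep` per line, so it is slow on large files.

```shell
# Hypothetical helper: print the lines of file "$2" whose first ten
# comma-separated fields contain none of the phrases listed in file "$1".
filter_first_ten() {
    while IFS= read -r line || [ -n "$line" ]; do
        # Keep only the text before the 10th comma (fields 1 to 10).
        head=$(printf '%s\n' "$line" | cut -d, -f1-10)
        # Same grep flags as the original loop, minus -v: a hit means "drop".
        if ! printf '%s\n' "$head" | grep -q -i -w -f "$1"; then
            printf '%s\n' "$line"
        fi
    done < "$2"
}
```

It would be called as `filter_first_ten phrase.txt "$f" > tmp` in place of the `grep -v` line in the loop above.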
Edit
# Rule to set the flag if the line needs to be printed or not
{
    ok = 1
    # loop up to the tenth column
    for (i = 1; i <= 10; i++) {
        # match against each pattern
        for (p in PATS) {
            if ($i ~ p) {
                ok = 0
            }
        }
    }
}
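Does this mean that every pattern in PATS is run again for each column?
Is it possible to join the 10 columns into one string and check all the patterns against it once, instead of checking each of the 10 columns against all the patterns?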
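As the rule is written, yes: the nested loops run up to 10 × (number of phrases) regex matches per line. A possible alternative, sketched here under assumptions, is to join the phrases into one alternation regex and test it once against the first ten fields joined back together. The helper name `filter_joined` and the `phrases` awk variable are hypothetical. The sketch assumes the phrases are plain words (it lowercases both sides with `tolower` rather than relying on gawk's `IGNORECASE`) and that fields contain no quoted commas.

```shell
# Hypothetical helper: print the lines of file "$2" whose first ten fields
# are not exactly equal (ignoring case) to any phrase listed in file "$1".
filter_joined() {
    awk -F, -v phrases="$1" '
        BEGIN {
            # Build one alternation such as "foo|bar" from the phrase file.
            while ((getline a < phrases) > 0)
                if (a != "") alt = (alt == "" ? "" : alt "|") tolower(a)
            # A phrase must fill a whole field: bounded by start or comma
            # on the left and by comma or end on the right.
            re = "(^|,) *(" alt ") *(,|$)"
        }
        {
            # Rebuild the text before the 10th comma.
            n = (NF < 10 ? NF : 10)
            head = ""
            for (i = 1; i <= n; i++) head = head (i > 1 ? "," : "") $i
            # One regex test per line instead of 10 x number-of-phrases.
            if (alt == "" || tolower(head) !~ re) print
        }' "$2"
}
```

For example, `filter_joined /tmp/phrases /tmp/test` takes the phrase file first and the data file second.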
Answer 0 (score: 0)
Input data: /tmp/test
Col1, Col2, Col3, Col4, Col5, Col6, Col7, Col8, Col9, Col10, Col11, Col12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
FOO, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
FOO1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
foo, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
Val1, BAR, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, FOO, Val12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, BAR, Val12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, FOO, Val11, Val12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, BAR, Val11, Val12
Phrases file: /tmp/phrases
FOO
BAR
Awk script with comments
#!/usr/bin/gawk -f
BEGIN {
    FS = " *, *"      # field separator regex: a comma with optional spaces around it
    IGNORECASE = 1    # gawk extension: ignore case in regex matches
    # Read the phrases file into an array.
    # Prepend '^' and append '$' to each phrase so it must match a whole field.
    # Test for "> 0" so a missing file (getline returns -1) cannot loop forever.
    while ((getline a < "/tmp/phrases") > 0) PATS["^" a "$"]
}
# Rule to set the flag if the line needs to be printed or not
{
    ok = 1
    # loop up to the tenth column
    for (i = 1; i <= 10; i++) {
        # match against each pattern
        for (p in PATS) {
            if ($i ~ p) {
                ok = 0
            }
        }
    }
}
# Rule to actually print the line if the flag is set
ok {print}
# Debugging rule. Remove it from the real script.
END { for (p in PATS) print p }
# One-liner
# gawk 'BEGIN{FS=" *, *";IGNORECASE=1;while((getline a < "/tmp/phrases")>0)PATS["^"a"$"]}{ok=1;for(i=1;i<=10;i++){for(p in PATS){if($i ~ p){ok=0}}}} ok {print}' /tmp/test
Output:
Col1, Col2, Col3, Col4, Col5, Col6, Col7, Col8, Col9, Col10, Col11, Col12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
FOO1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, FOO, Val12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, BAR, Val12
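To tie this back to the loop in the question, the same per-field check could be applied to every .txt file under a directory and written back in place. This is a minimal sketch, not a drop-in replacement: the helper name `filter_tree` is hypothetical, `find` stands in for the `globstar` loop, `tolower` replaces the gawk-only `IGNORECASE` so that any POSIX awk works, and file names are assumed not to contain newlines.

```shell
# Hypothetical helper: for every *.txt file under directory "$2", keep only
# the lines whose first ten fields match none of the phrases in file "$1".
filter_tree() {
    find "$2" -type f -name '*.txt' | while IFS= read -r f; do
        tmp=$(mktemp) || continue
        if awk -F ' *, *' -v phrases="$1" '
            BEGIN {
                # Anchor each phrase so it must match a whole field.
                while ((getline a < phrases) > 0) pats["^" tolower(a) "$"]
            }
            {
                for (i = 1; i <= 10 && i <= NF; i++)
                    for (p in pats)
                        if (tolower($i) ~ p) next   # phrase found: drop the line
                print
            }' "$f" > "$tmp"
        then
            # Overwrite the contents only if awk succeeded; writing through
            # cat keeps the original file permissions.
            cat "$tmp" > "$f"
        fi
        rm -f "$tmp"
    done
}
```

A call such as `filter_tree /tmp/phrases /uploads/txt/original` would then replace the whole `for` loop. Keep the phrase file outside the directory being filtered, or give it a different extension, so it is not rewritten itself.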