Question

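I need to check a txt file for multiple phrases and, if a line contains any of them, delete that line from the txt file.

Using an inverted grep with a file containing the phrases to delete works like a charm.

The problem is that I need to search only part of each line, not the whole line.

I need to check the part of the line up to the 10th comma character. If grep finds a phrase after that point, I want to keep the line; if grep matches before that point, I want to delete the line.

I can't figure out how to combine a regex with the phrase file. Any suggestions are welcome.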

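#!/bin/bash

shopt -s globstar

for f in /uploads/txt/original/**/*.txt ; do

  # -i ignore case, -v invert the match, -w whole words only,
  # -f read the phrases from phrase.txt
  grep -i -v -w -f phrase.txt "$f" > tmp
  mv tmp "$f"   # quoted so paths containing spaces survive

done

echo "Finished!"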

Edit

# Rule to set the flag if the line needs to be printed or not
{
    ok = 1
    # loop up to the tenth column
    for (i = 1; i <= 10; i++){
        # match against each pattern
        for (p in PATS) {
            if ($i ~ p) {
                ok = 0
            }
        }
    }
}

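Does this mean that PATS is run again for every column?

Would it be possible to combine the 10 columns into one string and check that once against all patterns, instead of checking each of the 10 columns against all patterns?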

Answer 1

Input data (/tmp/test):

Col1, Col2, Col3, Col4, Col5, Col6, Col7, Col8, Col9, Col10, Col11, Col12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
FOO,  Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
FOO1,  Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
foo,  Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
Val1, BAR,  Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, FOO,   Val12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, BAR,   Val12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, FOO,   Val11, Val12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, BAR,   Val11, Val12

Phrases (/tmp/phrases):

FOO
BAR

Awk script with comments:

#!/usr/bin/gawk -f

BEGIN {
    FS         = " *, *" # Field Separator regex to split words
    IGNORECASE = 1       # ignore case for regex match

    # read the phrases file into an array
    # prepend '^' and append '$' to each phrase for an exact field match
    # "> 0" ends the loop on EOF (0) and on a read error (-1)
    while ((getline a < "/tmp/phrases") > 0) PATS["^"a"$"]
}

# Rule to set the flag if the line needs to be printed or not
{
    ok = 1
    # loop up to the tenth column
    for (i = 1; i <= 10; i++){
        # match against each pattern
        for (p in PATS) {
            if ($i ~ p) {
                ok = 0
            }
        }
    }
}

# Rule to actually print the line if the flag is set
ok {print}

# Debugging rule: dumps the loaded patterns. Remove it in real use.
END { for (p in PATS) print p }

# One-liner
#  gawk 'BEGIN{FS=" *, *";IGNORECASE=1;while((getline a < "/tmp/phrases") > 0)PATS["^"a"$"]}{ok=1;for(i=1;i<=10;i++){for(p in PATS){if($i ~ p){ok=0}}}} ok {print}' /tmp/test

Output:

Col1, Col2, Col3, Col4, Col5, Col6, Col7, Col8, Col9, Col10, Col11, Col12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
FOO1,  Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, FOO,   Val12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, BAR,   Val12
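On the follow-up about testing the first ten columns as one string: yes, the script above tries every pattern against every column. Below is a minimal sketch of the combined approach, assuming the phrases contain no commas or regex metacharacters and that a phrase must equal a whole field, as in the script above. The function name filter_prefix and the variable names are illustrative, not from the original answer. It uses tolower() instead of gawk's IGNORECASE, so it should also run on other POSIX awks.

```shell
#!/bin/bash
# filter_prefix: hypothetical helper name, not part of the original answer.
# Usage: filter_prefix PHRASES_FILE DATA_FILE
filter_prefix() {
  awk -F',' '
    # First file: collect the phrases into one alternation, e.g. "foo|bar".
    FILENAME == ARGV[1] {
      p = tolower($0)
      gsub(/^[ \t]+|[ \t]+$/, "", p)
      if (p != "") alt = (alt == "" ? "" : alt "|") p
      next
    }
    # No phrases were loaded: keep every line.
    alt == "" { print; next }
    {
      # Join the first 10 fields into one lowercased string, trimming the
      # blanks around each field and wrapping every field in commas.
      prefix = ","
      n = (NF < 10 ? NF : 10)
      for (i = 1; i <= n; i++) {
        field = tolower($i)
        gsub(/^[ \t]+|[ \t]+$/, "", field)
        prefix = prefix field ","
      }
      # One regex test per line; the commas force a whole-field match.
      if (prefix !~ (",(" alt "),")) print
    }
  ' "$1" "$2"
}
```

With the sample files above, `filter_prefix /tmp/phrases /tmp/test` should print the same five lines as the output shown. The fields still have to be visited once to trim them, but the phrases are tested in a single match per line rather than once per column.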

Credit to this answer: https://stackoverflow.com/a/14471194/2032943
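To apply the filter to every file, as the loop in the question does, something like the following sketch might work. It is not from the original answer: the name clean_tree is hypothetical, it swaps the globstar loop for find (so it does not depend on bash 4, but it assumes file names contain no newlines), and it swaps the regex loop for an exact, case-insensitive array lookup per field, assuming the phrases are plain literal strings.

```shell
#!/bin/bash
# clean_tree: hypothetical helper name, not part of the original answer.
# Usage: clean_tree ROOT_DIR PHRASES_FILE
clean_tree() {
  find "$1" -type f -name '*.txt' | while IFS= read -r f; do
    tmp=$(mktemp) || break
    if awk -F',' '
         # First file: store each phrase, lowercased, as an array key.
         FILENAME == ARGV[1] { if ($0 != "") seen[tolower($0)] = 1; next }
         {
           drop = 0
           n = (NF < 10 ? NF : 10)
           for (i = 1; i <= n && !drop; i++) {
             field = tolower($i)
             gsub(/^[ \t]+|[ \t]+$/, "", field)
             if (field in seen) drop = 1   # one lookup per field
           }
           if (!drop) print
         }
       ' "$2" "$f" > "$tmp"; then
      # Overwrite in place so the file keeps its owner and permissions.
      cat "$tmp" > "$f"
    fi
    rm -f "$tmp"
  done
}
```

For the setup in the question the call would be `clean_tree /uploads/txt/original phrase.txt`. A file is only overwritten when awk exits successfully, so a failed run leaves the original untouched.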

GREP on part of a line using a keyword file
