Question

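I am processing text files through a long pipeline in a bash script, and I need to do the following in a single step:

remove the substrings matched by some regular expression
write them to a file
continue processing the rest of the text.

I can use anything that works in a pipeline. What is the simplest/fastest way?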

Update, for example:

echo -e " apple pears banana \n kiwi ananas cocoa" | magic_script " [ab][a-z]+" removed.txt | cat

Output:

pears kiwi cocoa

removed.txt:

apple banana ananas

What should take the place of magic_script " [ab][a-z]+" removed.txt? It should work for any text and any regular expression.

Update 2:

As another example, if the regexp is /a.{2,3}/:

Output: similar to the result of sed -E "s/a.{2,3}//g"

e peba kiwi ocoa

removed.txt: similar to the result of grep -Eo "a.{2,3}"

appl ars anan anan as c
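
For reference, the two commands named in Update 2 can be run as separate passes over the same input. This is only a minimal sketch, assuming a sed that accepts -E and a grep that accepts -o (GNU and BSD versions do); the names input, re and kept are illustrative, and buffering the input in a variable is exactly what a one-step pipeline solution would avoid:

```shell
#!/bin/sh
# Two-pass reference for the expected results: grep collects the matches,
# sed deletes them. Variable names are illustrative only.
input=$(printf ' apple pears banana \n kiwi ananas cocoa\n')
re='a.{2,3}'   # assumed not to contain "/", which would break the sed command

# Pass 1: write every match to removed.txt, one per line.
printf '%s\n' "$input" | grep -Eo "$re" > removed.txt

# Pass 2: delete every match and keep the remaining text.
kept=$(printf '%s\n' "$input" | sed -E "s/$re//g")
printf '%s\n' "$kept"
```

Note that some matches keep a trailing space or span a word boundary (for example "ars " and "as c"), because the dot also matches a space.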

Answer 1

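This can be done with sed, but because the regular expression and the file name are not fixed, and sed does not handle shell variables well, awk is the better tool. The awk code we want to run could look like this: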

{
  head = ""
  tail = $0

  while(match(tail, re)) {                     # while there's a match in the
                                               # part of the line we haven't
                                               # yet inspected
    print substr(tail, RSTART, RLENGTH) > file # print the match to the
                                               # file
    head = head substr(tail, 1, RSTART - 1)    # split off the parts before
    tail = substr(tail, RSTART + RLENGTH)      # and after the match
  }
  print head tail                              # print what's left in the end
}

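with suitable parameters re and file. Thanks to @EdMorton, who pointed out a problem with the original code and suggested this fix.

To make this callable the way you have it in the question, let's put a little shell boilerplate around it: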

#!/bin/sh

if [ $# -ne 2 ]; then
    echo "Usage: $0 regex filename"
    exit 1
fi

awk -v re="$1" -v file="$2" '
{
  head = ""
  tail = $0

  while(match(tail, re)) {
    print substr(tail, RSTART, RLENGTH) > file
    head = head substr(tail, 1, RSTART - 1)
    tail = substr(tail, RSTART + RLENGTH)
  }
  print head tail
}'

Put this in a file named magic_script, run chmod +x on it, and you are good to go. Of course, you can also call awk directly:

awk -v re=' [ab][a-z]+' -v file=removed.txt '{ head = ""; tail = $0; while(match(tail, re)) { print substr(tail, RSTART, RLENGTH) > file; head = head substr(tail, 1, RSTART - 1); tail = substr(tail, RSTART + RLENGTH); } print head tail }'
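
To try this without creating a script file, the same awk program can be wrapped in a shell function. The sketch below is only a demonstration under assumptions: it defines magic_script as a function rather than an executable, and it uses printf instead of echo -e so that it behaves the same under any POSIX shell:

```shell
#!/bin/sh
# Hypothetical demo: magic_script as a shell function (the name comes from
# the question), run on the sample input from the question.
magic_script() {
    awk -v re="$1" -v file="$2" '
    {
      head = ""
      tail = $0
      while (match(tail, re)) {
        print substr(tail, RSTART, RLENGTH) > file   # save the match
        head = head substr(tail, 1, RSTART - 1)      # keep the text before it
        tail = substr(tail, RSTART + RLENGTH)        # continue after it
      }
      print head tail
    }'
}

result=$(printf ' apple pears banana \n kiwi ananas cocoa\n' |
         magic_script ' [ab][a-z]+' removed.txt | cat)
printf '%s\n' "$result"
cat removed.txt
```

One caveat worth keeping in mind: a regular expression that can match the empty string (such as x*) would never shorten tail, so the loop is only safe for patterns that always consume at least one character.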

Answer 2

AWK can be used for this purpose.

See https://www.gnu.org/software/gawk/manual/html_node/Redirection.html, which contains the following conceptual example:

$ awk '{ print $2 > "phone-list"
>        print $1 > "name-list" }' mail-list
$ cat phone-list
-| 555-5553
-| 555-3412
…
$ cat name-list
-| Amelia
-| Anthony
…

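Here mail-list is populated with two columns of information: the first column contains names and the second contains phone numbers.

See the match(string,regex) function (http://www.grymoire.com/Unix/Awk.html#uh-47) to capture the regular expression, keeping in mind that $0 means the entire line that was read in. This function sets RSTART and RLENGTH, which can be used with the substr(string,position,length) function (http://www.grymoire.com/Unix/Awk.html#uh-43) to return the matched pattern (string = $0 if you are searching line by line).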
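
As a small illustration of how match, RSTART, RLENGTH and substr fit together, the following sketch prints the position, the length and the text of the first match on a line. The sample line and the variable name out are chosen here for illustration only:

```shell
#!/bin/sh
# match() sets RSTART (1-based start) and RLENGTH (length) for the first
# match; substr() then extracts the matched text from the line.
out=$(printf 'kiwi ananas cocoa\n' |
      awk '{
        if (match($0, /[ab][a-z]+/))
          print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
      }')
printf '%s\n' "$out"   # prints: 6 6 ananas
```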

An excellent introduction to AWK is here: http://www.grymoire.com/Unix/Awk.html ... it may look long, but it is worth the investment.

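Update

If you are actually dealing with multiple lines containing fields of information, and you do not particularly care whether the found items are printed in the same columnar form, then the following will work:

echo -e " apple pears banana \n kiwi ananas cocoa\n pork" |
awk '{
  #printf "\n"
  for(j=1;j<=NF;j++){
    i=match($j,/[ab][a-z]+/)
    if(i>0){
      print $j > "removed.txt"
    }else{
      printf $j " "
    }
  }
}'

If you do care about preserving the columnar form, you can use the commented-out printf above with a little massaging to get it just right (and replace the print statement with printf $j " "). However, because AWK works on fields, this approach runs into problems if you want to capture multiple instances of the pattern within a single field (i.e., with no delimiter between them).

Update 2

Here is a better solution that ensures all matches are found and does not depend on fields:

echo -e " apple pears banana \n kiwi ananas cocoa" |
awk '
BEGIN {
  regex="a.{2,3}";
}
{
  ibeg=1;                                   # start of the part not yet scanned
  imat=match(substr($0,ibeg),regex);
  after=$0;
  while (imat) {
    before = substr($0,ibeg,RSTART-1);      # text preceding the match
    pattern = substr($0,ibeg+RSTART-1,RLENGTH);
    after = substr($0,ibeg+RSTART+RLENGTH-1);
    printf "%s", before;                    # "%s" keeps any % in the text literal
    print pattern >"removed.txt";
    ibeg=ibeg+RSTART+RLENGTH-1;
    imat=match(substr($0,ibeg),regex);
  }
  print after;
}
'

For this sample input, and with an awk that supports interval expressions such as {2,3}, the output and the contents of removed.txt correspond to the results requested in Update 2 of the question.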

Answer 3

Here is a solution that keeps the lines intact apart from what is removed:

$ echo -e "apple pears banana \n kiwi ananas cocoa" \
| awk '{ for (i=1;i<=NF;++i) { if ($i ~ /^[ab][a-z]+/) { print $i > "removed.txt"; $i=""}} print }'
 pears 
kiwi  cocoa

$ cat removed.txt 
apple
banana
ananas
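
Blanking a field leaves its surrounding separators in place, which is why the output above contains doubled spaces. If that matters, one possible variation (a sketch, not part of the answer above; the variable name line is illustrative) is to rebuild each line from only the fields that are kept:

```shell
#!/bin/sh
# Variation on the field-based approach: collect the kept fields in "line"
# and join them with single spaces instead of blanking removed fields.
result=$(printf 'apple pears banana \n kiwi ananas cocoa\n' |
  awk '{
    line = ""
    for (i = 1; i <= NF; ++i) {
      if ($i ~ /^[ab][a-z]+/) {
        print $i > "removed.txt"                   # removed word goes to the file
      } else {
        line = line (line == "" ? "" : " ") $i     # kept word joins the output
      }
    }
    print line
  }')
printf '%s\n' "$result"
```

Like the original, this works on whole whitespace-separated fields, so it normalises spacing and cannot remove a match that covers only part of a word.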

Answer 4

With GNU awk, using the 4th argument of split():

$ cat tst.awk
{
    split($0,flds,re,seps)
    for (i=1;i in flds;i++) {
        printf "%s", flds[i]
        if (i in seps)
            print seps[i] > "removed.txt"
    }
    print ""
}

$ echo -e " apple pears banana \n kiwi ananas cocoa" | awk -v re=' [ab][a-z]+' -f tst.awk
 pears
 kiwi cocoa

$ cat removed.txt
 apple
 banana
 ananas

Bash: storing the replaced substrings