Question

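I want to use a regular expression to output the number of times a pattern repeats. For example, convert "aaad" to "3xad" and "bCCCCC" to "b5xC". I want to do this in sed or awk.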

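I know I can match a run with (.)\1+, and even capture it with ((.)\2+). But how do I get the number of repetitions and insert that value back into the string, using a regex in sed or awk?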

Answer 1

Perl to the rescue!

perl -pe 's/((.)\2+)/length($1) . "x$2"/ge'

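-p reads the input line by line and prints each line after processing
s/// is substitution, similar to sed's
/e evaluates the replacement as code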

For example:

aaadbCCCCCxx -> 3xadb5xC2xx
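
For comparison, the same "replacement is code" idea can be sketched in Python, where re.sub accepts a function as the replacement and so plays roughly the role of Perl's /e flag. This is only an illustration, not part of the original answer; the name compress_runs is made up here.

```python
import re

def compress_runs(text: str) -> str:
    """Replace each run of 2+ identical characters with '<count>x<char>'."""
    # ((.)\2+): group 2 is a single character, group 1 is the whole run of it.
    # '.' does not match a newline here, so runs never span lines.
    return re.sub(
        r"((.)\2+)",
        lambda m: f"{len(m.group(1))}x{m.group(2)}",
        text,
    )

print(compress_runs("aaadbCCCCCxx"))  # 3xadb5xC2xx
```

As in the Perl one-liner, the length of the outer capture gives the repetition count, and the inner capture gives the character to print after it.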

Answer 2

In GNU awk:

$ echo aaadbCCCCCxx |  awk -F '' '{
    for(i=1;i<=NF;i+=RLENGTH) {
        c=$i
        match(substr($0,i),c"+")
        b=b (RLENGTH>1?RLENGTH "x":"") c
    }
    print b
}'
3xadb5xC2xx

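If the input can contain regex metacharacters that should be read as literal characters, as noted in the comments, you can try to detect and escape them (the solution below is only a rough sketch, and its list of metacharacters is not complete):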

$ echo \\\\\\..**aaadbCCCCC++xx |
awk -F '' '{
    for(i=1;i<=NF;i+=RLENGTH) { 
        c=$i                               
        # print i,c                        # for debugging
        if(c~/[*.\\]/)                     # if c is a regex metachar (not complete)
            c="\\"c                        # escape it
        match(substr($0,i),c"+")           # find all c:s
        b=b (RLENGTH>1?RLENGTH "x":"") $i  # buffer to b
    }
    print b
}'
3x\2x.2x*3xadb5xC2x+2xx
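
The escaping step can also be illustrated in Python, where re.escape covers every metacharacter instead of a hand-picked set. This is a sketch of the same per-character loop under that assumption, not a translation of the awk script; compress_runs_escaped is a hypothetical name.

```python
import re

def compress_runs_escaped(text: str) -> str:
    """Run-length compress text, treating every character literally."""
    out = []
    i = 0
    while i < len(text):
        c = text[i]
        # Escape c so that characters such as '*', '.', '+' or '\' match
        # themselves, then match the run of c starting at position i.
        run = re.compile(re.escape(c) + "+").match(text, i)
        n = len(run.group(0))
        out.append((f"{n}x" if n > 1 else "") + c)
        i += n
    return "".join(out)

print(compress_runs_escaped(r"\\\..**aaadbCCCCC++xx"))
```

The raw string holds three backslashes, the same input the shell passes to awk in the example above.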

Answer 3

Just for fun.

This is cumbersome but doable with sed. Note that this example relies on GNU sed:

parse.sed

/(.)\1+/ {
  : nextrepetition
  /((.)\2+)/ s//\n\1\n/             # delimit the repetition with new-lines
  h                                 # and store the delimited version
  s/^[^\n]*\n|\n[^\n]*$//g          # now remove prefix and suffix
  b charcount                       # count repetitions
  : aftercharcount                  # return here after counting
  G                                 # append the new-line delimited version

  # Reorganize pattern space to the desired format
  s/^([^\n]+)\n([^\n]*)\n(.)[^\n]+\n/\2\1x\3/

  # Run again if more repetitions exist
  /(.)\1+/b nextrepetition
}

b

# Adapted from the wc -c example in the sed manual
# Ref: https://www.gnu.org/software/sed/manual/sed.html#wc-_002dc
: charcount

s/./a/g

# Do the carry.  The t's and b's are not necessary,
# but they do speed up the thing
t a
: a;  s/aaaaaaaaaa/b/g; t b; b done
: b;  s/bbbbbbbbbb/c/g; t c; b done
: c;  s/cccccccccc/d/g; t d; b done
: d;  s/dddddddddd/e/g; t e; b done
: e;  s/eeeeeeeeee/f/g; t f; b done
: f;  s/ffffffffff/g/g; t g; b done
: g;  s/gggggggggg/h/g; t h; b done
: h;  s/hhhhhhhhhh//g

: done

# Convert the unary count back to decimal

: loop
/a/! s/[b-h]*/&0/
s/aaaaaaaaa/9/
s/aaaaaaaa/8/
s/aaaaaaa/7/
s/aaaaaa/6/
s/aaaaa/5/
s/aaaa/4/
s/aaa/3/
s/aa/2/
s/a/1/

y/bcdefgh/abcdefg/
/[a-h]/ b loop

b aftercharcount

Run it like this:

sed -Ef parse.sed infile

With an infile like this:

aaad
daaadaaa
fsdfjs
bCCCCC
aaadaaa

The output is:

3xad
d3xad3xa
fsdfjs
b5xC
3xad3xa
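
The charcount routine in parse.sed is the least obvious part: it counts characters without arithmetic by writing the count in unary and carrying tens into the next letter. A minimal Python sketch of that idea follows, assuming runs much shorter than 10^8 characters; sed_style_count is a made-up name, and the digit step uses str.count as a shortcut for the nine separate substitutions in the sed script.

```python
import re

def sed_style_count(run: str) -> str:
    """Return len(run) as a decimal string using sed-like text rewriting."""
    # One 'a' per character: the count in unary.
    s = "a" * len(run)
    # Carry: ten of one letter become one of the next (a=ones, b=tens, ...).
    for low, high in zip("abcdefg", "bcdefgh"):
        s = s.replace(low * 10, high)
    # Convert one decimal position per pass, lowest position first.
    while re.search(r"[a-h]", s):
        if "a" not in s:
            # No units at this position: put a 0 after the leading letters.
            s = re.sub(r"[b-h]*", lambda m: m.group(0) + "0", s, count=1)
        else:
            # Between 1 and 9 adjacent 'a's: replace them with that digit.
            n = s.count("a")
            s = s.replace("a" * n, str(n), 1)
        # Shift every remaining letter down one position (b->a, c->b, ...).
        s = s.translate(str.maketrans("bcdefgh", "abcdefg"))
    return s

print(sed_style_count("CCCCC"))   # 5
print(sed_style_count("x" * 13))  # 13
```

For 13 characters the intermediate strings are "baaa", then "b3", then "a3", then "13", which mirrors what the sed loop does in the pattern space.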

Answer 4

I was hoping we would have an MCVE by now, but we don't, so here is my best guess at what you are trying to do:

$ cat tst.awk
{
    out = ""
    for (pos=1; pos<=length($0); pos+=reps) {
        char = substr($0,pos,1)
        for (reps=1; char == substr($0,pos+reps,1); reps++);
        out = out (reps > 1 ? reps "x" : "") char
    }
    print out
}

$ awk -f tst.awk file
3xad
d3xad3xa
fsdfjs
b5xC
3xad3xa

The above was run against the sample input provided by @Thor:

$ cat file
aaad
daaadaaa
fsdfjs
bCCCCC
aaadaaa

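The above will work for any input characters, using any awk, in any shell, on any UNIX box. If you need it to be case-insensitive, just wrap each side of the comparison in the innermost for loop in tolower(). If you need it to work on multi-character strings, then you will have to tell us how to determine where the substrings you are interested in start and end.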
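
As a rough illustration of the regex-free loop in tst.awk and of the case-insensitive variant, here is a Python sketch. The function name and the ignore_case parameter are assumptions made for this example; when case is ignored, the first character of each run is the one that gets printed.

```python
def compress_runs_loop(line: str, ignore_case: bool = False) -> str:
    """Run-length compress line by plain character comparison, no regex."""
    def same(a: str, b: str) -> bool:
        # Mirrors wrapping both sides of the awk comparison in tolower().
        return a.lower() == b.lower() if ignore_case else a == b

    out = []
    pos = 0
    while pos < len(line):
        char = line[pos]
        reps = 1
        # Count how many following characters equal the current one.
        while pos + reps < len(line) and same(char, line[pos + reps]):
            reps += 1
        out.append((f"{reps}x" if reps > 1 else "") + char)
        pos += reps
    return "".join(out)

print(compress_runs_loop("daaadaaa"))                # d3xad3xa
print(compress_runs_loop("aAab", ignore_case=True))  # 3xab
```

Because nothing is ever interpreted as a regular expression, metacharacters in the input need no special handling.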
