Question

I'm working on a school project. Using tesseract, I extract digits from images. The output I get might look like this:

7586630342033088866

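What I need is to extract every 4-digit group that starts with 63 or 62.

So in this example the result should be 6303. If I get a longer number, such as: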

7586630342033088866234

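the output should be 6303 6234.

I want to do this in a shell script, because I download my images, preprocess them, and run tesseract from the terminal in a single script.

I have tried a few things with sed and awk, without success.

This is the end of the script I am already using.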

printf '\n run tesseract\n'             # printf interprets \n portably; plain echo may print it literally
        cd /media/nummer/tramnummerNummer || exit 1     # stop if the directory does not exist
        x=0                             # reset the counter
        keyword='tramnummer'            # basename used when renaming files
        extention='*.JPG'               # glob for the files to process
        for i in $extention             # expand the glob directly instead of parsing ls output
        do
        x=$((x + 1))                    # increase the counter

        tesseract "tramnummer$x.JPG" "tramnummer$x" -l bet -psm 6      # run tesseract; writes tramnummer$x.txt
        tr -d '[:space:]' < "tramnummer$x.txt" > "tramnummer$x"        # strip whitespace from the tesseract output
#       sed 's/\(.\)/\1\n/g' -i tramnummer$x            # something I tried: puts every digit on a separate line
#       sed 's/[^6]*\(6.*\)/\1/' -i tramnummer$x        # another attempt: deletes every character before the first 6
        done

Can anyone help me solve this or point me in the right direction? Thanks in advance.

Answer 1

Use egrep -o (grep -E -o is the equivalent, non-deprecated spelling), which prints each match on its own line:

s='7586630342033088866234'
echo "$s" | egrep -o '6[23][0-9]{2}'
6303
6234
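
The question asks for the matches on a single line (6303 6234), while grep -o prints one match per line. A minimal sketch of one way to join them follows; the function name extract_codes is hypothetical and not part of the original answer, and paste is simply one convenient tool for the joining step.

```shell
#!/bin/sh
# extract_codes is a hypothetical helper name used only for illustration.
# It prints every 4-digit group starting with 62 or 63, separated by spaces.
extract_codes() {
    # grep -Eo emits one match per line; paste -s joins those lines with a space
    printf '%s\n' "$1" | grep -Eo '6[23][0-9]{2}' | paste -sd ' ' -
}

extract_codes '7586630342033088866234'   # prints: 6303 6234
```

Inside the loop from the question, the same pipeline could presumably read the cleaned file directly, for example grep -Eo '6[23][0-9]{2}' "tramnummer$x" | paste -sd ' ' -.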

Answer 2

Use this, which prints the matches on a single line (6303 6234):

s='7586630342033088866234'
echo "$s" |perl -lne 'push @a,/6[23]../g;print "@a";undef @a'
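
Since the question mentions attempts with awk, here is a sketch of the same extraction in plain awk, offered as an alternative and not as part of the original answer. The function name extract_codes_awk is hypothetical. It relies only on match(), RSTART and RLENGTH, and spells the digits out as [0-9][0-9] because some older awk implementations do not support the {2} interval syntax.

```shell
#!/bin/sh
# extract_codes_awk is a hypothetical helper name used only for illustration.
extract_codes_awk() {
    printf '%s\n' "$1" | awk '{
        out = ""
        # Repeatedly find the leftmost 62xx/63xx group, record it,
        # then continue searching in the text after that match.
        while (match($0, /6[23][0-9][0-9]/)) {
            out = out (out == "" ? "" : " ") substr($0, RSTART, RLENGTH)
            $0 = substr($0, RSTART + RLENGTH)
        }
        print out
    }'
}

extract_codes_awk '7586630342033088866234'   # prints: 6303 6234
```

Unlike the `..` in the perl pattern, which accepts any two characters, this version only accepts digits after the 62 or 63 prefix; for purely numeric input the two behave the same.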
