Question

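I am working on a Spring MVC application in which we are currently integrating OCR. The OCR has a habit of emitting stray ("wild") characters on false detections and when there is an image in the background. After preprocessing the image we get fairly good data, but some errors remain. We would like to post-process the output as follows:

Remove all single characters from the output string.
Remove all characters other than A-Z, a-z and the German characters äöü, ÄÖÜ, ß.
Spaces and digits should be left unchanged.

Code: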

    File imageFile = new File(fileLocation);

    // Convert the input image to pure black and white before running OCR
    BufferedImage img = ImageIO.read(imageFile);
    BufferedImage blackNWhite = new BufferedImage(img.getWidth(), img.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
    Graphics2D graphics = blackNWhite.createGraphics();
    graphics.drawImage(img, 0, 0, null);
    graphics.dispose();

    // Write the converted image to a randomly named PNG file
    String blackAndWhiteImage = zipLocation + new BigInteger(130, random).toString(32) + ".png";
    File outputfile = new File(blackAndWhiteImage);
    ImageIO.write(blackNWhite, "png", outputfile);

    ITesseract instance = new Tesseract();
    // Point to one folder above tessdata directory, must contain training data
    instance.setDatapath("/usr/share/tesseract-ocr/");
    // ISO 639-3 standard
    instance.setLanguage("deu");
    String result = instance.doOCR(outputfile);
    //System.out.println(result);
    // Current attempt: strips every non-ASCII character (this also removes äöü, ÄÖÜ, ß)
    result = result.replaceAll("\\P{ASCII}", "");
    System.out.println("Result is " + result);
    return result;

Thank you.

Update

Wild characters left behind by the regex:

 |
| '(°Ul") 
_} °
=# '
( )
...................................__+_......_._._.__._._._+._._.

Answer 1

Regarding point 1:
result = result.replaceAll("\\s[a-zA-ZöÖäÄüÜß]\\s", "");
Regarding point 2:
result = result.replaceAll("[^a-zA-ZöÖäÄüÜß]", "");
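
For illustration, the two steps can be combined into one helper. This is only a sketch under a few assumptions that are not in the question: the class name `OcrCleaner` and the method `clean` are hypothetical, a "single character" is taken to mean one letter with no letter or digit next to it, digits and whitespace are added to the whitelist so they survive as the question asks, and runs of spaces left behind are collapsed. Unicode escapes are used for the German letters so the result does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tess4J or of the original answer.
public class OcrCleaner {
    // ä ö ü Ä Ö Ü ß written as Unicode escapes
    private static final String GERMAN = "\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc\u00df";
    private static final String LETTERS = "a-zA-Z" + GERMAN;
    private static final String WORD_CHARS = LETTERS + "0-9";

    // Anything that is not a letter, a digit, or whitespace
    private static final Pattern DISALLOWED =
            Pattern.compile("[^" + WORD_CHARS + "\\s]");
    // One letter with no letter or digit directly before or after it
    private static final Pattern SINGLE_LETTER =
            Pattern.compile("(?<![" + WORD_CHARS + "])[" + LETTERS + "](?![" + WORD_CHARS + "])");
    // Runs of spaces or tabs (line breaks are left alone)
    private static final Pattern SPACE_RUN = Pattern.compile("[ \\t]+");

    public static String clean(String ocrText) {
        // Step 1: drop every character outside the whitelist
        String s = DISALLOWED.matcher(ocrText).replaceAll("");
        // Step 2: drop letters that stand alone
        s = SINGLE_LETTER.matcher(s).replaceAll("");
        // Step 3 (assumption): tidy up the gaps left by steps 1 and 2
        return SPACE_RUN.matcher(s).replaceAll(" ").trim();
    }
}
```

In the question's code this would be called as `result = OcrCleaner.clean(result);` in place of the `\\P{ASCII}` replacement. Lookarounds are used instead of `\\s...\\s` so that a single letter at the very start or end of the string is also caught and the surrounding whitespace is not consumed.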

Answer 2

This is the regular expression I finally used to solve the problem:

result = result.replaceAll("[^a-zA-Z0-9öÖäÄüÜß@\\s]", "");

Thank you.
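
To make the difference between the two whitelists concrete, here is a small sketch that applies both to the same input. The class and method names (`WhitelistDemo`, `lettersOnly`, `lettersDigitsSpace`) and the sample string are made up for illustration; only the two regular expressions come from the answers above. The German letters are written as Unicode escapes to avoid depending on the source file encoding.

```java
// Hypothetical demo class comparing the two character whitelists.
public class WhitelistDemo {
    // ö Ö ä Ä ü Ü ß written as Unicode escapes
    private static final String GERMAN = "\u00f6\u00d6\u00e4\u00c4\u00fc\u00dc\u00df";

    // Whitelist from Answer 1, point 2: letters only, so digits and spaces are removed too
    static String lettersOnly(String s) {
        return s.replaceAll("[^a-zA-Z" + GERMAN + "]", "");
    }

    // Whitelist from Answer 2: letters, digits, '@' and whitespace are kept
    static String lettersDigitsSpace(String s) {
        return s.replaceAll("[^a-zA-Z0-9" + GERMAN + "@\\s]", "");
    }
}
```

For an input such as `Haus 12 | _} =# Straße`, `lettersOnly` glues the words together and loses the number, while `lettersDigitsSpace` keeps `Haus`, `12` and `Straße` separated by the original spaces. Neither expression removes isolated single letters, so that requirement still needs a separate step.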

Java: How to remove all characters from a String except a-z, digits and German characters

2 answers: