How to remove \u200B (Zero Length Whitespace Unicode Character) from String in Java?

Time: 2017-03-22 18:49:42

Tags: java regex string unicode outlook

My application uses Spring Integration to poll email from an Outlook mailbox.

Because it receives the String (the email body) from an external system (Outlook), I have no control over its content.

For example:

String emailBodyStr= "rejected by sundar14-\u200B.";

Now I am trying to remove the Unicode character \u200B from this String.

What I have already tried:

Try#1:

emailBodyStr = emailBodyStr.replaceAll("\u200B", "");

Try#2:

emailBodyStr = emailBodyStr.replaceAll("\u200B", "").trim();

Try#3 (using Apache Commons):

StringEscapeUtils.unescapeJava(emailBodyStr);

Try#4:

StringEscapeUtils.unescapeJava(emailBodyStr).trim();

Nothing has worked so far.

When I print this String using the code below:

logger.info("Comment BEFORE:{}",emailBodyStr);
logger.info("Comment AFTER :{}",emailBodyStr);

In the Eclipse console, it does NOT print the Unicode character:

Comment BEFORE:rejected by sundar14-​.

But the same code prints the Unicode character in the Linux console, as below:

Comment BEFORE:rejected by sundar14-\u200B.

I have read some examples where str.replace() is recommended, but please note that those examples use JavaScript and PHP, not Java.

1 Answer:

Answer 0: (score: 7)

Finally, I was able to remove the 'Zero Width Space' character using a 'Unicode Regex'.

// \p{Cf} matches every character in the Unicode "Format" category, which includes U+200B.
String plainEmailBody = emailBodyStr.replaceAll("[\\p{Cf}]", "");
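
The replacement above can be wrapped in a small helper and checked against the sample string from the question. This is only a minimal sketch: the class name ZeroWidthSpaceDemo and the method name stripFormatChars are made up for illustration, and it assumes a current Java runtime, where U+200B is classified in the Cf (Format) category.

```java
class ZeroWidthSpaceDemo {

    // Hypothetical helper: removes every character in the Unicode
    // "Format" category (Cf), which includes U+200B ZERO WIDTH SPACE.
    static String stripFormatChars(String input) {
        return input.replaceAll("\\p{Cf}", "");
    }

    public static void main(String[] args) {
        // Sample value taken from the question.
        String emailBodyStr = "rejected by sundar14-\u200B.";
        String plainEmailBody = stripFormatChars(emailBodyStr);

        // Comparing lengths shows the change even when the console
        // renders the zero width space as nothing at all.
        System.out.println("length before: " + emailBodyStr.length());
        System.out.println("length after : " + plainEmailBody.length());
        System.out.println(plainEmailBody);
    }
}
```

Note that \p{Cf} is broader than a single character: it also removes other format characters such as the soft hyphen (U+00AD), the zero width joiner (U+200D) and the byte order mark (U+FEFF), so it may strip more than intended if the text relies on them.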

References for finding the category of a Unicode character:

  1. The Character class in Java, which lists all of these Unicode categories.
  2. Website: http://www.fileformat.info/ (Character category)
  3. Website: http://www.regular-expressions.info/ => Unicode Regular Expressions (Unicode Regex for the \u200B character)
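
The category lookup can also be done in code rather than on a website. The sketch below is an illustration, not part of the original answer: the class name UnicodeCategoryLookup and the method describe are hypothetical, and it assumes Java 8 or later for String.codePoints(). It reports each non-ASCII code point in the string with its name and whether Java classifies it as a Format (Cf) character, so the result does not depend on how a console renders invisible characters.

```java
class UnicodeCategoryLookup {

    // Hypothetical helper: describes one code point by its hex value,
    // its Unicode name, and whether its general category is Cf (Format).
    static String describe(int codePoint) {
        boolean isFormat = Character.getType(codePoint) == Character.FORMAT;
        return String.format("U+%04X %s format=%b",
                codePoint, Character.getName(codePoint), isFormat);
    }

    public static void main(String[] args) {
        // Sample value taken from the question.
        String emailBodyStr = "rejected by sundar14-\u200B.";

        // Report only the non-ASCII code points, which is where
        // invisible characters from the mail body would show up.
        emailBodyStr.codePoints()
                .filter(cp -> cp > 127)
                .forEach(cp -> System.out.println(describe(cp)));
    }
}
```

If a code point is reported with format=true, the \p{Cf} regex will match it.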

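Note 1: Because I received this string from an Outlook email body, none of the approaches listed in my question worked.

My application receives the String from an external system (Outlook), so I have no control over it.

Note 2: This SO answer helped me learn about Unicode regular expressions.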