Convert a formatted (HTML) email to plain text?

Asked: 2013-11-11 09:01:55

Tags: java html email html-parsing jsoup

I have this code, which implements ParserCallback and converts an HTML email to plain text. It works fine when I parse an email body like this one:
  "DO NOT REPLY TO THIS EMAIL MESSAGE.   <br>---------------------------------------<br>\n" +
                "nix<br>---------------------------------------<br> Esfghjdfkj\n" +
                "</blockquote></div><br><br clear=\"all\"><div><br></div>-- <br><div dir=\"ltr\"><b>Regards <br>Nisj<br>Software Engineer<br></b><div><b>Bingo</b></div></div>\n" +
                "</div>"

But when I parse this kind of email body, it returns null:

 email = "<html><head><meta http-equiv=\"Content-Type\" content=\"text/html charset=us-ascii\"></head><body style=\"word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;\">Got it...so pls send to customer now.<div><br><div style=\"\"><div>On Nov 8, 2013, at 12:31 PM, <a href=\"mailto:xxxxxxx.com\">xxxxxxx.com</a> wrote:</div><br class=\"Apple-interchange-newline\"><blockquote type=\"cite\">Forwarding test.<br>---------------------------------------<br> ABCD.</blockquote></div><br></div></body></html>";

Code:

import java.io.IOException;
import java.io.StringReader;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Attribute;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit.Parser;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

public class EmailBody {
    public static void main(String[] args) throws IOException
    {
        String email = "";

        class EmailCallback extends ParserCallback
        {
            private String body_;
            private boolean divStarted_;

            public String getBody()
            {
                return body_;
            }

            @Override
            public void handleStartTag(Tag t, MutableAttributeSet a, int pos)
            {
                if (t.equals(Tag.DIV) && "ltr".equals(a.getAttribute(Attribute.DIR)))
                {
                    divStarted_ = true;
                }
            }

            @Override
            public void handleEndTag(Tag t, int pos)
            {
                if (t.equals(Tag.DIV))
                {
                    divStarted_ = false;
                }
            }

            @Override
            public void handleText(char[] data, int pos)
            {
                if (divStarted_)
                {
                    body_ = new String(data);
                }
            }
        }
        EmailCallback callback = new EmailCallback();
        Parser parser = new ParserDelegator();
        StringReader reader = new StringReader(email);
        parser.parse(reader, callback, true);
        reader.close();
        System.out.println(callback.getBody());
    }
}

Can you tell me why this happens?

1 answer:

Answer 0 (score: 1)

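Your code only picks up text from DIV elements whose dir attribute has the value ltr. The handleText method stores text only while the divStarted_ flag is true, and that flag becomes true only when handleStartTag sees such a DIV. The first sample email contains an element like that; the second does not, so body_ is never assigned and getBody() returns null.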
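
If the goal is the plain text of the whole message rather than one particular DIV, one option is to drop the dir="ltr" condition and append every text run the parser reports. The following is only a minimal sketch along those lines, not a full HTML-to-text converter: the class name HtmlToText, the method toPlainText, and the shortened sample email are made up for illustration, and the rule for inserting line breaks (after BR and after any tag for which breaksFlow() is true) is an assumption about what reads well.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects all text in the document, not just one DIV.
public class HtmlToText {

    public static String toPlainText(String html) {
        final StringBuilder text = new StringBuilder();

        ParserCallback callback = new ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Append instead of overwriting, so earlier text is not lost.
                text.append(data);
            }

            @Override
            public void handleSimpleTag(Tag t, MutableAttributeSet a, int pos) {
                // <br> is reported as a simple tag; treat it as a line break.
                if (t.equals(Tag.BR)) {
                    text.append('\n');
                }
            }

            @Override
            public void handleEndTag(Tag t, int pos) {
                // Assumption: end a line after tags that break the flow (div, p, ...).
                if (t.breaksFlow()) {
                    text.append('\n');
                }
            }
        };

        try (StringReader reader = new StringReader(html)) {
            // true = ignore any charset declared in a <meta> tag, as in the question.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString().trim();
    }

    public static void main(String[] args) {
        // Shortened, made-up variant of the second sample email.
        String email = "<html><body>Got it.<div>On Nov 8 someone wrote:</div>"
                + "<blockquote>Forwarding test.<br>ABCD.</blockquote></body></html>";
        System.out.println(toPlainText(email));
    }
}
```

Because nothing here depends on a specific attribute, both sample emails should yield text. The trade-off is that quoted replies and signatures are included as well, so any filtering of those would need extra logic.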