How to fix orphaned punctuation in iText

Time: 2016-03-03 21:31:46

Tags: itext

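I saw How to fix iText's text wrapping for chinese characters, where another user ran into a problem similar to the one we are facing. The response from https://stackoverflow.com/users/1622493/bruno-lowagie indicates that DefaultSplitCharacter has taken Chinese characters into account since iText 5. We are using iText 5.5.6 and still see the problem.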

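As far as I can tell, DefaultSplitCharacter itself is working correctly; the problem seems to be that the ColumnText class allows lines to start with these punctuation marks.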

Here's a screenshot of the PdfChunks in the BidiLine class being used to render the text.

However, in the written result, lines 3 and 5 both start with a punctuation mark, as shown in the image of the PDF output.

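I could add line breaks in the appropriate places to make it look right, but that means my fix may stop working if the text is retranslated internally. Does anyone know how to make sure iText does not start a line with these punctuation characters?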

2 answers:

Answer 0: (score: 4)

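To break lines correctly in Asian languages, you need to write your own SplitCharacter implementation. A good reference on line breaking is Unicode® Standard Annex #14 - Unicode Line Breaking Algorithm. Another is https://msdn.microsoft.com/en-us/library/cc194864.aspx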

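I have implemented this for Japanese, so I am sharing the sample code I wrote for Japanese text mixed with English text. Using the references above, this code can easily be modified for Chinese.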
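As a rough illustration of what a Chinese adaptation might look like, here is a minimal sketch that is independent of iText. The class name `ChineseKinsokuSketch`, the methods `canBreakAfter` and `wrap`, and the two character sets are all hypothetical and deliberately incomplete; a real list should be taken from the references above. The greedy `wrap` method only stands in for a layout engine so the effect of the rule is visible.

```java
import java.util.ArrayList;
import java.util.List;

public class ChineseKinsokuSketch {

    // Assumed, non-exhaustive set of marks that should not start a line:
    // 、 。 , . : ; ? ! (fullwidth forms) and closing brackets/quotes 」 』 ) 】 》 〉 ” ’
    static final String NOT_BEGIN =
            "\u3001\u3002\uff0c\uff0e\uff1a\uff1b\uff1f\uff01"
          + "\u300d\u300f\uff09\u3011\u300b\u3009\u201d\u2019";

    // Assumed, non-exhaustive set of marks that should not end a line:
    // opening brackets/quotes 「 『 ( 】's partner 【 《 〈 “ ‘
    static final String NOT_END =
            "\u300c\u300e\uff08\u3010\u300a\u3008\u201c\u2018";

    /** Returns true if a line may end right after text.charAt(index). */
    public static boolean canBreakAfter(String text, int index) {
        if (NOT_END.indexOf(text.charAt(index)) >= 0) {
            return false; // an opening mark must stay with what follows it
        }
        int next = index + 1;
        if (next < text.length() && NOT_BEGIN.indexOf(text.charAt(next)) >= 0) {
            return false; // the next character must not begin a line
        }
        return true;
    }

    /** Greedy wrap with at most maxChars per line, backing up to an allowed break. */
    public static List<String> wrap(String text, int maxChars) {
        List<String> lines = new ArrayList<>();
        int start = 0;
        while (text.length() - start > maxChars) {
            int breakAt = start + maxChars - 1; // index of the last character on the line
            while (breakAt > start && !canBreakAfter(text, breakAt)) {
                breakAt--;
            }
            // If no allowed break was found, the line is cut after its first character
            // so that the loop always makes progress.
            lines.add(text.substring(start, breakAt + 1));
            start = breakAt + 1;
        }
        lines.add(text.substring(start));
        return lines;
    }

    public static void main(String[] args) {
        // With a limit of 5, a naive split would start the second line with the comma.
        for (String line : wrap("今天天气好,我们走吧。", 5)) {
            System.out.println(line);
        }
    }
}
```

The same two checks (next character not in the "cannot begin" set, current character not in the "cannot end" set) are what a SplitCharacter implementation for iText would perform inside isSplitCharacter.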

Here is a snippet showing the JapaneseSplitCharacter in use:

  // asianText is the String to render; asianFont is a Font that contains the required glyphs
  Chunk chunk = new Chunk(asianText, asianFont);
  chunk.setSplitCharacter(JapaneseSplitCharacter.SplitCharacter);
  Paragraph paragraph = new Paragraph(chunk);

Here is the code for JapaneseSplitCharacter:

import com.itextpdf.text.SplitCharacter;
import com.itextpdf.text.pdf.DefaultSplitCharacter;
import com.itextpdf.text.pdf.PdfChunk;

/**
 * <p/>
 * For basic latin characters spaces, periods, commas, etc. are split characters. For Japanese characters lines can break
 * anywhere, unless prohibited. This class uses logic for Japanese, non-starting and non-ending characters based on the
 * kinsoku rule and uses the DefaultSplitCharacter class for basic latin characters while writing free flowing text to a PDF.
 * <p/>
 */

public class JapaneseSplitCharacter implements SplitCharacter {

  // line of text cannot start or end with this character
  static final char u2060 = '\u2060';   //       - WORD JOINER

  // a line of text cannot start with any following characters in NOT_BEGIN_CHARACTERS[]
  static final char u30fb = '\u30fb';   //  ・   - KATAKANA MIDDLE DOT
  static final char u2022 = '\u2022';   //  •    - BLACK SMALL CIRCLE (BULLET)
  static final char uff65 = '\uff65';   //  ・    - HALFWIDTH KATAKANA MIDDLE DOT
  static final char u300d = '\u300d';   //  」   - RIGHT CORNER BRACKET
  static final char uff09 = '\uff09';   //  )   - FULLWIDTH RIGHT PARENTHESIS
  static final char u0021 = '\u0021';   //  !    - EXCLAMATION MARK
  static final char u0025 = '\u0025';   //  %    - PERCENT SIGN
  static final char u0029 = '\u0029';   //  )    - RIGHT PARENTHESIS
  static final char u002c = '\u002c';   //  ,    - COMMA
  static final char u002e = '\u002e';   //  .    - FULL STOP
  static final char u003f = '\u003f';   //  ?    - QUESTION MARK
  static final char u005d = '\u005d';   //  ]    - RIGHT SQUARE BRACKET
  static final char u007d = '\u007d';   //  }    - RIGHT CURLY BRACKET
  static final char uff61 = '\uff61';   //  。    - HALFWIDTH IDEOGRAPHIC FULL STOP
  static final char uff63 = '\uff63';   //  」    - HALFWIDTH RIGHT CORNER BRACKET
  static final char uff64 = '\uff64';   //  、    - HALFWIDTH IDEOGRAPHIC COMMA
  static final char uff67 = '\uff67';   //  ァ    - HALFWIDTH KATAKANA LETTER SMALL A
  static final char uff68 = '\uff68';   //  ィ    - HALFWIDTH KATAKANA LETTER SMALL I
  static final char uff69 = '\uff69';   //  ゥ    - HALFWIDTH KATAKANA LETTER SMALL U
  static final char uff6a = '\uff6a';   //  ェ    - HALFWIDTH KATAKANA LETTER SMALL E
  static final char uff6b = '\uff6b';   //  ォ    - HALFWIDTH KATAKANA LETTER SMALL O
  static final char uff6c = '\uff6c';   //  ャ    - HALFWIDTH KATAKANA LETTER SMALL YA
  static final char uff6d = '\uff6d';   //  ュ    - HALFWIDTH KATAKANA LETTER SMALL YU
  static final char uff6e = '\uff6e';   //  ョ    - HALFWIDTH KATAKANA LETTER SMALL YO
  static final char uff6f = '\uff6f';   //  ッ    - HALFWIDTH KATAKANA LETTER SMALL TU
  static final char uff70 = '\uff70';   //  ー    - HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK
  static final char uff9e = '\uff9e';   //  ゙    - HALFWIDTH KATAKANA VOICED SOUND MARK
  static final char uff9f = '\uff9f';   //  ゚    - HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
  static final char u3001 = '\u3001';   //  、    - IDEOGRAPHIC COMMA
  static final char u3002 = '\u3002';   //  。    - IDEOGRAPHIC FULL STOP
  static final char uff0c = '\uff0c';   //  ,    - FULLWIDTH COMMA
  static final char uff0e = '\uff0e';   //  .    - FULLWIDTH FULL STOP
  static final char uff1a = '\uff1a';   //  :    - FULLWIDTH COLON
  static final char uff1b = '\uff1b';   //  ;    - FULLWIDTH SEMICOLON
  static final char uff1f = '\uff1f';   //  ?    - FULLWIDTH QUESTION MARK
  static final char uff01 = '\uff01';   //  !    - FULLWIDTH EXCLAMATION MARK
  static final char u309b = '\u309b';   //  ゛    - KATAKANA-HIRAGANA VOICED SOUND MARK
  static final char u309c = '\u309c';   //  ゜    - KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
  static final char u30fd = '\u30fd';   //  ヽ    - KATAKANA ITERATION MARK
  static final char u30fe = '\u30fe';   //  ヾ    - KATAKANA VOICED ITERATION MARK
  static final char u309d = '\u309d';   //  ゝ    - HIRAGANA ITERATION MARK
  static final char u309e = '\u309e';   //  ゞ    - HIRAGANA VOICED ITERATION MARK
  static final char u3005 = '\u3005';   //  々    - IDEOGRAPHIC ITERATION MARK
  static final char u30fc = '\u30fc';   //  ー    - KATAKANA-HIRAGANA PROLONGED SOUND MARK
  static final char u2019 = '\u2019';   //  ’    - RIGHT SINGLE QUOTATION MARK
  static final char u201d = '\u201d';   //  ”    - RIGHT DOUBLE QUOTATION MARK
  static final char u3015 = '\u3015';   //  〕    - RIGHT TORTOISE SHELL BRACKET
  static final char uff3d = '\uff3d';   //  ]    - FULLWIDTH RIGHT SQUARE BRACKET
  static final char uff5d = '\uff5d';   //  }    - FULLWIDTH RIGHT CURLY BRACKET
  static final char u3009 = '\u3009';   //  〉    - RIGHT ANGLE BRACKET
  static final char u300b = '\u300b';   //  》    - RIGHT DOUBLE ANGLE BRACKET
  static final char u300f = '\u300f';   //  』    - RIGHT WHITE CORNER BRACKET
  static final char u3011 = '\u3011';   //  】    - RIGHT BLACK LENTICULAR BRACKET
  static final char u00b0 = '\u00b0';   //  °    - DEGREE SIGN
  static final char u2032 = '\u2032';   //  ′    - PRIME
  static final char u2033 = '\u2033';   //  ″    - DOUBLE PRIME
  static final char u2103 = '\u2103';   //  ℃    - DEGREE CELSIUS
  static final char u00a2 = '\u00a2';   //  ¢    - CENT SIGN
  static final char uff05 = '\uff05';   //  %    - FULLWIDTH PERCENT SIGN
  static final char u2030 = '\u2030';   //  ‰    - PER MILLE SIGN
  static final char u3041 = '\u3041';   //  ぁ    - HIRAGANA LETTER SMALL A
  static final char u3043 = '\u3043';   //  ぃ    - HIRAGANA LETTER SMALL I
  static final char u3045 = '\u3045';   //  ぅ    - HIRAGANA LETTER SMALL U
  static final char u3047 = '\u3047';   //  ぇ    - HIRAGANA LETTER SMALL E
  static final char u3049 = '\u3049';   //  ぉ    - HIRAGANA LETTER SMALL O
  static final char u3063 = '\u3063';   //  っ    - HIRAGANA LETTER SMALL TU
  static final char u3083 = '\u3083';   //  ゃ    - HIRAGANA LETTER SMALL YA
  static final char u3085 = '\u3085';   //  ゅ    - HIRAGANA LETTER SMALL YU
  static final char u3087 = '\u3087';   //  ょ    - HIRAGANA LETTER SMALL YO
  static final char u308e = '\u308e';   //  ゎ    - HIRAGANA LETTER SMALL WA
  static final char u30a1 = '\u30a1';   //  ァ    - KATAKANA LETTER SMALL A
  static final char u30a3 = '\u30a3';   //  ィ    - KATAKANA LETTER SMALL I
  static final char u30a5 = '\u30a5';   //  ゥ    - KATAKANA LETTER SMALL U
  static final char u30a7 = '\u30a7';   //  ェ    - KATAKANA LETTER SMALL E
  static final char u30a9 = '\u30a9';   //  ォ    - KATAKANA LETTER SMALL O
  static final char u30c3 = '\u30c3';   //  ッ    - KATAKANA LETTER SMALL TU
  static final char u30e3 = '\u30e3';   //  ャ    - KATAKANA LETTER SMALL YA
  static final char u30e5 = '\u30e5';   //  ュ    - KATAKANA LETTER SMALL YU
  static final char u30e7 = '\u30e7';   //  ョ    - KATAKANA LETTER SMALL YO
  static final char u30ee = '\u30ee';   //  ヮ    - KATAKANA LETTER SMALL WA
  static final char u30f5 = '\u30f5';   //  ヵ    - KATAKANA LETTER SMALL KA
  static final char u30f6 = '\u30f6';   //  ヶ    - KATAKANA LETTER SMALL KE

  static final char[] NOT_BEGIN_CHARACTERS = new char[]{u30fb, u2022, uff65, u300d, uff09, u0021, u0025, u0029, u002c,
          u002e, u003f, u005d, u007d, uff61, uff63, uff64, uff67, uff68, uff69, uff6a, uff6b, uff6c, uff6d, uff6e,
          uff6f, uff70, uff9e, uff9f, u3001, u3002, uff0c, uff0e, uff1a, uff1b, uff1f, uff01, u309b, u309c, u30fd,
          u30fe, u309d, u309e, u3005, u30fc, u2019, u201d, u3015, uff3d, uff5d, u3009, u300b, u300f, u3011, u00b0,
          u2032, u2033, u2103, u00a2, uff05, u2030, u3041, u3043, u3045, u3047, u3049, u3063, u3083, u3085, u3087,
          u308e, u30a1, u30a3, u30a5, u30a7, u30a9, u30c3, u30e3, u30e5, u30e7, u30ee, u30f5, u30f6, u2060};

  // a line of text cannot end with any following characters in NOT_ENDING_CHARACTERS[]
  static final char u0024 = '\u0024';   //  $   - DOLLAR SIGN
  static final char u0028 = '\u0028';   //  (   - LEFT PARENTHESIS
  static final char u005b = '\u005b';   //  [   - LEFT SQUARE BRACKET
  static final char u007b = '\u007b';   //  {   - LEFT CURLY BRACKET
  static final char u00a3 = '\u00a3';   //  £   - POUND SIGN
  static final char u00a5 = '\u00a5';   //  ¥   - YEN SIGN
  static final char u201c = '\u201c';   //  “   - LEFT DOUBLE QUOTATION MARK
  static final char u2018 = '\u2018';   //   ‘  - LEFT SINGLE QUOTATION MARK
  static final char u300a = '\u300a';   //  《  - LEFT DOUBLE ANGLE BRACKET
  static final char u3008 = '\u3008';   //  〈  - LEFT ANGLE BRACKET
  static final char u300c = '\u300c';   //  「  - LEFT CORNER BRACKET
  static final char u300e = '\u300e';   //  『  - LEFT WHITE CORNER BRACKET
  static final char u3010 = '\u3010';   //  【  - LEFT BLACK LENTICULAR BRACKET
  static final char u3014 = '\u3014';   //  〔  - LEFT TORTOISE SHELL BRACKET
  static final char uff62 = '\uff62';   //  「   - HALFWIDTH LEFT CORNER BRACKET
  static final char uff08 = '\uff08';   //  (  - FULLWIDTH LEFT PARENTHESIS
  static final char uff3b = '\uff3b';   //  [  - FULLWIDTH LEFT SQUARE BRACKET
  static final char uff5b = '\uff5b';   //  {  - FULLWIDTH LEFT CURLY BRACKET
  static final char uffe5 = '\uffe5';   //  ¥  - FULLWIDTH YEN SIGN
  static final char uff04 = '\uff04';   //  $  - FULLWIDTH DOLLAR SIGN

  static final char[] NOT_ENDING_CHARACTERS = new char[]{u0024, u0028, u005b, u007b, u00a3, u00a5, u201c, u2018, u3008,
          u300a, u300c, u300e, u3010, u3014, uff62, uff08, uff3b, uff5b, uffe5, uff04, u2060};

  /**
   * A shared instance of JapaneseSplitCharacter.
   */
  public static final JapaneseSplitCharacter SplitCharacter = new JapaneseSplitCharacter();

  /**
   * An instance of DefaultSplitCharacter used for Basic Latin characters.
   */
  private static final SplitCharacter defaultSplitCharacter = new DefaultSplitCharacter();

  public JapaneseSplitCharacter() { }

  /**
   * Custom SplitCharacter method that handles Japanese characters.
   * Returns <CODE>true</CODE> if the character can split a line. The splitting implementation
   * is free to look ahead or look behind characters to make a decision.
   *
   * @param start   the lower limit of <CODE>cc</CODE> inclusive
   * @param current the pointer to the character in <CODE>cc</CODE>
   * @param end     the upper limit of <CODE>cc</CODE> exclusive
   * @param cc      an array of characters at least <CODE>end</CODE> sized
   * @param ck      an array of <CODE>PdfChunk</CODE>. The main use is to be able to call
   *                {@link PdfChunk#getUnicodeEquivalent(int)}. It may be <CODE>null</CODE>
   *                or shorter than <CODE>end</CODE>. If <CODE>null</CODE> no conversion takes place.
   *                If shorter than <CODE>end</CODE> the last element is used
   * @return <CODE>true</CODE> if the character(s) can split a line
   */
  public boolean isSplitCharacter(int start, int current, int end, char[] cc, PdfChunk[] ck) {

    // Note: without this try/catch, iText fails silently when isSplitCharacter() throws,
    // and you have no idea there was a problem.
    try {

      char charCurrent = getCharacter(current, cc, ck);

      int next = current + 1;
      // end is the exclusive upper limit of the valid characters; cc itself may be longer
      if (next < end) {
        char charNext = getCharacter(next, cc, ck);
        for (char not_begin_character : NOT_BEGIN_CHARACTERS) {
          if (charNext == not_begin_character) {
            return false;
          }
        }
      }

      for (char not_ending_character : NOT_ENDING_CHARACTERS) {
        if (charCurrent == not_ending_character) {
          return false;
        }
      }

      boolean isBasicLatin = Character.UnicodeBlock.of(charCurrent) == Character.UnicodeBlock.BASIC_LATIN;
      if (isBasicLatin)
        return  defaultSplitCharacter.isSplitCharacter(start, current, end, cc, ck);

      return true;

    } catch (Exception ex) {
      ex.printStackTrace();
    }

    return true;
  }

  /**
   * Returns a character in the array (Note: modified from the iText default version with an additional null
   * check of '|| ck[Math.min(position, ck.length - 1)] == null').
   *
   * @param position position in the array
   * @param ck       chunk array
   * @param cc       the character array that has to be checked
   * @return the character
   */
  protected char getCharacter(int position, char[] cc, PdfChunk[] ck) {
    if (ck == null || ck[Math.min(position, ck.length - 1)] == null) {
      return cc[position];
    }
    return (char) ck[Math.min(position, ck.length - 1)].getUnicodeEquivalent(cc[position]);
  }

}

Hope this helps.

Answer 1: (score: 2)

I am using iTextSharp. I wrote an ISplitCharacter based on k.f.'s sample.

using System;
using System.Globalization;
using iTextSharp.text;
using iTextSharp.text.pdf;

public class CJKSplitCharacter : ISplitCharacter
{
    public static ISplitCharacter Default = new CJKSplitCharacter();
    private static ISplitCharacter defaultSplit = new DefaultSplitCharacter();

    public bool IsSplitCharacter(int start, int current, int end, char[] cc, PdfChunk[] ck)
    {
        char charCurrent = GetChar(current, cc, ck);
        int next = current + 1;
        if (next < end) // end is the exclusive upper limit of the valid characters in cc
        {
            char charNext = GetChar(next, cc, ck);
            if (IsCloseChar(charNext))
            {
                return false;
            }
        }
        if (IsOpenChar(charCurrent))
        {
            return false;
        }

        // default:
        // split every CJK character

        if (Char.GetUnicodeCategory(charCurrent) == UnicodeCategory.OtherLetter) // CJK Letters
        {
            return true;
        }
        else
        {
            return defaultSplit.IsSplitCharacter(start, current, end, cc, ck);
        }
    }
    private char GetChar(int position, char[] cc, PdfChunk[] ck)
    {
        char c;
        if (ck == null || ck[Math.Min(position, ck.Length - 1)] == null)
        {
            c = cc[position];
        }
        else
        {
            c = (char)ck[Math.Min(position, ck.Length - 1)].GetUnicodeEquivalent(cc[position]);
        }
        return c;
    }

    private bool IsCloseChar(char c)
    {
        UnicodeCategory cat = Char.GetUnicodeCategory(c);
        return (cat == UnicodeCategory.ClosePunctuation         //right bracket/brace, eg: )]
            || cat == UnicodeCategory.FinalQuotePunctuation     //right quote, eg: ”
            || cat == UnicodeCategory.OtherPunctuation          //other punctuation, eg: ,。
            );
    }
    private bool IsOpenChar(char c)
    {
        UnicodeCategory cat = Char.GetUnicodeCategory(c);
        return (cat == UnicodeCategory.OpenPunctuation          //left bracket/brace, eg: ([
            || cat == UnicodeCategory.InitialQuotePunctuation   //left quote, eg: “
            );
    }
}
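For readers on the Java side of iText, the same category-based idea can be sketched with java.lang.Character, whose getType constants correspond to the .NET UnicodeCategory values used above. This is only a sketch under assumptions: the class name `CategoryBreakSketch` and its methods are hypothetical, it does not touch iText, and the whitespace-or-hyphen fallback is a simplified stand-in for DefaultSplitCharacter rather than its actual logic.

```java
public class CategoryBreakSketch {

    /** Closing brackets, closing quotes and other punctuation should not start a line. */
    public static boolean isCloseChar(char c) {
        int type = Character.getType(c);
        return type == Character.END_PUNCTUATION            // Pe: right bracket/brace
            || type == Character.FINAL_QUOTE_PUNCTUATION    // Pf: right quote
            || type == Character.OTHER_PUNCTUATION;         // Po: comma, full stop, etc.
    }

    /** Opening brackets and opening quotes should not end a line. */
    public static boolean isOpenChar(char c) {
        int type = Character.getType(c);
        return type == Character.START_PUNCTUATION           // Ps: left bracket/brace
            || type == Character.INITIAL_QUOTE_PUNCTUATION;  // Pi: left quote
    }

    /** CJK ideographs and kana are in the "other letter" (Lo) category. */
    public static boolean isCjkLetter(char c) {
        return Character.getType(c) == Character.OTHER_LETTER;
    }

    /**
     * Returns true if a line may end after cc[current].
     * end is the exclusive upper limit of the valid characters in cc.
     */
    public static boolean canBreakAfter(char[] cc, int current, int end) {
        int next = current + 1;
        if (next < end && isCloseChar(cc[next])) {
            return false;
        }
        if (isOpenChar(cc[current])) {
            return false;
        }
        if (isCjkLetter(cc[current])) {
            return true; // by default, allow a break after every CJK letter
        }
        // Simplified stand-in for DefaultSplitCharacter (an assumption, not iText's code):
        // allow a break after whitespace or a hyphen.
        return cc[current] <= ' ' || cc[current] == '-';
    }

    public static void main(String[] args) {
        char[] cc = "中文。".toCharArray();
        for (int i = 0; i < cc.length; i++) {
            System.out.println(cc[i] + " -> " + canBreakAfter(cc, i, cc.length));
        }
    }
}
```

One design trade-off compared with the explicit lists in the first answer: Unicode categories are compact, but they do not cover everything in those lists. Currency signs such as $ are in the currency symbol category and small kana such as っ are ordinary letters, so neither is restricted by this rule.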