Tokenize a Devanagari word into letters

Time: 2014-07-31 10:59:43

Tags: java split word hindi

I have something like this:
a = "बिक्रम मेरो नाम हो"

I want to achieve something like this in Java:

a[0] = बि 
a[1] = क्र 
a[2] = म

6 answers:

Answer 0: (score: 1)

My code is not optimized at all, sorry, but it works!

Just change the path to the file containing your Devanagari sentences and it will work.

// Requires: import java.io.*; import java.util.Arrays;
public static void main(String[] args) throws IOException
{


    BufferedReader br = new BufferedReader(new FileReader("/home/ubuntu/Documents/trainforjava.txt"));   //PLEASE ENTER PATH HERE

     String[] devFull = new String[]{

             "अ","आ", "इ", "ई", "उ", "ऊ", "ऋ"
             , "ऌ" ,"ऍ",  "ए", "ऐ", "ऑ", "ओ", "औ",


             "क", "ख", "ग", "घ" ,"ङ",
             "च", "छ", "ज", "झ", "ञ",
             "ट", "ठ", "ड", "ढ", "ण",
             "त", "थ", "द", "ध", "न",
             "प", "फ", "ब", "भ", "म",
             "य", "र", "ल", "ळ",
             "व", "श" ,"ष","स" ,"ह"


        };

     String[] uniDev = new String[]
             {
                     "905","906","907","908","909","90a","90b",
                     "90c","90d","90f","910","911","913","914",
                     "915","916","917","918","919",
                     "91a","91b","91c","91d","91e",
                     "91f","920","921","922","923",
                     "924","925","926","927","928",
                     "92a","92b","92c","92d","92e",
                     "92f","930","932","933",
                     "935","936","937","938","939"
             };






     String[] devHalf = new String[]
             {
                     "$़","ऽ","$ा","$ि" ,
                     "$ी", "$ ु","$ू","$ृ","$ॄ","$ॅ",
                     "$े","$ै","$ॉ",
                     "$ो","$ौ"
             };


     String[] gujHalf = new String[]
             {

                     "$઼","ઽ","$ા","$િ"  ,
                "$ી","$ુ","$ૂ","$ૃ","$ૄ","$ૅ",
                "$ે","$ૈ","$ૉ",
                "$ો","$ૌ"


             };


    try
    {
         StringBuilder sb = new StringBuilder();
            String line;  // assigned in the loop condition, so the first line is not skipped

            while( (line = br.readLine() ) != null)
            {
                line=line.replaceAll(" ", "");  //remove white spaces if any 
                System.out.println();
                //System.out.println(line);

                 int strLength = line.length();

                // String a = "बिक्रम मेरो नाम हो";
                 int strLen = line.length();
                 char array[] = new char[strLen];
                 String strArray1[] = new String[strLen];
                 int mark[] = new int[strLen+1];
                 String unis[]=new String[strLen];
                 int cnt=0;
                 String newCharD[]=new String [strLen];
                 String newCharG[]=new String [strLen];
                 String tempD=null;
                 String tempG=null;
                 String arr = null;
                 String next =null;
                 String temp=null;
                 String uniNext=null;
                 int hold=0;
                 int j=0;

                 for (int i=0 ; i< strLen ; i++)
                 {
                     j=i+1;
                     array[i] = line.charAt(i);

                     strArray1[i] = Character.toString(line.charAt(i));

                     if(i<(strLen-1))
                     {
                         char nbit = line.charAt(j);
                         next=Character.toString(line.charAt(j));
                         uniNext=Integer.toHexString(nbit);
                         //System.out.print("\nUninext:\t"+uniNext);
                     }
                     unis[i]=Integer.toHexString(array[i]); 
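                     mark[strLen]=1;   // sentinel so the last letter gets flushed
                     // A letter starts at a full character that does not follow a virama (U+094D)
                     if((Arrays.asList(devFull).contains(Character.toString(array[i]))) && (i==0 || array[i-1]!='\u094D')  )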
                     {
                         mark[i]=1;
                     }
                     else
                     {
                         mark[i]=0;
                     }


                     //
                 //System.out.println();
                     //System.out.println ("Index = " + i + "* Char = " +array[i] + "** String =" +strArray1[i]+ "Unicode="+unis[i]+"Mark="+mark[i]);
                     //System.out.print(unis[i].toString());



                 }

                 int start=0;
                 start=0;
                 for(int l1=0;l1<=strLen;l1++)
                 {
                     //start=0;

                     if(l1==0)
                     {
                         temp=Character.toString(array[l1]);

                     }

                     else
                     {
                         if(mark[l1]==0)
                         {
                             temp=temp+Character.toString(array[l1]);
                         }
                         else
                         {
                             System.out.print(" "+temp);
                             newCharD[start]=temp;
                             start++;
                             temp=null;
                             if(l1!=strLen)
                             {
                                 temp=Character.toString(array[l1]);     
                             }

                         }
                     }
                 }


                /* for(int s=0;s<start;s++)
                 {
                     System.out.print(" "+newCharD[s]);      
                 }*/


                 for(int s=0;s<start;s++)
                 {

                 }


            }
    }
     finally {
            br.close();
        }
    //PrintStream out = new PrintStream(new //FileOutputStream("/home/ubuntu/Documents/trainforjavaoutput.txt"));
    //System.setOut(out);
}
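
The marking idea in this answer can be condensed into a short sketch. It assumes that a new letter starts at every character in the main Devanagari letter ranges (U+0905 to U+0939, U+0958 to U+095F) unless the previous character is a virama (U+094D), and that the input contains only Devanagari text and whitespace. The class and method names are hypothetical, and this is not a full implementation of Unicode text segmentation.

```java
import java.util.ArrayList;
import java.util.List;

public class DevanagariLetters {
    private static final char VIRAMA = '\u094D';

    // True for independent vowels and consonants (assumed ranges, see lead-in).
    static boolean isBase(char c) {
        return (c >= '\u0905' && c <= '\u0939') || (c >= '\u0958' && c <= '\u095F');
    }

    // Groups each base character with the signs and virama-joined consonants that follow it.
    static List<String> split(String text) {
        List<String> letters = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean afterVirama = i > 0 && text.charAt(i - 1) == VIRAMA;
            boolean boundary = Character.isWhitespace(c) || (isBase(c) && !afterVirama);
            if (boundary && current.length() > 0) {
                letters.add(current.toString());
                current.setLength(0);
            }
            if (!Character.isWhitespace(c)) {
                current.append(c); // whitespace ends a letter but is not kept
            }
        }
        if (current.length() > 0) {
            letters.add(current.toString());
        }
        return letters;
    }

    public static void main(String[] args) {
        // prints [बि, क्र, म, मे, रो, ना, म, हो]
        System.out.println(split("बिक्रम मेरो नाम हो"));
    }
}
```

Checking the previous character for a virama, rather than the next one, is what keeps a conjunct such as क्र together in this sketch.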

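Answer 1: (score: 0)

Java stores strings internally as UTF-16, and every code point in the Devanagari block (U+0900 to U+097F) fits in a single 2-byte char, so you can safely access these characters individually.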
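
A small sketch of what per-char access gives for this text, using only standard `String` methods (the class and method names are hypothetical). It shows that no code point is cut in half, but that one visible letter such as बि still spans two chars.

```java
public class CodeUnitDemo {
    // Returns the UTF-16 code units of s as lowercase hex strings.
    static String[] hexUnits(String s) {
        String[] out = new String[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = Integer.toHexString(s.charAt(i));
        }
        return out;
    }

    public static void main(String[] args) {
        String letter = "बि"; // one visible letter: consonant plus vowel sign
        System.out.println(letter.length());                           // prints 2
        System.out.println(letter.codePointCount(0, letter.length())); // prints 2
        System.out.println(String.join(" ", hexUnits(letter)));        // prints 92c 93f
    }
}
```

So indexing by char is safe at the code point level, but grouping a consonant with its vowel sign or virama needs an extra step on top of it.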

Answer 2: (score: 0)

Try this:

             String a = "बिक्रम मेरो नाम हो";
             int strLen = a.length();
             char array[] = new char[strLen];
             String strArray1[] = new String[strLen];
             for (int i=0 ; i< strLen ; i++)
             {
                 array[i] = a.charAt(i);
                 strArray1[i] = Character.toString(a.charAt(i));
                 System.out.println ("Index = " + i + "* Char = " +array[i] + "** String =" +strArray1[i] );

             }

Output:

Index = 0* Char = ब** String =ब
Index = 1* Char = ि** String =ि
Index = 2* Char = क** String =क
Index = 3* Char = ्** String =्
Index = 4* Char = र** String =र
Index = 5* Char = म** String =म
Index = 6* Char =  ** String = 
Index = 7* Char = म** String =म
Index = 8* Char = े** String =े
Index = 9* Char = र** String =र
Index = 10* Char = ो** String =ो
Index = 11* Char =  ** String = 
Index = 12* Char = न** String =न
Index = 13* Char = ा** String =ा
Index = 14* Char = म** String =म
Index = 15* Char =  ** String = 
Index = 16* Char = ह** String =ह
Index = 17* Char = ो** String =ो

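Note:

To get Eclipse to let you save a Java program that contains non-ASCII characters (the Hindi alphabet), do the following:

Go to:
"Window > Preferences > General > Content Types > Text > {Choose file type} {Selected file type} > Default encoding > UTF-8", then click Update.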

Answer 3: (score: 0)

Have you tried icu4j?

A BreakIterator character instance can split a string into characters.

Answer 4: (score: 0)

Try this for Hindi:-

    import java.io.*;
    import java.text.BreakIterator;
    import java.util.Locale;
    
    public class Test {
        public static void main(String[] args) throws IOException
        {
    
            String text = "बिक्रम मेरो नाम हो";
            Locale hindi = new Locale("hi", "IN");
            BreakIterator breaker = BreakIterator.getCharacterInstance(hindi);
            breaker.setText(text);
            int start = breaker.first();
            for (int end = breaker.next();
                 end != BreakIterator.DONE;
                 start = end, end = breaker.next()) {
                System.out.println(text.substring(start,end));
            }
        }
    }

Output:-

बि
क्र
म
 
मे
रो
 
ना
म
 
हो

BreakIterator Java documentation: https://docs.oracle.com/javase/tutorial/i18n/text/about.html
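
To get the pieces as a list instead of printing them, the same loop can be wrapped in a helper. This is a minimal sketch with hypothetical class and method names. It drops whitespace-only pieces, which the code above does not do. How conjuncts such as क्र are grouped depends on the break rules shipped with the JDK in use, so that grouping is worth verifying on your Java version.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class LetterSplitter {
    // Splits text at the character boundaries reported by java.text.BreakIterator
    // and skips pieces that consist only of whitespace.
    static List<String> letters(String text, Locale locale) {
        BreakIterator it = BreakIterator.getCharacterInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (!piece.trim().isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Locale hindi = Locale.forLanguageTag("hi-IN");
        System.out.println(letters("बिक्रम मेरो नाम हो", hindi));
    }
}
```

Returning a `List<String>` rather than a fixed-size array avoids having to guess the number of letters up front.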

Answer 5: (score: -1)

To split the string by letters rather than by characters, as dvasanth suggested, you can try the following:

     String x = "बिक्रम मेरो नाम हो";
     x = x.replaceAll(" ", ""); // Remove all spaces
     int strLength = x.length();
     String[] strArray1 = new String[strLength];             // one entry per char
     String[] letterArray = new String[(strLength + 1) / 2]; // one entry per pair, rounded up
     String combined = "";
     for (int i = 0, j = 0; i < strLength; i = i + 2, j++)
     {
         strArray1[i] = Character.toString(x.charAt(i));
         if (i + 1 < strLength)
         {
             strArray1[i + 1] = Character.toString(x.charAt(i + 1));
             combined = strArray1[i] + strArray1[i + 1]; // This line provides the letters.
                        // Assumption is that each letter is 2 unicode characters long.
         }
         else
         {
             combined = strArray1[i]; // odd length: the last char stands alone
         }
         letterArray[j] = combined;
         System.out.println("Split string by letters is : " + combined);
         System.out.println("Split string by letters in array is : " + letterArray[j]);
     }

The output is:

Split string by letters is : बि
Split string by letters is : क्
Split string by letters is : रम
Split string by letters is : मे
Split string by letters is : रो
Split string by letters is : ना
Split string by letters is : मह
Split string by letters is : ो

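Note:

To get Eclipse to let you save a Java program that contains non-ASCII characters (the Hindi alphabet), do the following:

Go to:
"Window > Preferences > General > Content Types > Text > {Choose file type} {Selected file type} > Default encoding > UTF-8", then click Update.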