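I have Arabic text with diacritics stored in a database. When I type Arabic to search for a string, I type it without diacritics, so it certainly does not match the string in the database. The search works fine on text that has no diacritics. Is there a way to run it on text with diacritics?
Answer 0 (score: 9)
Is there a way to run it on text with diacritics
Unfortunately not. As MIE said:
Arabic diacritics are characters
So as far as I know, it is not possible.
MIE's answer is hard to implement, and it cannot be kept up to date at all if you change anything in the database.
You could take a look at the Apache Lucene search software library. I'm not sure, but it looks like it could solve your problem.
Or you need to remove all the diacritics from your database; then, by simply using a small Arabic normalizer like this one, you can query with or without diacritics: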
/**
* ArabicNormalizer class
* @author Ibrabel
*/
public final class ArabicNormalizer {
private String input;
private final String output;
/**
* ArabicNormalizer constructor
* @param input String
*/
public ArabicNormalizer(String input){
this.input=input;
this.output=normalize();
}
/**
* normalize Method
* @return String
*/
private String normalize(){
//Remove honorific sign
input=input.replaceAll("\u0610", "");//ARABIC SIGN SALLALLAHOU ALAYHE WA SALLAM
input=input.replaceAll("\u0611", "");//ARABIC SIGN ALAYHE ASSALLAM
input=input.replaceAll("\u0612", "");//ARABIC SIGN RAHMATULLAH ALAYHE
input=input.replaceAll("\u0613", "");//ARABIC SIGN RADI ALLAHOU ANHU
input=input.replaceAll("\u0614", "");//ARABIC SIGN TAKHALLUS
//Remove Koranic annotation
input=input.replaceAll("\u0615", "");//ARABIC SMALL HIGH TAH
input=input.replaceAll("\u0616", "");//ARABIC SMALL HIGH LIGATURE ALEF WITH LAM WITH YEH
input=input.replaceAll("\u0617", "");//ARABIC SMALL HIGH ZAIN
input=input.replaceAll("\u0618", "");//ARABIC SMALL FATHA
input=input.replaceAll("\u0619", "");//ARABIC SMALL DAMMA
input=input.replaceAll("\u061A", "");//ARABIC SMALL KASRA
input=input.replaceAll("\u06D6", "");//ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA
input=input.replaceAll("\u06D7", "");//ARABIC SMALL HIGH LIGATURE QAF WITH LAM WITH ALEF MAKSURA
input=input.replaceAll("\u06D8", "");//ARABIC SMALL HIGH MEEM INITIAL FORM
input=input.replaceAll("\u06D9", "");//ARABIC SMALL HIGH LAM ALEF
input=input.replaceAll("\u06DA", "");//ARABIC SMALL HIGH JEEM
input=input.replaceAll("\u06DB", "");//ARABIC SMALL HIGH THREE DOTS
input=input.replaceAll("\u06DC", "");//ARABIC SMALL HIGH SEEN
input=input.replaceAll("\u06DD", "");//ARABIC END OF AYAH
input=input.replaceAll("\u06DE", "");//ARABIC START OF RUB EL HIZB
input=input.replaceAll("\u06DF", "");//ARABIC SMALL HIGH ROUNDED ZERO
input=input.replaceAll("\u06E0", "");//ARABIC SMALL HIGH UPRIGHT RECTANGULAR ZERO
input=input.replaceAll("\u06E1", "");//ARABIC SMALL HIGH DOTLESS HEAD OF KHAH
input=input.replaceAll("\u06E2", "");//ARABIC SMALL HIGH MEEM ISOLATED FORM
input=input.replaceAll("\u06E3", "");//ARABIC SMALL LOW SEEN
input=input.replaceAll("\u06E4", "");//ARABIC SMALL HIGH MADDA
input=input.replaceAll("\u06E5", "");//ARABIC SMALL WAW
input=input.replaceAll("\u06E6", "");//ARABIC SMALL YEH
input=input.replaceAll("\u06E7", "");//ARABIC SMALL HIGH YEH
input=input.replaceAll("\u06E8", "");//ARABIC SMALL HIGH NOON
input=input.replaceAll("\u06E9", "");//ARABIC PLACE OF SAJDAH
input=input.replaceAll("\u06EA", "");//ARABIC EMPTY CENTRE LOW STOP
input=input.replaceAll("\u06EB", "");//ARABIC EMPTY CENTRE HIGH STOP
input=input.replaceAll("\u06EC", "");//ARABIC ROUNDED HIGH STOP WITH FILLED CENTRE
input=input.replaceAll("\u06ED", "");//ARABIC SMALL LOW MEEM
//Remove tatweel
input=input.replaceAll("\u0640", "");
//Remove tashkeel
input=input.replaceAll("\u064B", "");//ARABIC FATHATAN
input=input.replaceAll("\u064C", "");//ARABIC DAMMATAN
input=input.replaceAll("\u064D", "");//ARABIC KASRATAN
input=input.replaceAll("\u064E", "");//ARABIC FATHA
input=input.replaceAll("\u064F", "");//ARABIC DAMMA
input=input.replaceAll("\u0650", "");//ARABIC KASRA
input=input.replaceAll("\u0651", "");//ARABIC SHADDA
input=input.replaceAll("\u0652", "");//ARABIC SUKUN
input=input.replaceAll("\u0653", "");//ARABIC MADDAH ABOVE
input=input.replaceAll("\u0654", "");//ARABIC HAMZA ABOVE
input=input.replaceAll("\u0655", "");//ARABIC HAMZA BELOW
input=input.replaceAll("\u0656", "");//ARABIC SUBSCRIPT ALEF
input=input.replaceAll("\u0657", "");//ARABIC INVERTED DAMMA
input=input.replaceAll("\u0658", "");//ARABIC MARK NOON GHUNNA
input=input.replaceAll("\u0659", "");//ARABIC ZWARAKAY
input=input.replaceAll("\u065A", "");//ARABIC VOWEL SIGN SMALL V ABOVE
input=input.replaceAll("\u065B", "");//ARABIC VOWEL SIGN INVERTED SMALL V ABOVE
input=input.replaceAll("\u065C", "");//ARABIC VOWEL SIGN DOT BELOW
input=input.replaceAll("\u065D", "");//ARABIC REVERSED DAMMA
input=input.replaceAll("\u065E", "");//ARABIC FATHA WITH TWO DOTS
input=input.replaceAll("\u065F", "");//ARABIC WAVY HAMZA BELOW
input=input.replaceAll("\u0670", "");//ARABIC LETTER SUPERSCRIPT ALEF
//Replace Waw Hamza Above by Waw
input=input.replaceAll("\u0624", "\u0648");
//Replace Ta Marbuta by Ha
input=input.replaceAll("\u0629", "\u0647");
//Replace Ya
// and Ya Hamza Above by Alif Maksura
input=input.replaceAll("\u064A", "\u0649");
input=input.replaceAll("\u0626", "\u0649");
// Replace Alifs with Hamza Above/Below
// and with Madda Above by Alif
input=input.replaceAll("\u0622", "\u0627");
input=input.replaceAll("\u0623", "\u0627");
input=input.replaceAll("\u0625", "\u0627");
return input;
}
/**
* @return the output
*/
public String getOutput() {
return output;
}
public static void main(String[] args) {
String test = "كَلَّا لَا تُطِعْهُ وَاسْجُدْ وَاقْتَرِبْ ۩";
System.out.println("Before: "+test);
test=new ArabicNormalizer(test).getOutput();
System.out.println("After: "+test);
}
}
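The same idea, normalizing both the stored text and the query before comparing them, can be sketched more compactly. This is only a minimal sketch and not the answerer's code: the class name `SearchKey` and its methods are hypothetical, and it strips just the tashkeel range U+064B to U+065F, the superscript alef U+0670 and the tatweel U+0640, not every sign handled by the class above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class SearchKey {
    // Assumption: only tashkeel (U+064B..U+065F), superscript alef (U+0670)
    // and tatweel (U+0640) are stripped; extend the class as needed.
    private static final Pattern MARKS =
            Pattern.compile("[\\u064B-\\u065F\\u0670\\u0640]");

    /** Returns the text with the marks above removed. */
    static String of(String text) {
        return MARKS.matcher(text).replaceAll("");
    }

    /** Returns the stored rows whose key contains the key of the query. */
    static List<String> search(List<String> rows, String query) {
        String queryKey = of(query);
        List<String> hits = new ArrayList<>();
        for (String row : rows) {
            if (of(row).contains(queryKey)) {
                hits.add(row); // keep the original, fully vocalized text
            }
        }
        return hits;
    }
}
```

In a real database the key would more likely live in its own column, computed when a row is written, so that the original vocalized text can stay untouched for display.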
Answer 1 (score: 4)
I found a much better way to do this. All credit goes to joop:
import java.text.Normalizer;
import java.text.Normalizer.Form;
/**
*
* @author Ibbtek <http://ibbtek.altervista.org/>
*/
public class ArabicDiacritics {
private String input;
private final String output;
/**
* ArabicDiacritics constructor
* @param input String
*/
public ArabicDiacritics(String input){
this.input=input;
this.output=normalize();
}
/**
* normalize Method
* @return String
*/
private String normalize(){
input = Normalizer.normalize(input, Form.NFKD)
.replaceAll("\\p{M}", "");
return input;
}
/**
* @return the output
*/
public String getOutput() {
return output;
}
public static void main(String[] args) {
String test = "كَلَّا لَا تُطِعْهُ وَاسْجُدْ وَاقْتَرِبْ ۩";
System.out.println("Before: "+test);
test=new ArabicDiacritics(test).getOutput();
System.out.println("After: "+test);
}
}
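For searching, the same normalization has to be applied to both the stored text and the query. Here is a minimal sketch of that, with hypothetical names (`DiacriticFolding`, `fold`, `containsIgnoringMarks`) that are not part of the answer above:

```java
import java.text.Normalizer;

final class DiacriticFolding {
    /** Decomposes the text (NFKD) and drops every combining mark. */
    static String fold(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKD)
                .replaceAll("\\p{M}", "");
    }

    /** Diacritic-insensitive containment check: both sides are folded first. */
    static boolean containsIgnoringMarks(String haystack, String needle) {
        return fold(haystack).contains(fold(needle));
    }
}
```

One thing to be aware of: because NFKD decomposes precomposed letters, alef with hamza above (U+0623) is split into alef (U+0627) plus a combining hamza, so after the marks are removed it folds to a plain alef. That may or may not be what you want for your search.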
Answer 2 (score: 1)
Arabic diacritics are characters, so you can use a pattern like this:
SELECT * FROM table WHERE name LIKE 'a[cd]*b[cd]*'
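This will find 'ab' with any number of c or d between them.
You can do this by adding all the Arabic diacritics, between square brackets, after each letter.
You can find all of them, with their Unicode code points, here.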
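Building such a pattern by hand for every query is tedious, so it can be generated. The following is a rough sketch in Java, under some assumptions: the class `DiacriticPattern` is hypothetical, the bracket class covers only the eight basic tashkeel marks U+064B to U+0652, and the pattern is tested here with `java.util.regex`. Whether your database accepts the same pattern depends on its regex support.

```java
import java.util.regex.Pattern;

final class DiacriticPattern {
    // Assumption: the eight basic tashkeel marks, U+064B..U+0652.
    private static final String MARKS = "[\\u064B-\\u0652]*";

    /** Inserts an optional run of diacritics after each letter of the query. */
    static String build(String query) {
        StringBuilder sb = new StringBuilder();
        query.codePoints().forEach(cp -> {
            // Quote each letter so regex metacharacters in the query stay literal.
            sb.append(Pattern.quote(new String(Character.toChars(cp))));
            sb.append(MARKS);
        });
        return sb.toString();
    }

    /** Returns true if the vocalized text contains the unvocalized query. */
    static boolean matches(String text, String query) {
        return Pattern.compile(build(query)).matcher(text).find();
    }
}
```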
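Answer 3 (score: 0)
Please have a look at the class below, which I created. It is for Android and returns a SpannableString. It is very basic and does not worry about memory consumption; you can optimize it yourselves.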
http://freshinfresh.com/sample/ABHArabicDiacritics.java
If you want to check whether an Arabic string contains some text, ignoring the diacritics (harakat):
ABHArabicDiacritics objSearchd = new ABHArabicDiacritics();
objSearchd.getDiacriticinsensitive("وَ اَشْهَدُ اَنْ لا اِلهَ اِلاَّ اللَّهُ").contains("اشهد");
If you want the searched part returned highlighted or colored red within the string, use the following code:
ABHArabicDiacritics objSearch = new ABHArabicDiacritics("وَ اَشْهَدُ اَنْ لا اِلهَ اِلاَّ اللَّهُ", "اشهد");
SpannableString spoutput=objSearch.getSearchHighlightedSpan();
textView.setText(spoutput);
To get the start and end positions of the searched text, use the following methods:
/** to check whether the text contains the search term */
objSearch.isContain();
objSearch.getSearchHighlightedSpan();
objSearch.getSearchTextStartPosition();
objSearch.getSearchTextEndPosition();
Please copy the shared Java class and enjoy.
If you ask for it, I will spend more time on adding more features.
Thanks
Answer 4 (score: 0)
String targetWord = "الذين";
String text = "صِرَاطَ الَّذِينَ أَنْعَمْتَ عَلَيْهِمْ غَيْرِ الْمَغْضُوبِ عَلَيْهِمْ وَلَا الضَّالِّين";
char[] input = targetWord.toCharArray();
StringBuilder regex = new StringBuilder("");
for(char c : input) {
regex.append(c);
regex.append("(\\p{M})*");
}
Pattern searchRegEx = Pattern.compile(regex.toString());
Matcher m = searchRegEx.matcher(text);
if(m.find()){
int i = m.start();
System.out.println("m.group(0):: " + i + " : " + m.group(0));
}
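Answer 5 (score: 0)
Hopefully I'm not too late. My problem was a bit different from the OP's: I wanted to search Arabic text that has diacritics and mark the matched text with a color, so I needed to keep the indices of the matched text.
The problem is that normalizing the text by removing the diacritics shortens it, so the indices of the searched text come out different.
So, by using a regex and a SpannableString: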
/*
* input: the input text, containing Arabic diacritics or letters that you want to ignore while searching
* searchedWord: the word/text that you want to search for in the input text
* color: the foreground color applied to the matches in the returned SpannableString
* */
public static Spannable searchArabicWithIgnoredDiacriticsOrLetters(String input, String searchedWord, int color) {
String normalized = replaceLetters(input); // 1:1 letter replacements, so indices are preserved
Spannable output = new SpannableString(normalized);
StringBuilder sb = new StringBuilder();
for (char ch : replaceLetters(searchedWord).toCharArray()) {
sb.append(Pattern.quote(String.valueOf(ch))); // quote in case the word contains regex metacharacters
sb.append("[\\u0655\\u0654\\u0670\\u065F\\u065E\\u065D\\u065C\\u065B\\u065A\\u0659\\u0658\\u0657\\u0656\\u06EC\\u06EB\\u06EA\\u06E4\\u061A\\u0619\\u0618\\u0617\\u0616\\u0615\\u064B\\u064C\\u064D\\u064E\\u064F\\u0650\\u0651\\u0652\\u0653\\u06DA\\u06D6\\u06D7\\u06D8\\u06D9\\u06DB\\u06DC\\u06DF\\u06E0\\u06E1\\u06E2\\u06E3\\u06E5\\u06E6\\u06E7\\u06E8\\u06EB\\u06EC\\u06ED]*");
}
Pattern pattern = Pattern.compile(sb.toString()); // compile the regex built above
Matcher matcher = pattern.matcher(normalized); // match against the same normalized text that is displayed
while (matcher.find())
output.setSpan(new ForegroundColorSpan(color),
matcher.start(), matcher.end(), Spannable.SPAN_EXCLUSIVE_EXCLUSIVE);
return output;
}
public static String replaceLetters(String input) {
String output;
output = input.replaceAll("أ", "ا");
output = output.replaceAll("إ", "ا");
output = output.replaceAll("ى", "ي");
output = output.replaceAll("ة", "ه");
output = output.replaceAll("آ", "ا");
output = output.replaceAll("ٱ", "ا");
return output;
}
An alternative way to write replaceLetters():
public static String replaceLetters(String input) {
String output;
output = input.replaceAll("\\u0623", String.valueOf((char) Integer.parseInt("0627", 16))); // replace أ with ا
output = output.replaceAll("\\u0625", String.valueOf((char) Integer.parseInt("0627", 16))); // replace إ with ا
output = output.replaceAll("\\u0649", String.valueOf((char) Integer.parseInt("064A", 16))); // replace ى with ي
output = output.replaceAll("\\u0629", String.valueOf((char) Integer.parseInt("0647", 16))); // replace ة with ه
output = output.replaceAll("\\u0622", String.valueOf((char) Integer.parseInt("0627", 16))); // replace آ with ا
output = output.replaceAll("\\u0671", String.valueOf((char) Integer.parseInt("0627", 16))); // replace ٱ with ا
return output;
}
Note: you can refer to the accepted answer for the Unicode representations.
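The index problem described in this answer can also be handled without a regex, by remembering where each kept character came from while the diacritics are stripped. This is just a sketch under assumptions, not the code from this answer: the class `IndexedStrip` is hypothetical, a "diacritic" is taken to be any char in U+064B to U+065F or U+0670, and the query is assumed to be typed without diacritics.

```java
final class IndexedStrip {
    /** Assumption: a "diacritic" is any char in U+064B..U+065F or U+0670. */
    static boolean isMark(char c) {
        return (c >= 0x064B && c <= 0x065F) || c == 0x0670;
    }

    /**
     * Finds the query in the text while ignoring marks in the text.
     * Returns {start, end} as indices into the ORIGINAL text (end exclusive),
     * or null when there is no match.
     */
    static int[] find(String text, String query) {
        StringBuilder stripped = new StringBuilder();
        int[] origin = new int[text.length()]; // stripped index -> original index
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!isMark(c)) {
                origin[stripped.length()] = i;
                stripped.append(c);
            }
        }
        int hit = stripped.indexOf(query);
        if (hit < 0 || query.isEmpty()) {
            return null;
        }
        int start = origin[hit];
        int end = origin[hit + query.length() - 1] + 1;
        // Extend the end over marks that trail the last matched letter.
        while (end < text.length() && isMark(text.charAt(end))) {
            end++;
        }
        return new int[] {start, end};
    }
}
```

On Android, the returned range could then be passed to setSpan on a Spannable built from the original, unmodified text, so the displayed text keeps its diacritics.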