Question

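I am trying to tokenize some text files into words and wrote this code. It works perfectly for English, but when I try it on Arabic text it does not work. I added UTF-8 to read the Arabic files. What am I missing?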

public void parseFiles(String filePath) throws FileNotFoundException, IOException {
    File[] allfiles = new File(filePath).listFiles();
    BufferedReader in = null;
    for (File f : allfiles) {
        if (f.getName().endsWith(".txt")) {
            fileNameList.add(f.getName());
            Reader fstream = new InputStreamReader(new FileInputStream(f),"UTF-8"); 
           // BufferedReader br = new BufferedReader(fstream);
            in = new BufferedReader(fstream);
            StringBuilder sb = new StringBuilder();
            String s=null;
            String word = null;
            while ((s = in.readLine()) != null) {
                Scanner input = new Scanner(s);
                while (input.hasNext()) {
                    word = input.next();
                    if (stopword.isStopword(word)) {
                        word = word.replace(word, "");
                    }

                    //String stemmed=stem.stem (word);
                    sb.append(word + "\t");
                }
                //System.out.print(sb);  // here the Arabic text is output without stopwords
            }
            String[] tokenizedTerms = sb.toString().replaceAll("[\\W&&[^\\s]]", "").split("\\W+");   //to get individual terms

            for (String term : tokenizedTerms) {
                if (!allTerms.contains(term)) {  //avoid duplicate entry
                    allTerms.add(term);
                    System.out.print(term+"\t");  //here the problem.
                }
            }
            termsDocsArray.add(tokenizedTerms);
        }
    }

}

Any ideas that would help me move forward would be appreciated. Thanks.

Answer 1

The problem is that your regular expression works for English but not for Arabic, because by definition

[\\W&&[^\\s]]

means

// matches any single non-word character except whitespace.
\W  A non-word character: by default, anything other than [a-zA-Z_0-9]. (Arabic chars all satisfy this condition.)
\s  A whitespace character, short for [ \t\n\x0b\r\f]

So, by this logic, the regex matches every Arabic character. When you write

sb.toString().replaceAll("[\\W&&[^\\s]]", "")

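it means: replace every non-word character that is not whitespace with "". For Arabic text, that is every character, so all the Arabic characters are replaced with "" and you get no output. You will have to adjust this regex to work for Arabic text, or simply split the string on whitespace, like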

sb.toString().split("\\s+")

which will give you an array of the Arabic words separated by whitespace.
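If you still want to strip punctuation before splitting, one option is to describe "word characters" with Unicode properties instead of \W. The following is only a sketch under my own assumptions: the class name ArabicTokenizerSketch and the method tokenize are made up for illustration, and the sample string is written with Unicode escapes so it compiles regardless of the source file encoding. Note that, unlike \w, this character class does not keep the underscore.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original code in the question.
public class ArabicTokenizerSketch {

    // Keep letters (\p{L}), combining marks (\p{M}, e.g. Arabic diacritics),
    // decimal digits (\p{Nd}) and whitespace; drop everything else, then
    // split on runs of whitespace.
    static List<String> tokenize(String text) {
        String cleaned = text.replaceAll("[^\\p{L}\\p{M}\\p{Nd}\\s]", "");
        List<String> tokens = new ArrayList<>();
        for (String token : cleaned.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "marhaban bil-aalam" followed by an Arabic comma, then English text.
        String sample = "\u0645\u0631\u062D\u0628\u0627 "
                + "\u0628\u0627\u0644\u0639\u0627\u0644\u0645\u060C hello world!";

        // The regex from the question: only the English words survive.
        System.out.println(sample.replaceAll("[\\W&&[^\\s]]", ""));

        // The Unicode-aware version keeps both the Arabic and English words.
        System.out.println(tokenize(sample));
    }
}
```

Java's Pattern class also has a UNICODE_CHARACTER_CLASS flag that makes \w and \W Unicode-aware, which may be a smaller change to the existing regex. Either way, remember that the later split("\\W+") in the question has the same ASCII-only problem.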

Answer 2

In addition to the character-encoding concerns in bgth's reply, tokenizing Arabic has an added complication: words are not necessarily whitespace-separated:

http://www1.cs.columbia.edu/~rambow/papers/habash-rambow-2005a.pdf

If you are not familiar with Arabic, you will need to read up on some of the approaches to tokenization:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.120.9748

Tokenize Arabic text file in Java
