Question

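I am using Apache Spark for NLP (natural language processing) and LDA. Before running the LDA model, I have a module called "CorrectEmoticon". I have a dictionary for this purpose, and the file looks like this: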

:) \t smile
;) \t blink
...

After that, to apply this dictionary in my module, I tried:

public static String correctSentence(String sentence) {
    String rplace = sentence.replaceAll("[,.]", "");
    String[] split = rplace.split(" ");
    StringBuilder sb = new StringBuilder();
    for (String inputStr : split) {
        sb.append(correctWord(inputStr));
        sb.append(" ");
    }
    return sb.toString();
}

private static String correctWord(String word) {
    word = getDefination(word);
    return word;
}

public static String getDefination(String word) {
    List<String> foundList = dict.lookup(word.trim().toLowerCase());
    if (foundList != null && !foundList.isEmpty()) {
        return foundList.get(0);
    }
    return word;
}

The variable dict is:

private static JavaPairRDD<String, String> dict;

"dict" contains the emoticons and their meanings.

However, with this algorithm it runs very slowly. Could you help me correct the algorithm to improve its performance? Thank you very much.

Answer 1

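You are doing a dictionary lookup for every token you process. Since dict is a JavaPairRDD, each lookup call is a separate Spark action, so this is far from optimal. Ideally, you would build a character-based graph of the dictionary and use it as a finite state automaton.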
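A minimal sketch of the character-graph idea, assuming the emoticon dictionary is small enough to load into memory once. The class name EmoticonTrie and its methods are hypothetical, not from the original post; the trie plays the role of the automaton, and the text is scanned once, taking the longest emoticon that starts at each position.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not from the original post): a character trie over
// the emoticon dictionary, walked like a simple finite state automaton.
class EmoticonTrie {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        String meaning; // non-null only where an emoticon ends
    }

    private final Node root = new Node();

    // Register one dictionary entry, e.g. add(":)", "smile").
    void add(String emoticon, String meaning) {
        if (emoticon.isEmpty()) {
            return; // ignore empty keys
        }
        Node node = root;
        for (int i = 0; i < emoticon.length(); i++) {
            node = node.next.computeIfAbsent(emoticon.charAt(i), c -> new Node());
        }
        node.meaning = meaning;
    }

    // Single left-to-right scan: at each position take the longest emoticon
    // starting there, otherwise copy the character unchanged.
    String replaceAll(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            Node node = root;
            String best = null;
            int bestEnd = i;
            for (int j = i; j < text.length(); j++) {
                node = node.next.get(text.charAt(j));
                if (node == null) {
                    break; // no emoticon continues with this character
                }
                if (node.meaning != null) {
                    best = node.meaning;
                    bestEnd = j + 1;
                }
            }
            if (best != null) {
                out.append(best);
                i = bestEnd;
            } else {
                out.append(text.charAt(i));
                i++;
            }
        }
        return out.toString();
    }
}
```

Note that this sketch matches emoticons anywhere in the text, not only at token boundaries, and it does not reproduce the punctuation stripping or lowercasing done in correctSentence; whether that is acceptable depends on the dictionary.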

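That is a somewhat involved solution, but you can also use some simpler heuristics to make it faster, such as building a character set of all first emoticon characters, limiting the token length, using a clever regular expression (if your emoticons follow a pattern), skipping letter-only tokens (which are obviously not emoticons), and so on.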
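Some of these heuristics can be sketched as follows, assuming the dictionary is held in a plain java.util.Map rather than queried through the RDD. The class name EmoticonFilter is hypothetical; it applies the token length limit, the first-character set, and the letter-only check before touching the map. The letter-only shortcut is applied only when no dictionary key consists solely of letters, since otherwise it would skip valid entries.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper (not from the original post): cheap pre-checks that
// reject most tokens before an in-memory dictionary lookup.
class EmoticonFilter {
    private final Map<String, String> dict = new HashMap<>();
    private final Set<Character> firstChars = new HashSet<>();
    private int maxLength = 0;
    private boolean hasLetterOnlyKey = false;

    EmoticonFilter(Map<String, String> entries) {
        for (Map.Entry<String, String> e : entries.entrySet()) {
            String key = e.getKey().trim().toLowerCase();
            if (key.isEmpty()) {
                continue;
            }
            dict.put(key, e.getValue());
            firstChars.add(key.charAt(0));
            maxLength = Math.max(maxLength, key.length());
            if (isLettersOnly(key)) {
                hasLetterOnlyKey = true;
            }
        }
    }

    // Returns the meaning for a known emoticon, otherwise the word unchanged.
    String correctWord(String word) {
        String token = word.trim().toLowerCase();
        // Heuristic 1: token length limit.
        if (token.isEmpty() || token.length() > maxLength) {
            return word;
        }
        // Heuristic 2: the first character must start some emoticon.
        if (!firstChars.contains(token.charAt(0))) {
            return word;
        }
        // Heuristic 3: letter-only tokens cannot be emoticons, unless the
        // dictionary itself contains letter-only keys.
        if (!hasLetterOnlyKey && isLettersOnly(token)) {
            return word;
        }
        return dict.getOrDefault(token, word);
    }

    private static boolean isLettersOnly(String token) {
        for (int i = 0; i < token.length(); i++) {
            if (!Character.isLetter(token.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

An instance would be built once from the tab-separated dictionary file and its correctWord method called in place of getDefination; how the map reaches the Spark workers is left open here.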

How to improve spell-checking performance with Apache Spark
