Ignoring a prefix in a JavaToken combinator parser

Date: 2015-05-05 11:48:06

Tags: scala parser-combinators

I am trying to use a JavaToken combinator parser to pull out a particular match that sits in the middle of a larger string (i.e. ignore a random set of prefix characters). However, I cannot get it to work and think I am being caught out by a greedy parser and/or CR/LFs. (The prefix characters can be basically anything.) I have:

class RuleHandler extends JavaTokenParsers {

  def allowedPrefixChars = """[a-zA-Z0-9=*+-/<>!\_(){}~\\s]*""".r

  def findX: Parser[Double] = allowedPrefixChars ~ "(x=" ~> floatingPointNumber <~ ")" ^^ { case num => num.toDouble}

}

and then in my test case ..

    "when looking for the X value" in {
  "must find and correctly interpret X" in {
    val testString =
      """
        |Looking (only)
        |for (x=45) within
        |this string
      """.stripMargin
    val answer = ruleHandler.parse(ruleHandler.findX, testString)
    System.out.println(" X value is : " + answer.toString)
  }
}

I think it is similar to this SO question. Can anyone see what is wrong? Thanks.

1 Answer:

Answer 0 (score: 2):

First of all, you shouldn't escape "\s" twice inside """ """:

def allowedPrefixChars = """[a-zA-Z0-9=*+-/<>!\_(){}~\s]*?""".r

In your case it is interpreted as the two separate symbols "\" and "s", not as \s.
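For reference, a quick REPL-style check of that escaping difference (a hypothetical snippet, not part of the original answer):

"""[\s]+""".r.findFirstIn("see spot run")   // Some(" ") - \s really is the whitespace class
"""[\\s]+""".r.findFirstIn("see spot run")  // Some("s") - \\s puts a literal backslash and the letter s in the character class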

Secondly, your allowedPrefixChars parser includes (x=, so it captures the whole string, (x= included, and leaves nothing for the parsers that follow.
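A rough sketch of what goes wrong (the Demo object below is hypothetical, not from the answer, and uses the corrected \s so only the greediness is at play): a regex parser matches as much as it can at the current position and is not retried with a shorter match when the literal "(x=" that follows fails.

import scala.util.parsing.combinator.JavaTokenParsers

object Demo extends JavaTokenParsers {
  // same character class as the question, with the single-escaped \s
  val prefix = """[a-zA-Z0-9=*+-/<>!\_(){}~\s]*""".r
  val findX  = prefix ~ "(x=" ~> floatingPointNumber <~ ")"
}

// `prefix` alone already consumes the entire input, "(x=45)" included,
// so parsing `findX` fails: the "(x=" literal has nothing left to match.
Demo.parse(Demo.prefix, "Looking (only) for (x=45) within this string")
Demo.parse(Demo.findX,  "Looking (only) for (x=45) within this string")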

The solution is to be more specific about the prefix you want:

object ruleHandler extends JavaTokenParsers {

  def allowedPrefixChar: Parser[String] = """[a-zA-Z0-9=*+-/<>!\_){}~\s]""".r //no "(" here

  def findX: Parser[Double] = rep(allowedPrefixChar | "\\((?!x=)".r ) ~ "(x=" ~> floatingPointNumber <~ ")" ^^ { case num => num.toDouble}
}

ruleHandler.parse(ruleHandler.findX, testString)
res14: ruleHandler.ParseResult[Double] = [3.11] parsed: 45.0

I have told the parser to ignore any "(" that is not followed by x=; the "\\((?!x=)".r is just a negative lookahead.
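A small illustration of that lookahead on its own (hypothetical REPL lines, results assumed):

"\\((?!x=)".r.findFirstIn("(only)")   // Some("(") - this "(" can be skipped as prefix
"\\((?!x=)".r.findFirstIn("(x=45)")   // None      - this "(" is left for the "(x=" literal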

Alternative:

"""\(x=(.*?)\)""".r.findAllMatchIn(testString).map(_.group(1).toDouble).toList
res22: List[Double] = List(45.0)

If you want to use parsers properly, I would recommend describing the whole BNF grammar (with all the possible (..=..) and (only) usages), not just a fragment. For instance, if the keyword (only) carries meaning, include it in your parser; a fragment like "(" ~> valueName <~ "=" ~ value would get you a value. Don't forget that scala-parser is meant to return an AST to you, not just some matched value. Plain regexps are better suited to regular matching over unstructured data.

An example of how it could be done with parsers (I have not tried to compile it):

trait Command
case class Rule(name: String, value: Double) extends Command
case class Directive(name: String) extends Command

class RuleHandler extends JavaTokenParsers { // why `JavaTokenParsers` (not `RegexParsers`) if you don't use tokens from the Java Language Specification?

  def string = """[a-zA-Z0-9*+-/<>!\_{}~\s]*""".r // still not ideal: you should use some predefined Java-like literals from JavaTokenParsers

  def rule = "(" ~> (string <~ "=") ~ string <~ ")" ^^ { case name ~ num => Rule(name, num.toDouble) }

  def directive = "(" ~> string <~ ")" ^^ { case name => Directive(name) }

  def commands: Parser[List[Command]] = repsep(rule | directive, string)

}
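A hypothetical way this sketch might be exercised (again not compiled; the input and the resulting AST are only illustrative assumptions):

val handler = new RuleHandler
val result = handler.parseAll(handler.commands, "(only) separated by (x=45)")
// if the grammar behaves as intended, result.get would be roughly
// List(Directive("only"), Rule("x", 45.0))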

If you need to process natural language (Chomsky type-0), scalanlp or something similar is a better fit.