Java: extracting substrings from a sentence

Asked: 2017-09-19 11:16:35

Tags: java

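There are word combinations such as "is", "is not" and "does not contain". We have to match these words in a sentence and split the sentence after each match.

Input: if name is tom and age is not 45 or name does not contain tom then let me know.

Expected output: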

If name is 
tom and age is not 
45 or name does not contain 
tom then let me know

I tried the code below to split and extract, but "is" also occurs inside "is not", and my code cannot tell the two apart:

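// The declarations of operators and str were not shown in the post; the two fields below are assumed.
static List<String> operators = new ArrayList<>();
static String str = "if name is tom and age is not 45 or name does not contain tom then let me know.";

public static void loadOperators() {
    operators.add("is");
    operators.add("is not");
    operators.add("does not contain");
}

public static void main(String[] args) {
    loadOperators();
    for (String s : operators) {
        // Splits on each operator in turn; "is" also matches inside "is not"
        System.out.println(str.split(s).length - 1);
    }
}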

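3 answers:

Answer 0: (score: 0)

split will not solve your use case, because a word may occur several times and, for example, "is" and "is not" are different operators for you. Ideally, you would: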

Iterate:
1. Find the index of the operator.
2. Search for the next space or word after it.
3. Then update your string to the substring from that index to the end.
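The steps above can be sketched roughly as follows. This is only one possible reading of the answer: the class OperatorScanner and the method splitAfterOperators are hypothetical names, and the rule "the earliest match wins, and on a tie the longer operator wins" is an assumption added so that "is not" is not cut after "is".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OperatorScanner {

    // Hypothetical helper: cuts the sentence after every operator occurrence.
    static List<String> splitAfterOperators(String input, List<String> operators) {
        List<String> parts = new ArrayList<>();
        String rest = input;
        while (true) {
            int bestIndex = -1;
            String bestOperator = null;
            for (String op : operators) {
                // Surrounding spaces make sure only whole words are matched.
                int idx = rest.indexOf(" " + op + " ");
                if (idx < 0) {
                    continue;
                }
                // Assumed rule: earliest match wins; on a tie the longer operator wins.
                if (bestOperator == null || idx < bestIndex
                        || (idx == bestIndex && op.length() > bestOperator.length())) {
                    bestIndex = idx;
                    bestOperator = op;
                }
            }
            if (bestOperator == null) {
                break; // no operator left in the remaining text
            }
            // Skip the leading space, then include the whole operator.
            int end = bestIndex + 1 + bestOperator.length();
            parts.add(rest.substring(0, end).trim());
            rest = rest.substring(end); // continue with the remainder
        }
        parts.add(rest.trim());
        return parts;
    }

    public static void main(String[] args) {
        String input = "if name is tom and age is not 45 or name does not contain tom then let me know.";
        List<String> operators = Arrays.asList("is", "is not", "does not contain");
        for (String part : splitAfterOperators(input, operators)) {
            System.out.println(part);
        }
    }
}
```

For the input from the question this prints "if name is", "tom and age is not", "45 or name does not contain" and "tom then let me know." on separate lines.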

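Answer 1: (score: 0)

I am not entirely sure what you are trying to achieve, but let's give it a try.

For your case, a simple "workaround" might work quite well: sort the operators by their length. That way the "largest" match is found first. You can define "largest" either literally as the longest string or, preferably, as the number of words (the number of spaces it contains), so that "is a" takes precedence over "contains".

You need to make sure that no matches overlap. This can be done by comparing the start and end indices of all matches and discarding overlaps by some criterion, for example "first match wins".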
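A minimal sketch of this idea, under a few assumptions: the names OverlapFilter, Match, findMatches and split are hypothetical, "largest" is taken to mean the longest string, and the overlap criterion used here is "the longer operator wins" (it is processed first) rather than "first match wins".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OverlapFilter {

    // Half-open range [start, end) of one operator occurrence in the input.
    static final class Match {
        final int start;
        final int end;

        Match(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    static List<Match> findMatches(String input, List<String> operators) {
        List<String> sorted = new ArrayList<>(operators);
        // Longest operator first, so "is not" is accepted before "is".
        sorted.sort((a, b) -> Integer.compare(b.length(), a.length()));

        List<Match> accepted = new ArrayList<>();
        for (String op : sorted) {
            String needle = " " + op + " "; // whole words only
            int idx;
            int from = 0;
            while ((idx = input.indexOf(needle, from)) >= 0) {
                int start = idx + 1;
                int end = start + op.length();
                boolean overlaps = false;
                for (Match m : accepted) {
                    if (start < m.end && m.start < end) {
                        overlaps = true; // discard: a longer operator already covers this range
                        break;
                    }
                }
                if (!overlaps) {
                    accepted.add(new Match(start, end));
                }
                from = idx + 1;
            }
        }
        accepted.sort((a, b) -> Integer.compare(a.start, b.start));
        return accepted;
    }

    // Cuts the input after every accepted match.
    static List<String> split(String input, List<String> operators) {
        List<String> parts = new ArrayList<>();
        int last = 0;
        for (Match m : findMatches(input, operators)) {
            parts.add(input.substring(last, m.end).trim());
            last = m.end;
        }
        parts.add(input.substring(last).trim());
        return parts;
    }

    public static void main(String[] args) {
        String input = "if name is tom and age is not 45 or name does not contain tom then let me know.";
        for (String part : split(input, Arrays.asList("is", "is not", "does not contain"))) {
            System.out.println(part);
        }
    }
}
```

Because the longest operators are searched first, the order in which the caller lists the operators no longer matters in this sketch.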

Answer 2: (score: 0)

This code does what you seem to want (or what I guess you want):

public static void main(String[] args) {
    List<String> operators = new ArrayList<>();
    operators.add("is");
    operators.add("is not");
    operators.add("does not contain");

    String input = "if name is tom and age is not 45 or name does not contain tom then let me know.";
    List<String> output = new ArrayList<>();

    int lastFoundOperatorsEndIndex = 0; // First start at the beginning of input

    for (String operator : operators){
        int indexOfOperator = input.indexOf(operator, lastFoundOperatorsEndIndex); // Find current operator's position, searching only after the previous match

        if (indexOfOperator > -1) { // If operator was found
            int thisOperatorsEndIndex = indexOfOperator + operator.length(); // Get length of operator and add it to the index to include operator
            output.add(input.substring(lastFoundOperatorsEndIndex, thisOperatorsEndIndex).trim()); // Add text up to and including the operator (and trim surrounding spaces)
            lastFoundOperatorsEndIndex = thisOperatorsEndIndex; // Update startindex for next operator
        }
    }
    output.add(input.substring(lastFoundOperatorsEndIndex, input.length()).trim()); // Add rest of input as last entry to output

    for (String part : output) { // Output to console
        System.out.println(part);
    }
}

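But it depends heavily on the order of the operators, both in the sentence and in the list. If we are talking about user input, the task becomes considerably more complex.

A better approach is to use regular expressions (regex):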

public static void main(String... args) {
    // Define inputs
    String input1 = "if name is tom and age is not 45 or name does not contain tom then let me know.";
    String input2 = "the name is tom and he is 22 years old but the name does not contain jack, but merry is 24 year old.";

    // Output split strings
    for (String part : split(input1)) {
        System.out.println(part.trim());
    }

    System.out.println();

    for (String part : split(input2)) {
        System.out.println(part.trim());
    }
}

private static String[] split(String input) {
    // Define list of operators - 'is not' has to precede 'is'!!
    String[] operators = { "\\sis not\\s", "\\sis\\s", "\\sdoes not contain\\s", "\\sdoes contain\\s" };

    // Concatenate operators to regExp-String for search
    StringBuilder searchString = new StringBuilder();

    for (String operator : operators) {
        if (searchString.length() > 0) {
            searchString.append("|");
        }
        searchString.append(operator);
    }

    // Replace all operators by operator+\n and split resulting string at \n-character
    return input.replaceAll("(" + searchString.toString() + ")", "$1\n").split("\n");
}

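Mind the order of the operators! "is not" has to come before "is", or "is not" will always be split.

You can prevent this by using a negative lookahead for the operator "is". So "\\sis\\s" becomes "\\sis(?! not)\\s" (read as: "is", not followed by " not").

A minimalist version (Java 8+, because it uses String.join) could look like this: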

private static String[] split(String input) {
    String[] operators = { "\\sis(?! not)\\s", "\\sis not\\s", "\\sdoes not contain\\s", "\\sdoes contain\\s" };
    return input.replaceAll("(" + String.join("|", operators) + ")", "$1\n").split("\n");
}
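Note that the replaceAll/split trick uses "\n" as a marker, so an input that already contains line breaks would be split there as well. A possible variant that avoids the marker walks over the matches with java.util.regex.Matcher instead. This is only a sketch: the class name RegexSplitter is hypothetical, and the pattern simply reuses the operators from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexSplitter {

    // Same operators as above; the lookahead keeps "is" from matching inside "is not".
    private static final Pattern OPERATORS = Pattern.compile(
            "\\sis(?! not)\\s|\\sis not\\s|\\sdoes not contain\\s|\\sdoes contain\\s");

    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        Matcher matcher = OPERATORS.matcher(input);
        int start = 0;
        while (matcher.find()) {
            // Take everything up to the end of the operator, then continue after it.
            parts.add(input.substring(start, matcher.end()).trim());
            start = matcher.end();
        }
        parts.add(input.substring(start).trim()); // remainder after the last operator
        return parts;
    }

    public static void main(String[] args) {
        String input = "if name is tom and age is not 45 or name does not contain tom then let me know.";
        for (String part : split(input)) {
            System.out.println(part);
        }
    }
}
```

Compiling the pattern once in a static field also avoids rebuilding the regex on every call.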