Split a string into an array

Time: 2017-02-13 04:06:39

Tags: java regex string split

I have these strings:

wordsExpanded="test |  is |  [(thirty four) {<number_type_0 words>}( 3  4 ) {<number_type_0 digits>}] |  test |  [(three) {<number_type_1 words>}( 3 ) {<number_type_1 digits>}] |  [(one) {<number_type_2 words>}( 1 ) {<number_type_2 digits>}]"

interpretation="{<number_type_2 digits> <number_type_1 digits> <number_type_0 words>}"

The output I need is a string like this:

finalOutput="test |  is | thirty four | test | 3 | 1 "

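Basically, the interpretation string carries the information needed to decide which group to use. For the first group we use <number_type_0 words>, so the correct string is "(thirty four)" rather than "( 3  4 )". The second is "( 3 )", and the third is "( 1 )".

This is my code so far: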
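To make the selection rule concrete, here is a minimal sketch that turns the interpretation string into a lookup from number type to the chosen form. The class name `InterpretationMap`, the method `parse`, and the `TAG` pattern are hypothetical names introduced only for this illustration; it assumes every tag has the shape `<name kind>` as in the strings above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InterpretationMap {
    // Matches tags such as <number_type_0 words>; group 1 is the type, group 2 the form.
    private static final Pattern TAG = Pattern.compile("<(\\w+)\\s+(\\w+)>");

    // Returns e.g. number_type_0 -> words, in the order the tags appear.
    static Map<String, String> parse(String interpretation) {
        Map<String, String> chosen = new LinkedHashMap<>();
        Matcher m = TAG.matcher(interpretation);
        while (m.find()) {
            chosen.put(m.group(1), m.group(2));
        }
        return chosen;
    }

    public static void main(String[] args) {
        String interpretation =
            "{<number_type_2 digits> <number_type_1 digits> <number_type_0 words>}";
        // prints {number_type_2=digits, number_type_1=digits, number_type_0=words}
        System.out.println(parse(interpretation));
    }
}
```

With such a map, deciding between "(thirty four)" and "( 3  4 )" becomes a lookup on `number_type_0`.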

package com.test.prova;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Prova {

    public static void main(String[] args) {
        String nlInterpretation="{<number_type_2 digits> <number_type_1 digits> <number_type_0 words>}";
        String inputText="this is 34 test 3 1";
        String grammar="test is [(thirty four) {<number_type_0 words>}( 3  4 ) {<number_type_0 digits>}] test [(three) {<number_type_1 words>}( 3 ) {<number_type_1 digits>}] [(one) {<number_type_2 words>}( 1 ) {<number_type_2 digits>}]";

        List<String> matchList = new ArrayList<String>();
        Pattern regex = Pattern.compile("[^\\s\"'\\[]+|\\[([^\\]]*)\\]|'([^']*)'");
        Matcher regexMatcher = regex.matcher(grammar);
        while (regexMatcher.find()) {
            if (regexMatcher.group(1) != null) {
                matchList.add(regexMatcher.group(1));
            } else if (regexMatcher.group(2) != null) {
                matchList.add(regexMatcher.group(2));
            } else {
                matchList.add(regexMatcher.group());
            }
        } 

        String[] xx = matchList.toArray(new String[0]);
        String[] yy = inputText.split(" ");

        matchList = new ArrayList<String>();
        regex = Pattern.compile("[^<]+|<([^>]*)>");
        regexMatcher = regex.matcher(nlInterpretation);
        while (regexMatcher.find()) {
            if (regexMatcher.group(1) != null) {
                matchList.add(regexMatcher.group(1));
            }
        } 
        String[] zz = matchList.toArray(new String[0]);
        System.out.println(String.join(" | ",zz));

        for (int i=0; i<xx.length; i++) {
            if (xx[i].contains("number_type_")) {
                matchList = new ArrayList<String>();
                regex = Pattern.compile("[^\\(]+|<([^\\)]*)>.*[^<]+|<([^>]*)>");
                regexMatcher = regex.matcher(xx[i]);
                while (regexMatcher.find()) {
                    if (regexMatcher.group(1) != null) {
                        matchList.add(regexMatcher.group(1));
                    } else if (regexMatcher.group(2) != null) {
                        matchList.add(regexMatcher.group(2));
                    } else {
                        matchList.add(regexMatcher.group());
                    }
                } 
                System.out.println(String.join(" | ",matchList.toArray(new String[0])));
            }
            System.out.printf("%02d\t%s\t->%s\n", i, yy[i], xx[i]);
        }
    }
}

The output it produces is:

number_type_2 digits | number_type_1 digits | number_type_0 words
00  this    ->test
01  is  ->is
thirty four) {<number_type_0 words>} |  3  4 ) {<number_type_0 digits>}
02  34  ->(thirty four) {<number_type_0 words>}( 3  4 ) {<number_type_0 digits>}
03  test    ->test
three) {<number_type_1 words>} |  3 ) {<number_type_1 digits>}
04  3   ->(three) {<number_type_1 words>}( 3 ) {<number_type_1 digits>}
one) {<number_type_2 words>} |  1 ) {<number_type_2 digits>}
05  1   ->(one) {<number_type_2 words>}( 1 ) {<number_type_2 digits>}

What I want is more like this:

number_type_2 digits | number_type_1 digits | number_type_0 words
00  this    ->test
01  is      ->is
02  34      ->thirty four
03  test    ->test
04  3       ->3
05  1       ->1
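One possible way to close the gap between the two outputs is to extract each "(text) {<tag>}" pair from a bracket group, so the text can be looked up by its tag. The sketch below is only an illustration under that assumption; `GroupPairs`, `pairs`, and the `PAIR` pattern are hypothetical names, not part of the code above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupPairs {
    // Group 1: the text inside the parentheses; group 2: the tag inside {<...>}.
    private static final Pattern PAIR =
        Pattern.compile("\\(([^)]*)\\)\\s*\\{<([^>]*)>\\}");

    // Maps each tag (e.g. "number_type_0 words") to its trimmed text.
    static Map<String, String> pairs(String group) {
        Map<String, String> byTag = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(group);
        while (m.find()) {
            byTag.put(m.group(2), m.group(1).trim());
        }
        return byTag;
    }

    public static void main(String[] args) {
        String group =
            "(thirty four) {<number_type_0 words>}( 3  4 ) {<number_type_0 digits>}";
        Map<String, String> byTag = pairs(group);
        System.out.println(byTag.get("number_type_0 words"));  // prints thirty four
    }
}
```

The inner spacing of the digits form is kept as-is, so "( 3  4 )" yields "3  4" rather than "34".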

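2 answers:

Answer 0 (score: 0)

I am writing this solution on the assumption that the format of your interpretation string stays fixed, i.e. {<number_type_2 digits> <number_type_1 digits> <number_type_0 words>}, and does not change.

I will describe both a Java 7 and a Java 8 approach. To be clear, this is a straightforward, naive approach: for every word it loops over every interpretation, so the running time grows with the product of the two. I couldn't think of anything faster on short notice.

Let's walk through the code:

Java 7 style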

    /*
     * STEP 1: Create a method that accepts the wordsExpanded and
     * interpretation Strings.
     */
    public static void parseString(String wordsExpanded, String interoperation) {
        /*
         * STEP 2: Remove the leading and trailing curly braces from
         * the interoperation String.
         */
        interoperation = interoperation.replaceAll("\\{", "");
        interoperation = interoperation.replaceAll("\\}", "");

        /*
         * STEP 3: Split the interoperation String at '>'
         * because we need the individual interoperations, such as
         * "<number_type_2 words", to compare.
         */
        String[] allInterpretations = interoperation.split(">");

        /*
         * STEP 4: Split your wordsExpanded String at '|'
         * to get each word.
         */
        String[] allWordsExpanded = wordsExpanded.split("\\|");

        /*
         * STEP 5: Create a resultant StringBuilder
         */
        StringBuilder resultBuilder = new StringBuilder();

        /*
         * STEP 6: Iterate over each word from wordsExpanded
         * after splitting.
         */
        for(String eachWordExpanded : allWordsExpanded){
            /*
             * STEP 7: Remove leading and trailing spaces.
             */
            eachWordExpanded = eachWordExpanded.trim();
            /*
             * STEP 8: Remove the curly braces.
             */
            eachWordExpanded = eachWordExpanded.replaceAll("\\{", "");
            eachWordExpanded = eachWordExpanded.replaceAll("\\}", "");

            /*
             * STEP 9: Now, iterate over each interoperation.
             */
            for(String eachInteroperation : allInterpretations){
                /*
                 * STEP 10: Remove the leading and trailing spaces
                 * from each interoperation.
                 */
                eachInteroperation = eachInteroperation.trim();

                /*
                 * STEP 11: Now append '>' to the end of each interoperation
                 * because we split them at '>' previously.
                 */
                eachInteroperation = eachInteroperation + ">";

                /*
                 * STEP 12: Check whether the current wordExpanded contains
                 * this interoperation.
                 */
                if(eachWordExpanded.contains(eachInteroperation)){

                    /*
                     * STEP 13: If the interoperation contains
                     * 'words', go to STEP 14.
                     * ELSE go to STEP 18.
                     */
                    if(eachInteroperation.contains("words")){
                        /*
                         * STEP 14: Remove that interoperation from the
                         * current wordExpanded String.
                         * 
                         * Ex: if the interoperation is <number_type_2 words>
                         * and it is found in the wordExpanded, remove it.
                         */
                        eachWordExpanded = eachWordExpanded.replaceAll(eachInteroperation, "");
                        /*
                         * STEP 15: Now change the interoperation to digits.
                         * Ex: IF the interoperation is <number_type_2 words>,
                         * change that to <number_type_2 digits> and also remove them.
                         */
                        eachInteroperation = eachInteroperation.replaceAll("words", "digits");
                        eachWordExpanded = eachWordExpanded.replaceAll(eachInteroperation, "");

                        /*
                         * STEP 16: Remove the leading and trailing square brackets.
                         */
                        eachWordExpanded = eachWordExpanded.replaceAll("\\[", "");
                        eachWordExpanded = eachWordExpanded.replaceAll("\\]", "");

                        /*
                         * STEP 17: Remove any numbers in the form ( 3 ),
                         * since we are dealing with words.
                         */
                        eachWordExpanded = eachWordExpanded.replaceAll("[(0-9)+]", "");
                        eachWordExpanded = eachWordExpanded.replaceAll("(\\s)+", " ");
                    }else{
                        /*
                         * STEP 18: Remove the interoperation just like STEP 14.
                         */
                        eachWordExpanded = eachWordExpanded.replaceAll(eachInteroperation, "");
                        /*
                         * STEP 19: Now change the interoperation to words, just like STEP 15,
                         * since we are dealing with digits here, and then remove it from
                         * the current wordExpanded String.
                         */
                        eachInteroperation = eachInteroperation.replaceAll("digits", "words");
                        eachWordExpanded = eachWordExpanded.replaceAll(eachInteroperation, "");

                        /*
                         * STEP 20: Remove the leading and trailing square brackets.
                         */
                        eachWordExpanded = eachWordExpanded.replaceAll("\\[", "");
                        eachWordExpanded = eachWordExpanded.replaceAll("\\]", "");
                        /*
                         * STEP 21: Remove the words in the form '(thirty four)'
                         */
                        eachWordExpanded = eachWordExpanded.replaceAll("[(A-Za-z)+]", "");
                        eachWordExpanded = eachWordExpanded.replaceAll("\\s", "");
                    }
                }else{
                    continue;
                }
            }
            /*
             * STEP 22: Build your result object
             */
            resultBuilder.append(eachWordExpanded + "|");
        }
        /*
         * FINAL RESULT
         */
        System.out.println(resultBuilder.toString());
}

The equivalent Java 8 style is as follows:

import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

public static void parseString(String wordsExpanded, String interoperation) {
    // Strip the curly braces from the interoperation String.
    interoperation = interoperation.replaceAll("\\{", "");
    interoperation = interoperation.replaceAll("\\}", "");

    // Split at '>' and re-append it so each entry looks like "<number_type_2 words>".
    Set<String> allInterOperations = Arrays.asList(interoperation.split(">"))
        .stream()
        .map(eachInterOperation -> {
            eachInterOperation = eachInterOperation.trim();
            eachInterOperation = eachInterOperation + ">";
            return eachInterOperation;
        }).collect(Collectors.toSet());

    String result = Arrays.asList(wordsExpanded.split("\\|"))
        .stream()
        .map(eachWordExpanded -> {
            eachWordExpanded = eachWordExpanded.trim();
            eachWordExpanded = eachWordExpanded.replaceAll("\\{", "");
            eachWordExpanded = eachWordExpanded.replaceAll("\\}", "");

            for(String eachInterOperation : allInterOperations){
                if(eachWordExpanded.contains(eachInterOperation)){
                    if(eachInterOperation.contains("words")){
                        // Keep the words form: drop both tags, the brackets and the digits.
                        eachWordExpanded = eachWordExpanded.replaceAll(eachInterOperation, "");
                        eachInterOperation = eachInterOperation.replaceAll("words", "digits");
                        eachWordExpanded = eachWordExpanded.replaceAll(eachInterOperation, "");
                        eachWordExpanded = eachWordExpanded.replaceAll("\\[", "");
                        eachWordExpanded = eachWordExpanded.replaceAll("\\]", "");
                        eachWordExpanded = eachWordExpanded.replaceAll("[(0-9)+]", "");
                        eachWordExpanded = eachWordExpanded.replaceAll("(\\s)+", " ");
                    }else{
                        // Keep the digits form: drop both tags, the brackets and the letters.
                        eachWordExpanded = eachWordExpanded.replaceAll(eachInterOperation, "");
                        eachInterOperation = eachInterOperation.replaceAll("digits", "words");
                        eachWordExpanded = eachWordExpanded.replaceAll(eachInterOperation, "");
                        eachWordExpanded = eachWordExpanded.replaceAll("\\[", "");
                        eachWordExpanded = eachWordExpanded.replaceAll("\\]", "");
                        eachWordExpanded = eachWordExpanded.replaceAll("[(A-Za-z)+]", "");
                        eachWordExpanded = eachWordExpanded.replaceAll("\\s", "");
                    }
                }
            }
            return eachWordExpanded;
        }).collect(Collectors.joining("|"));

    System.out.println(result);
}

Running the following tests against the methods above, with different interoperation strings:

{<number_type_2 words> <number_type_1 words> <number_type_0 words>}
{<number_type_2 digits> <number_type_1 words> <number_type_0 words>}
{<number_type_2 digits> <number_type_1 digits> <number_type_0 digits>}
{<number_type_2 words> <number_type_1 digits> <number_type_0 digits>}

produces these results (Java 7 result):
test|is|thirty four |test|three |one |
test|is|thirty four |test|three |1|
test|is|34|test|3|1|
test|is|34|test|3|one |

(Java 8 result)

test|is|thirty four|test|three|one
test|is|thirty four|test|three|1
test|is|34|test|3|1
test|is|34|test|3|one

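I hope this is what you were trying to achieve.

Answer 1 (score: 0)

Thanks, everyone. Based on Shyam's code, I made a few modifications so that it returns exactly what I need.

Here is my new code: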
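One detail worth noting about the code above: `replaceAll` treats its first argument as a regular expression. The tags used here contain no regex metacharacters, so it works, but if that ever changed, a literal replacement would be safer. The sketch below shows two standard options; the class `LiteralTagRemoval` and its method names are hypothetical and used only for illustration.

```java
import java.util.regex.Pattern;

class LiteralTagRemoval {
    // String.replace treats the tag as plain text, not as a regex.
    static String stripLiteral(String text, String tag) {
        return text.replace(tag, "");
    }

    // Same effect with replaceAll, after quoting the tag as a literal pattern.
    static String stripQuoted(String text, String tag) {
        return text.replaceAll(Pattern.quote(tag), "");
    }

    public static void main(String[] args) {
        String word = "[(one) <number_type_2 words>( 1 ) <number_type_2 digits>]";
        // prints [(one) ( 1 ) <number_type_2 digits>]
        System.out.println(stripLiteral(word, "<number_type_2 words>"));
    }
}
```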

    public static String parseString(String grammar, String interoperation) {
        if (grammar==null || interoperation == null || interoperation.equals("{}"))
            return null;

        List<String> matchList = new ArrayList<String>();
        Pattern regex = Pattern.compile("[^\\s\"'\\[]+|\\[([^\\]]*)\\]|'([^']*)'");
        Matcher regexMatcher = regex.matcher(grammar);
        while (regexMatcher.find()) {
            if (regexMatcher.group(1) != null) {
                matchList.add(regexMatcher.group(1));
            } else if (regexMatcher.group(2) != null) {
                matchList.add(regexMatcher.group(2));
            } else {
                matchList.add(regexMatcher.group());
            }
        } 

        String[] xx = matchList.toArray(new String[0]);
        String wordsExpanded = String.join(" | ",xx);

        interoperation= interoperation.replaceAll("\\{", "")
                                        .replaceAll("\\}", "");

        Set<String> allInterOperations = Arrays.asList(interoperation.split(">"))
            .stream()
            .map(eachInterOperation -> {
            eachInterOperation = eachInterOperation.trim();
            eachInterOperation = eachInterOperation + ">";
            return eachInterOperation;
        }).collect(Collectors.toSet());

        String result = Arrays.asList(wordsExpanded.split("\\|"))
            .stream()
            .map(eachWordExpanded -> {
                eachWordExpanded = eachWordExpanded.trim();
                eachWordExpanded = eachWordExpanded.replaceAll("\\{", "");
                eachWordExpanded = eachWordExpanded.replaceAll("\\}", "");

                for(String eachInterOperation : allInterOperations){
                    if(eachWordExpanded.contains(eachInterOperation)){
                        Pattern pattern = Pattern.compile("(\\(.*?\\))\\s*(<.*?>)");
                        Matcher matcher = pattern.matcher(eachWordExpanded);
                        while (matcher.find()) {
                            if (matcher.group(2).equals(eachInterOperation)) 
                                eachWordExpanded = matcher.group(1).replaceAll("[\\(\\)]", "").trim();
                        }
                    }else{
                        continue;
                    }
                }
                return eachWordExpanded;
            }).collect(Collectors.joining("|"));

        return result;
    }   

}

The output is as follows:

Input:

interoperation="{<number_type_2 digits> <number_type_1 digits> <number_type_0 words>}";

grammar="test is [(thirty four) {<number_type_0 words>}( 3  4 ) {<number_type_0 digits>}] test [(three) {<number_type_1 words>}( 3 ) {<number_type_1 digits>}] [(one) {<number_type_2 words>}( 1 ) {<number_type_2 digits>}]";

test|is|thirty four|test|3|1

Input:

grammar="test is [(thirty four) {<number_type_0 words>}( three  four ) {<number_type_0 digits>}] test [(three) {<number_type_1 words>}( 3 ) {<number_type_1 digits>}] [(one) {<number_type_2 words>}( 1 ) {<number_type_2 digits>}]";

test|is|thirty four|test|3|1

Input:

interoperation="{<number_type_4 digits> <number_type_3 digits> <number_type_2 words> <number_type_1 words> <number_type_0 words>}";
grammar="test [(thirty four) {<number_type_0 words>}( 3  4 ) {<number_type_0 digits>}] test [(three) {<number_type_1 words>}( 3 ) {<number_type_1 digits>}] [(one) {<number_type_2 words>}( 1 ) {<number_type_2 digits>}]";

test|thirty four|test|three|one

Input:

grammar = "this is my test [(three hundred forty one) {<number_type_0 words>}( 3  4  1 ) {<number_type_0 digits>}] for [(twenty one) {<number_type_1 words>}( 2  1 ) {<number_type_1 digits>}] issues";
interoperation= "{<number_type_1 digits> <number_type_0 words>}";

this|is|my|test|three hundred forty one|for|2 1|issues
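For comparison, the same selection could also be sketched as a single pass over the grammar with `Matcher.appendReplacement`, replacing each bracket group in place. This is only an illustrative alternative, not the code above: `OnePassSketch`, `resolve`, and the three patterns are hypothetical names, and the result is space-separated rather than joined with '|'.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OnePassSketch {
    private static final Pattern GROUP = Pattern.compile("\\[([^\\]]*)\\]");
    private static final Pattern PAIR = Pattern.compile("\\(([^)]*)\\)\\s*\\{(<[^>]*>)\\}");
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String resolve(String grammar, String interpretation) {
        // Collect the wanted tags, e.g. "<number_type_0 words>".
        Set<String> wanted = new HashSet<>();
        Matcher t = TAG.matcher(interpretation);
        while (t.find()) {
            wanted.add(t.group());
        }

        StringBuffer out = new StringBuffer();
        Matcher g = GROUP.matcher(grammar);
        while (g.find()) {
            String replacement = g.group(); // left unchanged if no tag is wanted
            Matcher p = PAIR.matcher(g.group(1));
            while (p.find()) {
                if (wanted.contains(p.group(2))) {
                    replacement = p.group(1).trim();
                    break;
                }
            }
            // quoteReplacement guards against '$' or '\' in the chosen text.
            g.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        g.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String grammar = "test is [(thirty four) {<number_type_0 words>}( 3  4 ) {<number_type_0 digits>}] "
            + "test [(three) {<number_type_1 words>}( 3 ) {<number_type_1 digits>}] "
            + "[(one) {<number_type_2 words>}( 1 ) {<number_type_2 digits>}]";
        String interpretation = "{<number_type_2 digits> <number_type_1 digits> <number_type_0 words>}";
        System.out.println(resolve(grammar, interpretation)); // prints test is thirty four test 3 1
    }
}
```

A bracket group whose tags do not appear in the interpretation is copied through untouched, which makes missing entries easy to spot in the output.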