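How to split text into numbers and word groups

Time: 2019-06-14 07:30:45

Tags: java regex

Suppose I have a string that contains numbers written with comma separators (such as 14,642) and text: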

  my_string =  "2 Marine Cargo       14,642 10,528       16,016 more text 8,609 argA 2,106 argB"

I want to extract these into an array whose entries are either a "number" or a "word group":

 resultArray = {"2", "Marine Cargo", "14,642", "10,528", "16,016",
                "more text", "8,609", "argA", "2,106", "argB"};

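Note 0: There may be several spaces between entries; these should be ignored.

Note 1: "Marine Cargo" and "more text" are not split into separate strings, because each is a group of words with no number between them. argA and argB are separate entries because a number sits between them.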

4 answers:

Answer 0 (score: 3)

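You can try matching the tokens with this regular expression:

([\d,]+|[a-zA-Z]+ *[a-zA-Z]*) // note the space between + and *
  • [0-9,]+ // matches one or more digits and commas
  • [a-zA-Z]+ *[a-zA-Z]* // matches a word, then any spaces (if present), then another word (if present)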

    String regEx = "[0-9,]+|[a-zA-Z]+ *[a-zA-Z]*";
    

Use it like this:

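import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
  public static void main(String[] args) {
    String input = "2 Marine Cargo       14,642 10,528       16,016 more text 8,609 argA 2,106 argB";
    System.out.println("Return Value :");

    // A token is either a run of digits/commas or one or two words.
    Pattern pattern = Pattern.compile("[0-9,]+|[a-zA-Z]+ *[a-zA-Z]*");

    List<String> result = new ArrayList<>();
    Matcher m = pattern.matcher(input);
    while (m.find()) {
      // trim(): " *" can swallow a trailing space when no second word follows (e.g. "argA ")
      String token = m.group().trim();
      System.out.println(">" + token + "<");
      result.add(token);
    }
  }
}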

Below is the detailed explanation of the regex, auto-generated by https://regex101.com.


1st Alternative [0-9,]+
Match a single character present in the list below [0-9,]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
0-9 a single character in the range between 0 (index 48) and 9 (index 57) (case sensitive)
, matches the character , literally (case sensitive)


2nd Alternative [a-zA-Z]+ *[a-zA-Z]*
Match a single character present in the list below [a-zA-Z]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
a-z a single character in the range between a (index 97) and z (index 122) (case sensitive)
A-Z a single character in the range between A (index 65) and Z (index 90) (case sensitive)
 * matches the character   literally (case sensitive)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
Match a single character present in the list below [a-zA-Z]*
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
a-z a single character in the range between a (index 97) and z (index 122) (case sensitive)
A-Z a single character in the range between A (index 65) and Z (index 90) (case sensitive)

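Answer 1 (score: 1)

If whitespace is your only problem: String#split takes a regular expression as its argument, so you can do: my_list = Arrays.asList(my_string.split("\\s+"));

However, as mentioned in the comments, this does not solve the whole problem: word groups such as "Marine Cargo" are split apart as well.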
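To make that limitation concrete, here is a minimal sketch that splits the question's sample string on runs of whitespace. The class and method names (`WhitespaceSplitDemo`, `splitOnWhitespace`) are invented for illustration only.

```java
import java.util.Arrays;
import java.util.List;

class WhitespaceSplitDemo {
    // Splits on one or more whitespace characters.
    // Word groups are NOT kept together by this approach.
    static List<String> splitOnWhitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String myString = "2 Marine Cargo       14,642 10,528       16,016 more text 8,609 argA 2,106 argB";
        for (String token : splitOnWhitespace(myString)) {
            System.out.println(">" + token + "<");
        }
    }
}
```

For the sample input this yields 12 tokens instead of the desired 10, because "Marine" / "Cargo" and "more" / "text" come back as separate entries. The extra spaces are handled, but regrouping the words still needs a second step.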

Answer 2 (score: 1)

You can do it like this:

    List<String> strings = new ArrayList<>();
    String prev = null;
    for (String w: my_string.split("\\s+")) {
        if (w.matches("\\d+(?:,\\d+)?")) {
            if (prev != null) {
                strings.add(prev);
                prev = null;
            }
            strings.add(w);
        } else if (prev == null) {
            prev = w;
        } else {
            prev += " " + w;
        }
    }
    if (prev != null) {
        strings.add(prev);
    }
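
One detail of the loop above that may matter for other inputs: the pattern `\\d+(?:,\\d+)?` accepts at most one comma group, so a value such as `1,234,567` (not part of the question's sample, used here only as an illustration) would be treated as a word. The sketch below compares it with an assumed variant that allows any number of comma groups. `NumberTokenCheck`, `ONE_GROUP`, `MANY_GROUPS` and `isNumber` are hypothetical names.

```java
import java.util.regex.Pattern;

class NumberTokenCheck {
    // Pattern from the answer: digits, optionally followed by ONE ",digits" group.
    static final Pattern ONE_GROUP = Pattern.compile("\\d+(?:,\\d+)?");
    // Assumed variant (not from the answer): any number of ",digits" groups.
    static final Pattern MANY_GROUPS = Pattern.compile("\\d+(?:,\\d+)*");

    // True when the whole token matches the given pattern.
    static boolean isNumber(Pattern p, String token) {
        return p.matcher(token).matches();
    }

    public static void main(String[] args) {
        for (String token : new String[] {"14,642", "1,234,567", "argA"}) {
            System.out.println(token
                    + " one-group=" + isNumber(ONE_GROUP, token)
                    + " many-groups=" + isNumber(MANY_GROUPS, token));
        }
    }
}
```

For the question's data both patterns behave the same; the variant only matters if larger numbers can appear.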

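Answer 3 (score: 1)

I like Angel Koh's solution and would like to add to it. His solution only matches a non-numeric part that consists of one or two words.

If you also want to capture parts made up of three or more words, the regex has to be modified slightly: ([\d,]+|[a-zA-Z]+(?: *[a-zA-Z])*)
The non-capturing group (?: *[a-zA-Z]) is repeated as many times as needed, so each non-numeric part is captured in full.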
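
As a rough sketch of how the modified regex could be used, the following wraps it in a small helper. `GroupTokenizer` and `tokenize` are names made up for this example, and the three-word group "some more text" in the input is invented to exercise the repeated group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupTokenizer {
    // Regex from this answer: a run of digits/commas, or words separated by spaces.
    private static final Pattern TOKEN =
            Pattern.compile("[\\d,]+|[a-zA-Z]+(?: *[a-zA-Z])*");

    // Collects every match in order of appearance.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "some more text" is a made-up three-word group.
        String input = "2 Marine Cargo   14,642 some more text 8,609 argA 2,106 argB";
        for (String token : tokenize(input)) {
            System.out.println(">" + token + "<");
        }
    }
}
```

Because every repetition of the group must end in a letter, a match cannot end on a space, so "argA" comes back without trailing whitespace. Note that if the words inside a group are separated by several spaces, those spaces are kept as-is in the returned token.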