Question

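How can I split a string into words while keeping certain phrases/terms intact? Right now I have String[] strarr = str.split("\\b");, but I want to change the regex argument to achieve the above. The solution does not have to involve a regular expression.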

For example, if str equals "The city of San Francisco is truly beautiful!" and the term is "San Francisco", how do I split str so that the resulting String[] array looks like this: ["The", "city", "of", "San Francisco", "is", "truly", "beautiful!"]?

After seeing @Radiodef's comment, I don't think I need a regular expression as such. If anyone can help me solve this, it would still be much appreciated!

Answer 1

I know the answers already posted are better, but since I spent some time wrestling with this, I'd like to share a regex answer as well.

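So, a proper regex approach that does this with a capturing group is the following pattern. Note that it relies on capitalization: any run of capitalized words is kept together as one token.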

([A-Z][a-z]*(?:\s?[A-Z][a-z]+)*|[a-z!]+)

Working demo

Match information

MATCH 1
1.  [0-3]   `The`
MATCH 2
1.  [4-8]   `city`
MATCH 3
1.  [9-11]  `of`
MATCH 4
1.  [12-25] `San Francisco`
MATCH 5
1.  [26-28] `is`
MATCH 6
1.  [29-34] `truly`
MATCH 7
1.  [35-45] `beautiful!`

Java code

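import java.util.regex.Matcher;
import java.util.regex.Pattern;

String line = "The city of San Francisco is truly beautiful!";
// A run of capitalized words is matched as one token; otherwise a lowercase word (optionally with '!')
Pattern pattern = Pattern.compile("([A-Z][a-z]*(?:\\s?[A-Z][a-z]+)*|[a-z!]+)");
Matcher matcher = pattern.matcher(line);

// Print every token captured by group 1
while (matcher.find()) {
    System.out.println("Result: " + matcher.group(1));
}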
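
The pattern above depends on capitalization, so it would also glue together unrelated capitalized words. When the terms to keep are known in advance, as in the question, one alternative is to build the pattern from the term list itself. The following is only a sketch under that assumption; the class name PhraseTokenizer and the method tokenize are hypothetical and do not come from the answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PhraseTokenizer {

    // Hypothetical helper: each listed phrase becomes one token,
    // everything else is split on whitespace.
    static List<String> tokenize(String text, List<String> phrases) {
        StringBuilder regex = new StringBuilder();
        for (String phrase : phrases) {
            // Pattern.quote makes the phrase match literally,
            // even if it contains regex metacharacters
            regex.append(Pattern.quote(phrase)).append('|');
        }
        // Fallback alternative: any run of non-whitespace characters
        regex.append("\\S+");

        List<String> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex.toString()).matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [The, city, of, San Francisco, is, truly, beautiful!]
        System.out.println(tokenize("The city of San Francisco is truly beautiful!",
                Arrays.asList("San Francisco", "Los Angeles")));
    }
}
```

Alternatives are tried from left to right, so if one phrase is a prefix of another, the longer phrase should be listed first.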

Answer 2

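This is a really interesting problem. My approach is to write a generic method that detects a phrase of any number of words and returns the result as a plain string array.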

Here is a demo

Here is the method:

// needs: import java.util.ArrayList; import java.util.List;
String[] find(String[] m, String[] c, String catchStr) {

    List<String> list = new ArrayList<>();
    for (int i = 0; i < m.length; i++) {

        // a match is only possible if the whole phrase fits into the remaining words
        // (without this check, m[i + j] could run past the end of the array)
        boolean flag = c.length > 0 && i + c.length <= m.length;

        // compare the phrase word by word, starting at position i
        for (int j = 0; flag && j < c.length; j++) {
            // use equalsIgnoreCase() here if you want a case-insensitive comparison
            if (!m[i + j].equals(c[j])) {
                flag = false;
            }
        }

        if (flag) {
            // add the whole phrase as one element and skip the words it covers
            list.add(catchStr);
            i = i + c.length - 1;
        } else {
            list.add(m[i]);
        }
    }

    // convert the result into a String array
    return list.toArray(new String[0]);
}

You can call this method like this:

String mainStr = "The city of San Francisco is truly beautiful!";
String catchStr = "San Francisco";
String mainStrArr[] = mainStr.split(" ");
String catchStrArr[] = catchStr.split(" ");

String finalArr[] = find(mainStrArr, catchStrArr, catchStr);

Answer 3

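If San Francisco is the only exclusion, this works (although it also refuses to split after any other "San" or before any other "Francisco"):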

    String[] a = str.split("(?<!San)\\s+(?!Francisco)");
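
Because the two lookarounds in this pattern are checked independently, a phrase such as "San Diego" is left unsplit as well. The sketch below shows that behaviour next to a tighter variant that only keeps the whitespace when "San" is on its left and "Francisco" is on its right. The tighter pattern and the class name LookaroundSplit are assumptions made for illustration, not part of the answer.

```java
import java.util.Arrays;

class LookaroundSplit {

    // Pattern from the answer: never splits after "San" and never before "Francisco"
    static final String LOOSE = "(?<!San)\\s+(?!Francisco)";

    // Assumed tighter variant: split unless "San" precedes AND "Francisco" follows.
    // The possessive \\s++ and the (?<!\\s) guard stop the engine from splitting
    // in the middle of a run of several whitespace characters.
    static final String TIGHT = "(?<!San)(?<!\\s)\\s++|\\s++(?!Francisco)";

    public static void main(String[] args) {
        String text = "We left San Diego for San Francisco";

        // prints [We, left, San Diego, for, San Francisco]
        System.out.println(Arrays.toString(text.split(LOOSE)));

        // prints [We, left, San, Diego, for, San Francisco]
        System.out.println(Arrays.toString(text.split(TIGHT)));
    }
}
```

Both patterns still hard-code a single two-word phrase, so they do not scale to a list of exclusions.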

The shortest solution I could find for multiple exclusions is:

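    // needs: java.util.ArrayList, java.util.List, java.util.regex.Matcher, java.util.regex.Pattern
    String str = "The city of San Francisco is truly beautiful!";
    String[] exclusions = { "San Francisco", "Los Angeles" };
    List<String> l = new ArrayList<>();
    // \\S+ rather than \\w+, so that punctuation such as the '!' in "beautiful!" is kept
    Matcher m = Pattern.compile("\\S+").matcher(str);
    while (m.find()) {
        l.add(m.group());
        for (String ex : exclusions) {
            // does an exclusion start at the current word?
            if (str.regionMatches(m.start(), ex, 0, ex.length())) {
                l.set(l.size() - 1, ex);
                // resume matching right after the exclusion, however many words it has
                m.region(m.start() + ex.length(), str.length());
                break;
            }
        }
    }
    System.out.println(l);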

Answer 4

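Find the substring to exclude, then temporarily remove the spaces inside it. Once the whole string has been split, find the previously edited substring and restore its spaces by replacing it with the original string.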

    // let's say:
    // whole = "The city of San Francisco is truly beautiful!",
    // token = "San Francisco"

    // needs: import java.util.Arrays; import java.util.Collections;
    public static String[] excludeString(String whole, String token) {

        // the token with its whitespace removed, e.g. "SanFrancisco"
        String joined = token.replaceAll("\\s+", "");

        // replaces token string "San Francisco" with "SanFrancisco";
        // replace() is literal, so regex metacharacters in the token are harmless
        whole = whole.replace(token, joined);

        // splits whole string using whitespace as delimiter, places tokens in a string array
        String[] strarr = whole.split("\\s+");

        // brings "SanFrancisco" back to "San Francisco" in strarr
        // (Arrays.asList is a view, so the change is written through to the array)
        Collections.replaceAll(Arrays.asList(strarr), joined, token);

        // returns the array of strings
        return strarr;
    }

Sample usage:

    public static void main(String[] args) {

        String[] arr = excludeString("The city of San Francisco is truly beautiful!", "San Francisco");
        System.out.println(Arrays.asList(arr));

    }

Suppose your string is: "The city of San Francisco is truly beautiful!"

The result will be: [The, city, of, San Francisco, is, truly, beautiful!]
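
The remove-and-restore idea can also be sketched with a placeholder character, which extends to several phrases and still works when a phrase is directly followed by punctuation (for example "San Francisco,"). This is a minimal sketch, not the answer's code: the class PlaceholderSplit and the SENTINEL character are hypothetical, and it assumes that the sentinel never occurs in the input and that phrases contain single spaces only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PlaceholderSplit {

    // Hypothetical sentinel (a private-use code point), assumed absent from the input
    private static final char SENTINEL = '\uE000';

    static List<String> split(String text, List<String> phrases) {
        String protectedText = text;
        for (String phrase : phrases) {
            // Hide the spaces inside each phrase; String.replace works on literal text
            protectedText = protectedText.replace(phrase, phrase.replace(' ', SENTINEL));
        }

        List<String> tokens = new ArrayList<>();
        for (String token : protectedText.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                // Turn the sentinel back into a space
                tokens.add(token.replace(SENTINEL, ' '));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [I, love, San Francisco,, really]
        System.out.println(split("I love San Francisco, really",
                Arrays.asList("San Francisco", "Los Angeles")));
    }
}
```

Unlike comparing whole array elements after the split, restoring the sentinel inside each token does not require the token to equal the phrase exactly.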

Split a string on \b, but not on the \b between substrings
