Question

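I have a string-formatting problem that I think is best solved with regular expressions. I am hoping for advice and help on putting the regular expressions together, and on how to keep one from canceling or overriding another.

Here are the requirements:

1) I need to add a space before and after punctuation marks such as . , ; : ! ? - _ ...

The following sentence

"Note: Attention! Would you? Except for information expressly incorporated by reference in this Form 10-K, the registrant's definitive proxy statement is not deemed to be part of this Form 10-K."

would become:

"Attention! Would you? Except for information expressly incorporated by reference in this Form 10-K, the registrant's definitive proxy statement is not deemed to be part of this Form 10-K."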

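2) However, I want to preserve numbers and dollar amounts. For example:

1,000.00 must stay 1,000.00, and a number written as 1.000,00 must likewise stay unchanged, with no spaces added.

The same goes for $1,000.00, which should remain $1,000.00.

What is the simplest way to protect numbers while still making sure that the punctuation marks . , ; : ! ? - _ get a space before and after?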

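3) On top of that, the third requirement is that a run of more than three dots (.....) must be reduced to three (...), while two dots (..) must be reduced to a single dot (.).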

Answer 1

This code is written in C#; I expect it to work the same way in Java.

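// Insert a space between a letter or digit and the punctuation mark that follows it.
string result = Regex.Replace(input, @"([a-zA-Z0-9])(\p{P})", "$1 $2");
// Insert a space between a punctuation mark and the letter or digit that follows it.
result = Regex.Replace(result, @"(\p{P})([a-zA-Z0-9])", "$1 $2");
//result = Regex.Replace(result, @"\s+", " ");
// Undo both spaces when the punctuation sits between two digits (1 , 000 -> 1,000).
result = Regex.Replace(result, @"(\d)\s(\p{P})\s(\d)", "$1$2$3");
// Collapse each pair of dots into a single dot.
result = Regex.Replace(result, @"\.{2}", ".");
// Collapse any remaining run of three or more dots into two.
result = Regex.Replace(result, @"\.{3,}", "..");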

- SJ
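
The first three replacements carry over to Java's String.replaceAll with only the backslashes doubled. The sketch below is one way to check the number case by looking at the string after each step. The class name SpacingSteps and the method trace are made up for this illustration, and the sample input is an assumption rather than text from the question.

```java
import java.util.ArrayList;
import java.util.List;

final class SpacingSteps {
    // Returns the string as it looks after each of the three replacements, in order.
    static List<String> trace(String input) {
        List<String> steps = new ArrayList<>();
        // Step 1: space between a letter or digit and the punctuation after it.
        String s = input.replaceAll("([a-zA-Z0-9])(\\p{P})", "$1 $2");
        steps.add(s);
        // Step 2: space between a punctuation mark and the letter or digit after it.
        s = s.replaceAll("(\\p{P})([a-zA-Z0-9])", "$1 $2");
        steps.add(s);
        // Step 3: remove both spaces again when the punctuation sits between digits.
        s = s.replaceAll("(\\d)\\s(\\p{P})\\s(\\d)", "$1$2$3");
        steps.add(s);
        return steps;
    }

    public static void main(String[] args) {
        // Prints:
        // Total :1 ,000 .00
        // Total : 1 , 000 . 00
        // Total : 1,000.00
        for (String step : trace("Total:1,000.00")) {
            System.out.println(step);
        }
    }
}
```

Because $ is a currency symbol rather than punctuation under \p{P}, an input such as $1,000.00 comes out of the third step unchanged. A hyphen between a digit and a letter, as in 10-K, still ends up with spaces around it, since the third step only restores punctuation that sits between two digits.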

Answer 2

First off, thanks for the help.

    We still have a few issues, though. The solution from PShemo for numbers, which removes the added spaces when the punctuation sits between digits, is right on, so thanks for that.

    But we need something similar for the other situations I describe below.

    The rules for the dots, however, cancel each other out. Replacing a long run of dots with three dots works on its own, but once the spacing replacement runs the result becomes . . .

    The code I have is as follows:

    original = original.replaceAll("([a-zA-Z0-9])(\\p{P})", "$1 $2");
    original = original.replaceAll("(\\p{P})([a-zA-Z0-9])", "$1 $2");
    original = original.replaceAll("(\\d)\\s(\\p{P})\\s(\\d)", "$1$2$3");
    original = original.replaceAll("\\.{3,}", "..");
    original = original.replaceAll("\\.{2}", ".");
    original = original.replaceAll(" %","%");
    original = original.replaceAll(" - ","-");
    original = original.replaceAll(" ' ","'");

    The problems are:

    1) Emails, http links and phone numbers get spaces around @, (, ), :, / and so on.

    So \p{P} is not ideal: a colon should only get spaces when it is not part of an http link. We cannot put spaces around %, - or ' either, hence the last three lines that put them back. What we really want is spaces around sentence-ending marks such as !, ? and the period (when it is not part of an abbreviation or a number), around commas (when they are not part of number formatting), and around a colon when it is not part of an http URL. This is what complicates things.

    2) The goal for the period is to put a space before a period that ends a sentence, giving "This is the end . " rather than "This is the end." But abbreviations like "U.S.A." must not become "U . S . A ."

    3) I want more than three dots (.....) to become ..., and two dots to become one, so ".." becomes ".", but the rules above cancel one another.
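
One way to keep the two dot rules from undoing each other is to make them mutually exclusive, so that the order in which they run no longer matters. The sketch below assumes the intended rule is "four or more dots become three, exactly two dots become one, and a run of exactly three is left alone". The class name DotRuns and the method collapse are hypothetical.

```java
final class DotRuns {
    static String collapse(String s) {
        // A run of four or more dots becomes exactly three.
        s = s.replaceAll("\\.{4,}", "...");
        // Two dots become one, but only when they are not part of a longer run:
        // the lookbehind and lookahead reject a dot on either side.
        s = s.replaceAll("(?<!\\.)\\.{2}(?!\\.)", ".");
        return s;
    }

    public static void main(String[] args) {
        System.out.println(collapse("Wait....."));  // Wait...
        System.out.println(collapse("Hm.."));       // Hm.
        System.out.println(collapse("a...b"));      // a...b
    }
}
```

Running this before the spacing replacements would avoid the ". . ." effect on the dots themselves, although an ellipsis that touches a letter still gets a space next to it unless the spacing rules skip it.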

    So to fix emails (@ and dots) and URLs (:, / and dots), we could add rules like the one for numbers, "(\\d)\\s(\\p{P})\\s(\\d)", "$1$2$3", so that any space that was added gets removed again.
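
An alternative to removing the spaces afterwards is to never add them inside a URL or email address. The sketch below does this in one pass: the pattern first tries to match a protected token and copies it through unchanged, and only otherwise matches a punctuation mark to pad with spaces. The URL and email sub-patterns are deliberately loose assumptions, not validators, and the names ProtectedSpacing and addSpaces are made up. The period is left out of the punctuation set because abbreviations and numbers need their own rule.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ProtectedSpacing {
    // Group 1: a token to leave untouched (rough URL or email shape, an assumption).
    // Group 2: a punctuation mark to pad, with any surrounding whitespace absorbed.
    private static final Pattern TOKEN = Pattern.compile(
            "(https?://\\S+|[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+)"
            + "|\\s*([,;:!?])\\s*");

    static String addSpaces(String text) {
        Matcher m = TOKEN.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = (m.group(1) != null)
                    ? m.group(1)                // protected token: copy as-is
                    : " " + m.group(2) + " ";   // punctuation: one space on each side
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString().trim();
    }

    public static void main(String[] args) {
        // Prints: Mail me : bob@example.com , now !
        System.out.println(addSpaces("Mail me:bob@example.com,now!"));
    }
}
```

One limitation of this sketch is that the URL pattern runs to the next whitespace, so punctuation typed directly after a link is treated as part of the link.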

    According to RFC 2822, the pattern for a valid email address is: "(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"

    Now for phone numbers, you can have the following formats:

    1) ###-###-####
    2) #-###-###-####
    3) ###-####
    4) ##########
    5) #######
    6) (xxx) xxx-xxxx
    7) (xx) xxxx-xxxx

    Plus the conventions listed here: http://en.wikipedia.org/wiki/National_conventions_for_writing_telephone_numbers

    The issue with phone numbers only arises when they contain punctuation such as -, (, ) or +, since that is where we add spaces. Otherwise they are fine.
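
The seven formats listed above can be written as one alternation, which could then serve to recognize a phone number before any spacing is applied. This is only a sketch covering those seven shapes, not the national conventions from the Wikipedia page, and the names PhoneFormats and looksLikePhone are hypothetical.

```java
import java.util.regex.Pattern;

final class PhoneFormats {
    private static final Pattern PHONE = Pattern.compile(
            // ###-####, ###-###-#### and #-###-###-####
            "(?:\\d-)?(?:\\d{3}-)?\\d{3}-\\d{4}"
            // (xxx) xxx-xxxx and (xx) xxxx-xxxx
            + "|\\(\\d{2,3}\\) \\d{3,4}-\\d{4}"
            // ten or seven digits with no separators
            + "|\\d{10}|\\d{7}");

    // True when the whole string has one of the listed shapes.
    static boolean looksLikePhone(String s) {
        return PHONE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikePhone("1-555-123-4567"));  // true
        System.out.println(looksLikePhone("(11) 1234-5678"));  // true
        System.out.println(looksLikePhone("12-34"));           // false
    }
}
```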

    I found this code on Stack Overflow for phone numbers too:

    http://stackoverflow.com/questions/3367843/phone-number-regex-for-multiple-patterns-in-java

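    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Returns 1 if num looks like a valid phone number, 0 otherwise.
    public int Phone(String num)
    {
        if (num == null)
            return 0; // invalid number
        String expression = "^(?=.{7,32}$)(\\(?\\+?[0-9]*\\)?)?[0-9_\\- \\(\\)]*((\\s?x\\s?|ext\\s?|extension\\s?)\\d{1,5}){0,1}$";
        Pattern pattern = Pattern.compile(expression);
        Matcher matcher = pattern.matcher(num);
        int x = 0, y = 0;
        char[] value = num.toCharArray();
        for (int i = 0; i < value.length; i++)
        {
            if (value[i] == '(')
                x++;
            if (value[i] == ')')
            {
                // A ')' at the very end has nothing after it, so the number is rejected.
                if (i + 1 == value.length)
                    return 0; // invalid number
                // Count ')' only when a digit or '-' follows it.
                if ((value[i + 1] >= '0' && value[i + 1] <= '9') || value[i + 1] == '-')
                    y++;
            }
        }
        // Valid when the pattern matches and every '(' is paired with a counted ')'.
        if (matcher.matches() && x == y)
            return 1; // valid number
        else
            return 0; // invalid number
    }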

This one removes dots in acronyms but not in URIs:

http://stackoverflow.com/questions/1279110/whats-the-regex-for-removing-dots-in-acronyms-but-not-in-domain-names

----

http://stackoverflow.com/questions/17098834/split-string-with-dot-while-handling-abbreviations

How about removing the dots that need to disappear with a regex, and then replacing the rest of the dots with spaces? The regex can look like (?<=(^|[.])[\\S&&\\D])[.](?=[\\S&&\\D]([.]|$)).

String[] data = { 
        "Hello.World", 
        "This.Is.A.Test", 
        "The.S.W.A.T.Team",
        "S.w.a.T.", 
        "S.w.a.T.1", 
        "2001.A.Space.Odyssey" };

for (String s : data) {
    System.out.println(s.replaceAll(
            "(?<=(^|[.])[\\S&&\\D])[.](?=[\\S&&\\D]([.]|$))", "")
            .replace('.', ' '));
}
Result:

Hello World
This Is A Test
The SWAT Team
SwaT 
SwaT 1
2001 A Space Odyssey

In the regex I needed to escape the special meaning of the dot character. I could do that with \\. but I prefer [.].

So at the center of the regex we have a literal dot. This dot is surrounded by (?<=...) and (?=...), the parts of the look-around mechanism called look-behind and look-ahead.

A dot that needs to be removed has a dot (or the start of the data, ^) and then a non-whitespace \\S, non-digit \\D character before it, which I can test with (?<=(^|[.])[\\S&&\\D])[.].

Such a dot is also followed by a non-whitespace, non-digit character and then another dot (or the end of the data, $), which can be written as [.](?=[\\S&&\\D]([.]|$)).

Depending on your needs, [\\S&&\\D], which besides letters also matches characters like !@#$%^&*()-_=+..., can be replaced with [a-zA-Z] for English letters only, or \\p{IsAlphabetic} for all letters in Unicode.
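
To see what swapping the character class changes, the same look-around regex can be built from a parameter. This is a small sketch, with the class name AcronymDots and the method collapse invented for the illustration. The letter class is passed in as a regex fragment, and the last sample input is an assumption chosen to show a non-ASCII letter.

```java
final class AcronymDots {
    // letterClass is a regex fragment matching one allowed character,
    // for example "[a-zA-Z]" or "\\p{IsAlphabetic}".
    static String collapse(String s, String letterClass) {
        String regex = "(?<=(^|[.])" + letterClass + ")[.]"
                + "(?=" + letterClass + "([.]|$))";
        // Drop the dots inside acronyms, then turn every remaining dot into a space.
        return s.replaceAll(regex, "").replace('.', ' ');
    }

    public static void main(String[] args) {
        // ASCII letters only: prints "The SWAT Team"
        System.out.println(collapse("The.S.W.A.T.Team", "[a-zA-Z]"));
        // \u00c9 is the letter E with an acute accent: it counts as a letter
        // for \p{IsAlphabetic} but not for [a-zA-Z].
        System.out.println(collapse("\u00c9.U.A.report", "\\p{IsAlphabetic}"));
        System.out.println(collapse("\u00c9.U.A.report", "[a-zA-Z]"));
    }
}
```

With [a-zA-Z] the accented first letter is not treated as part of the acronym, so the dot after it survives the first replacement and becomes a space, while \\p{IsAlphabetic} joins all three letters.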

How to add a space at the end and beginning of punctuation without messing up number notation?
