Java SQL String: removing duplicates in GROUP BY

Time: 2015-09-02 21:09:20

Tags: java sql string

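This is a string-manipulation challenge.

The statements below build the GROUP BY clause of a SQL statement. I want to write a method called removeDuplicates() that removes duplicate items.

E.g.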

// Comma in quotes
String source = "ADDRESS.CITY || ', UK', ADDRESS.CITY || ', US', ADDRESS.CITY || ', UK'";
String expected = "ADDRESS.CITY || ', UK', ADDRESS.CITY || ', US'";
String result = removeDuplicates(source);
assert result.equals(expected);

// Comma in quotes with escaped single quotes
String source = "ADDRESS.CITY || ', UK''s CITY', ADDRESS.CITY || ', US''s CITY', ADDRESS.CITY || ', UK''s CITY'";
String expected = "ADDRESS.CITY || ', UK''s CITY', ADDRESS.CITY || ', US''s CITY'";
String result = removeDuplicates(source);
assert result.equals(expected);

// Comma in parentheses
String source = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, 'YYYY-MM-DD'), NAME, CITY, to_char(DATE, 'YYYY-MM-DD')";
String expected = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, 'YYYY-MM-DD'), CITY";
String result = removeDuplicates(source);
assert result.equals(expected);

// Comma in parentheses with parentheses
String source = "NAME, to_char(DATE, ('YYYY,MM,DD')), to_char(DATE, 'YYYY-MM-DD'), NAME, CITY, to_char(DATE, 'YYYY-MM-DD')";
String expected = "NAME, to_char(DATE, ('YYYY,MM,DD')), to_char(DATE, 'YYYY-MM-DD'), CITY";
String result = removeDuplicates(source);
assert result.equals(expected);

// Combined
String source = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, ('YYYY-MM-DD')), NAME, to_char(DATE, ('YYYY-MM-DD')), CITY || ', UK', CITY || ', US''s CITY', CITY || ', UK'";
String expected = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, ('YYYY-MM-DD')), CITY || ', UK', CITY || ', US''s CITY'";
String result = removeDuplicates(source);
assert result.equals(expected);

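I initially tried to 1) split the string on commas outside quotes (Splitting on comma outside quotes), 2) make the items unique, and 3) join them back together.

However, this does not work when the string contains something like to_char(DATE, 'YYYY-MM-DD'), because the comma after DATE is outside the quotes but inside the parentheses.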
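To illustrate the failure, here is a minimal sketch. It assumes the split was done with the commonly used lookahead regex for commas outside single quotes; the regex and the class name NaiveQuoteSplit are assumptions, not taken from the question.

```java
import java.util.Arrays;

class NaiveQuoteSplit {
    // Splits on commas that are followed by an even number of single quotes,
    // i.e. commas that are not inside a quoted literal.
    // Parentheses are not considered at all.
    static String[] splitOutsideQuotes(String s) {
        return s.split(",(?=(?:[^']*'[^']*')*[^']*$)");
    }

    public static void main(String[] args) {
        String source = "NAME, to_char(DATE, 'YYYY-MM-DD')";
        // The comma after DATE is outside quotes but inside parentheses,
        // so the function call is cut into two pieces.
        System.out.println(Arrays.toString(splitOutsideQuotes(source)));
    }
}
```

The call to to_char ends up spread over two array elements, so any de-duplication done afterwards works on broken fragments.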

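Can anyone come up with something, or suggest a library that would help solve this? Thanks in advance.

Added:

If we do not worry about subqueries, the hard part is splitting the criteria into valid elements. Trimming them and making them unique (ignoring case) is easy to implement.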
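The "easy" part could look roughly like the sketch below, which assumes the elements have already been split correctly. The class and method names are made up for illustration. Note that lower-casing the whole element also folds the case of quoted literals, so ', UK' and ', uk' would count as the same item; whether that is acceptable depends on the data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class CaseInsensitiveDedupe {
    // Trims each element and keeps only the first occurrence,
    // comparing elements case-insensitively. Order is preserved.
    static List<String> dedupeIgnoreCase(List<String> elements) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String element : elements) {
            String trimmed = element.trim();
            if (trimmed.isEmpty()) {
                continue; // skip empty elements
            }
            // Set.add returns false if the key was already present
            if (seen.add(trimmed.toLowerCase(Locale.ROOT))) {
                unique.add(trimmed);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        System.out.println(dedupeIgnoreCase(Arrays.asList("NAME", " name ", "CITY")));
    }
}
```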

For the splitting, I think the following combination should cover all scenarios:

- split by ,
- on each element, ignore checking comma within the first ( and the last )
- on each element, ignore checking comma within the first ' and the last '
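One way to read the rules above is as a single pass over the string that only treats a comma as a separator when it is outside quotes and outside all parentheses. The following is a sketch under that reading, not a full SQL parser: the class TopLevelSplitter and its methods are hypothetical names, an escaped quote ('') is handled by toggling the quote state twice, and subqueries, comments and double-quoted identifiers are not considered.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

class TopLevelSplitter {
    // Splits on commas that are outside single quotes and outside parentheses.
    static List<String> splitTopLevel(String source) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuote = false;
        int depth = 0; // current parenthesis nesting level
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == '\'') {
                // An escaped quote ('') toggles twice, so we stay inside the literal
                inQuote = !inQuote;
            } else if (!inQuote) {
                if (c == '(') {
                    depth++;
                } else if (c == ')' && depth > 0) {
                    depth--;
                } else if (c == ',' && depth == 0) {
                    addIfNotEmpty(parts, current.toString());
                    current.setLength(0);
                    continue; // do not keep the separator itself
                }
            }
            current.append(c);
        }
        addIfNotEmpty(parts, current.toString());
        return parts;
    }

    private static void addIfNotEmpty(List<String> parts, String piece) {
        String trimmed = piece.trim();
        if (!trimmed.isEmpty()) {
            parts.add(trimmed);
        }
    }

    // Split, keep the first occurrence of each element, and join again.
    static String removeDuplicates(String source) {
        return String.join(", ", new LinkedHashSet<>(splitTopLevel(source)));
    }

    public static void main(String[] args) {
        String source = "NAME, to_char(DATE, ('YYYY,MM,DD')), NAME, CITY || ', US''s CITY'";
        System.out.println(removeDuplicates(source));
    }
}
```

LinkedHashSet is used so that the first occurrence of each element keeps its original position. The comparison here is exact (case-sensitive).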

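3 answers:

Answer 0 (score: 2)

Edit: Below is an update. It has been modified to ignore everything between quotes and between parentheses when searching for commas. It is not guaranteed to work for arbitrary SQL, but it passes all the cases described so far.

Edit: Updated the code again to ignore parentheses inside quotes.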

import java.util.ArrayList;
import java.util.Iterator;

public class Main
{
    private static final String GUID = "f61916a6-3859-4cda-ae2f-209ff3802831";

    public static void main(String[] args)
    {
        // Comma in quotes
        String source = "ADDRESS.CITY || ', UK', ADDRESS.CITY || ', US', ADDRESS.CITY || ', UK', to_char(DATE, '(YYYY)MM,DD'), to_char(DATE, '(YYYY)MM,DD')";
        String expected = "ADDRESS.CITY || ', UK', ADDRESS.CITY || ', US', to_char(DATE, '(YYYY)MM,DD')";
        String result = removeDuplicates(source);
        System.out.println(result.equals(expected));

        // Comma in quotes with escaped single quotes
        source = "ADDRESS.CITY || ', UK''s CITY', ADDRESS.CITY || ', US''s CITY', ADDRESS.CITY || ', UK''s CITY'";
        expected = "ADDRESS.CITY || ', UK''s CITY', ADDRESS.CITY || ', US''s CITY'";
        result = removeDuplicates(source);
        System.out.println(result.equals(expected));

        // Comma in parentheses
        source = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, 'YYYY-MM-DD'), NAME, CITY, to_char(DATE, 'YYYY-MM-DD')";
        expected = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, 'YYYY-MM-DD'), CITY";
        result = removeDuplicates(source);
        System.out.println(result.equals(expected));

        // Comma in parentheses with parentheses
        source = "NAME, to_char(DATE, ('YYYY,MM,DD')), to_char(DATE, 'YYYY-MM-DD'), NAME, CITY, to_char(DATE, 'YYYY-MM-DD')";
        expected = "NAME, to_char(DATE, ('YYYY,MM,DD')), to_char(DATE, 'YYYY-MM-DD'), CITY";
        result = removeDuplicates(source);
        System.out.println(result.equals(expected));

        // Combined
        source = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, ('YYYY-MM-DD')), NAME, to_char(DATE, ('YYYY-MM-DD')), CITY || ', UK', CITY || ', US''s CITY', CITY || ', UK'";
        expected = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, ('YYYY-MM-DD')), CITY || ', UK', CITY || ', US''s CITY'";
        result = removeDuplicates(source);
        System.out.println(result.equals(expected));
    }

    private static String removeDuplicates(String source)
    {
        // Replace escaped quotes with a GUID to make it easier to parse
        source = source.replace("''", GUID);

        source = source + ','; // Hacky way to get the last part to show up

        ArrayList<String> elements = new ArrayList<String>();

        ArrayList<Character> charArray = new ArrayList<Character>();

        for (char c : source.toCharArray())
            charArray.add(c);

        Iterator<Character> itr = charArray.iterator();

        // Identify all the elements
        String thusFar = "";
        while (itr.hasNext())
        {
            char next = itr.next();

            if (next == ',')
            {
                thusFar = thusFar.trim();
                if (!elements.contains(thusFar))
                    elements.add(thusFar);
                thusFar = "";
                continue;
            }

            thusFar += next;

            // Ignore anything inside quotes
            if (next == '\'')
            {
                char c;
                while ((c = itr.next()) != '\'')
                {
                    thusFar += c;
                }
                thusFar += c;
                continue;
            }

            // Ignore anything inside parentheses, tracking the nesting depth
            // so that a nested ")" does not end the outer group too early
            if (next == '(')
            {
                int depth = 1;
                while (depth > 0)
                {
                    char c = itr.next();
                    thusFar += c;

                    if (c == '(')
                        depth++;
                    else if (c == ')')
                        depth--;
                    else if (c == '\'')
                    {
                        // Ignore anything inside quotes inside parentheses (including parens)
                        char c2 = itr.next();
                        while (c2 != '\'')
                        {
                            thusFar += c2;
                            c2 = itr.next();
                        }
                        thusFar += c2;
                    }
                }

                continue;
            }
        }

        // Combine all the elements back together
        String result = "";

        for (String s : elements)
            result += s + ", ";

        if (result.length() > 2)
        {
            result = result.substring(0, result.length() - 2);
        }

        // Put the escaped quotes back in
        result = result.replace(GUID, "''");

        return result;
    }
}

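Answer 1 (score: 1)

It is better to use a CSV library; otherwise you have to handle commas inside single or double quotes (which can be nested), escaped quotes and commas, and the unescaping of those escapes yourself.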
https://commons.apache.org/proper/commons-csv/

Regular expressions cannot handle arbitrarily nested structures. In theory it is impossible.

Answer 2 (score: 0)

If you don't have nested functions, you can simply use a regexp to tokenize the string:

/([a-z_]+\([^\(\)]*?\))|([A-Z_]+)/g

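Then remove the duplicates. [a-z_]+ matches the function name, and \([^\(\)]*?\) matches the function arguments: everything between "(" and ")". The last part, ([A-Z_]+), matches upper-case field names.

For the provided example, it produces something like this: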

NAME
to_char(DATE, 'YYYY,MM,DD')
to_char(DATE, 'YYYY-MM-DD')
NAME
CITY
to_char(DATE, 'YYYY-MM-DD')
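
In Java, the tokenize-then-deduplicate idea could be sketched as follows, using the same pattern with java.util.regex. The class name RegexTokenizer and the method uniqueTokens are made up for this illustration. It carries the limitations of the pattern itself: no nested parentheses, and expressions such as ADDRESS.CITY || ', UK' are broken into separate tokens (ADDRESS, CITY, UK), so it only suits inputs like the example above.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTokenizer {
    // Lower-case function call without nested parentheses, or an upper-case field name
    private static final Pattern TOKEN =
            Pattern.compile("([a-z_]+\\([^\\(\\)]*?\\))|([A-Z_]+)");

    // Collects the matched tokens in order of first appearance, without duplicates.
    static Set<String> uniqueTokens(String source) {
        Set<String> tokens = new LinkedHashSet<>();
        Matcher matcher = TOKEN.matcher(source);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String source = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, 'YYYY-MM-DD'), "
                + "NAME, CITY, to_char(DATE, 'YYYY-MM-DD')";
        System.out.println(String.join(", ", uniqueTokens(source)));
    }
}
```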