Algorithm to get distinct records from a Cartesian product

Time: 2018-08-13 10:40:28

Tags: java mysql algorithm duplicates cartesian-product

I have two tables, say A and B. My task is to sync B with A: add a record to B if it exists in A but not in B, and delete a record from B if it exists in B but not in A.

A and B can contain duplicate records, so if a record is duplicated in A, B should hold the same number of duplicates. Sample data in A and B:

      **Table A**                              **Table B**
    id    identifier                      id       identifier
    100   capital                         1001     bat
    201   bat                             1002     bat
    202   bat                             1003     bat
                                          5010     keyboard

To do this, I fetched the records from A and B with an outer join, so my output looks like this:

    A.id  B.id   identifier
    100   null    capital
    201   1001    bat
    201   1002    bat   
    201   1003    bat
    202   1001    bat
    202   1002    bat
    202   1003    bat
    null  5010    keyboard
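
To see where the six "bat" rows come from, here is a minimal in-memory sketch of a full outer join on identifier. The class `OuterJoinSketch`, the `Row` type, and the method `fullOuterJoin` are hypothetical names made up for illustration, and representing each table as a map from id to identifier is an assumption, not the actual loading code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class OuterJoinSketch {

    // Hypothetical row of the joined output; a null id means "no match on that side".
    static class Row {
        final Integer aId;
        final Integer bId;
        final String identifier;

        Row(Integer aId, Integer bId, String identifier) {
            this.aId = aId;
            this.bId = bId;
            this.identifier = identifier;
        }
    }

    // Emulates a full outer join of A and B on identifier.
    // Each map goes from id to identifier (an assumption for this sketch).
    static List<Row> fullOuterJoin(Map<Integer, String> a, Map<Integer, String> b) {
        List<Row> rows = new ArrayList<>();
        for (Map.Entry<Integer, String> ea : a.entrySet()) {
            boolean matched = false;
            for (Map.Entry<Integer, String> eb : b.entrySet()) {
                if (ea.getValue().equals(eb.getValue())) {
                    // Every A row pairs with every B row that shares its identifier.
                    rows.add(new Row(ea.getKey(), eb.getKey(), ea.getValue()));
                    matched = true;
                }
            }
            if (!matched) {
                // A row without any partner in B.
                rows.add(new Row(ea.getKey(), null, ea.getValue()));
            }
        }
        for (Map.Entry<Integer, String> eb : b.entrySet()) {
            if (!a.containsValue(eb.getValue())) {
                // B row without any partner in A.
                rows.add(new Row(null, eb.getKey(), eb.getValue()));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<Integer, String> a = new LinkedHashMap<>();
        a.put(100, "capital");
        a.put(201, "bat");
        a.put(202, "bat");
        Map<Integer, String> b = new LinkedHashMap<>();
        b.put(1001, "bat");
        b.put(1002, "bat");
        b.put(1003, "bat");
        b.put(5010, "keyboard");
        for (Row r : fullOuterJoin(a, b)) {
            System.out.println(r.aId + "\t" + r.bId + "\t" + r.identifier);
        }
    }
}
```

With the sample data, the two "bat" rows in A and the three in B produce 2 x 3 = 6 joined rows, which is the Cartesian product the title refers to.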

In this output, 100 and 5010 are the add and delete candidates respectively, which is easy to figure out.

The problem is working out that 1003 is also a delete candidate, because 201 and 202 are already matched to 1001 and 1002 respectively.
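
The counting argument behind this can be made concrete with a small sketch. It assumes the joined rows for a single identifier are available as `{aId, bId}` pairs and that distinct ids are paired in ascending order. Both are assumptions made for illustration, and `SurplusSketch` and `surplusBIds` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class SurplusSketch {

    // Takes the joined rows of ONE identifier as {aId, bId} pairs (null = no match)
    // and returns the B ids left over once each distinct A id has taken one B id.
    static List<Integer> surplusBIds(List<Integer[]> rowsForIdentifier) {
        TreeSet<Integer> aIds = new TreeSet<>();
        TreeSet<Integer> bIds = new TreeSet<>();
        for (Integer[] row : rowsForIdentifier) {
            if (row[0] != null) {
                aIds.add(row[0]);
            }
            if (row[1] != null) {
                bIds.add(row[1]);
            }
        }
        List<Integer> sortedB = new ArrayList<>(bIds);
        // The first aIds.size() B ids count as matched; anything after that is surplus.
        int matched = Math.min(aIds.size(), sortedB.size());
        return new ArrayList<>(sortedB.subList(matched, sortedB.size()));
    }

    public static void main(String[] args) {
        List<Integer[]> bat = Arrays.asList(
                new Integer[] {201, 1001}, new Integer[] {201, 1002}, new Integer[] {201, 1003},
                new Integer[] {202, 1001}, new Integer[] {202, 1002}, new Integer[] {202, 1003});
        System.out.println(surplusBIds(bat)); // prints [1003]
    }
}
```

Which of the three B ids ends up as the surplus one is arbitrary. Pairing in ascending order simply reproduces the mapping described above.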

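I could do this in the database by following something like "MYSQL: Avoiding cartesian product of repeating records when self-joining", but due to some restrictions I can only load the data in the format above using an outer join. So I need an algorithm in Java to do this. Thanks in advance.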

2 Answers:

Answer 0 (score: 0)

I eventually came up with this algorithm. It is not very clean or clever, but it seems to do the job:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

class SyncAlgorithm {

    static class JoinResult {
        public final Integer aId;
        public final Integer bId;
        public final String identifier;
        public JoinResult(Integer aId, Integer bId, String identifier) {
            this.aId = aId;
            this.bId = bId;
            this.identifier = identifier;
        }
    }

    public static void main(String[] args) {
        List<JoinResult> table = makeTestTable();
        System.out.println("Initial table:");
        printTable(table);
        System.out.println();

        Iterator<JoinResult> iter = table.iterator();
        // A.id values we have seen
        Map<String, Set<Integer>> aSeen = new HashMap<String, Set<Integer>>();
        // A.id values we have used
        Map<String, Set<Integer>> aUsed = new HashMap<String, Set<Integer>>();
        // B.id values we have used
        Map<String, Set<Integer>> bUsed = new HashMap<String, Set<Integer>>();
        // Loop over table to remove unnecessary rows
        while (iter.hasNext()) {
            JoinResult row = iter.next();
            // Make sure sets exist for current identifier
            if (!aSeen.containsKey(row.identifier)) {
                aSeen.put(row.identifier, new HashSet<Integer>());
            }
            if (!aUsed.containsKey(row.identifier)) {
                aUsed.put(row.identifier, new HashSet<Integer>());
            }
            if (!bUsed.containsKey(row.identifier)) {
                bUsed.put(row.identifier, new HashSet<Integer>());
            }
            // If there is no match in A remove
            if (row.aId == null) {
                iter.remove();
            // If both A.id and B.id are not null
            } else if (row.bId != null) {
                // Mark A.id as seen
                aSeen.get(row.identifier).add(row.aId);
                // If A.id or B.id were already used discard row
                if (aUsed.get(row.identifier).contains(row.aId) || bUsed.get(row.identifier).contains(row.bId)) {
                    iter.remove();
                // If both ids are new mark them as used and keep the row
                } else {
                    aUsed.get(row.identifier).add(row.aId);
                    bUsed.get(row.identifier).add(row.bId);
                }
            // If A.id is not null but B.id is null save A.id and keep the row
            } else {
                aSeen.get(row.identifier).add(row.aId);
                aUsed.get(row.identifier).add(row.aId);
            }
        }
        // Add rows for A.id values that have been seen but never used (no B.id was left for them)
        for (Map.Entry<String, Set<Integer>> aSeenEntry : aSeen.entrySet())
        {
            Set<Integer> aSeenId = aSeenEntry.getValue();
            aSeenId.removeAll(aUsed.get(aSeenEntry.getKey()));
            for (Integer aId : aSeenId) {
                table.add(new JoinResult(aId, null, aSeenEntry.getKey()));
            }
        }

        System.out.println("Result table:");
        printTable(table);
    }

    static List<JoinResult> makeTestTable() {
        List<JoinResult> table = new ArrayList<JoinResult>();
        table.add(new JoinResult(100, null, "capital"));
        table.add(new JoinResult(201, 1001, "bat"));
        table.add(new JoinResult(201, 1002, "bat"));
        table.add(new JoinResult(201, 1003, "bat"));
        table.add(new JoinResult(202, 1001, "bat"));
        table.add(new JoinResult(202, 1002, "bat"));
        table.add(new JoinResult(202, 1003, "bat"));
        table.add(new JoinResult(null, 5010, "keyboard"));
        table.add(new JoinResult(501, 3001, "foo"));
        table.add(new JoinResult(502, 3001, "foo"));
        return table;
    }

    static void printTable(List<JoinResult> table) {
        System.out.println("A.id    B.id    identifier");
        for (JoinResult row : table) {
            System.out.printf("%-8d%-8d%s\n", row.aId, row.bId, row.identifier);
        }
    }
}

Answer 1 (score: 0)

Here is how I approached this problem:

  1. Fetch the data from Table A and Table B.

  2. Group the data of Table A and Table B by identifier, using:

    Map<String, SameBucketObject>

where the key is the identifier and SameBucketObject is:

    class SameBucketObject{
      private List<String> idsOfA;
      private List<String> idsOfB;
    // getter, setters, addToList statements  
    }

Basically, I group all the elements of Table A and Table B by identifier.

  3. In each bucket, compare the number of A ids (idsOfA) with the number of B ids (idsOfB):
    sizeOf(idsOfA) < sizeOf(idsOfB) -> add sizeOf(idsOfB) - sizeOf(idsOfA) elements, with ids taken from idsOfB, from Table B to Table A.
    sizeOf(idsOfA) > sizeOf(idsOfB) -> delete sizeOf(idsOfA) - sizeOf(idsOfB) elements from the end of A.
    sizeOf(idsOfA) = sizeOf(idsOfB) -> no action.
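
The bucket comparison could be sketched roughly as follows. The direction is an assumption: the sketch treats A as the source of truth, as the question does, so surplus B ids are deleted and missing copies are reported as additions to B. The rules listed above are phrased the other way round (modifying A), which amounts to swapping the roles of the two tables. `BucketSyncSketch`, `Bucket`, `Plan`, `group`, and `plan` are hypothetical names, ids are kept as `Integer` for brevity, and each table is assumed to be a map from id to identifier.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BucketSyncSketch {

    // One bucket per identifier, playing the role of SameBucketObject.
    static class Bucket {
        final List<Integer> idsOfA = new ArrayList<>();
        final List<Integer> idsOfB = new ArrayList<>();
    }

    // Hypothetical result holder: what has to change in B.
    static class Plan {
        final List<Integer> aIdsToCopyIntoB = new ArrayList<>();
        final List<Integer> bIdsToDelete = new ArrayList<>();
    }

    // Step 2: group the ids of both tables by identifier.
    static Map<String, Bucket> group(Map<Integer, String> a, Map<Integer, String> b) {
        Map<String, Bucket> buckets = new LinkedHashMap<>();
        for (Map.Entry<Integer, String> e : a.entrySet()) {
            buckets.computeIfAbsent(e.getValue(), k -> new Bucket()).idsOfA.add(e.getKey());
        }
        for (Map.Entry<Integer, String> e : b.entrySet()) {
            buckets.computeIfAbsent(e.getValue(), k -> new Bucket()).idsOfB.add(e.getKey());
        }
        return buckets;
    }

    // Step 3: compare the sizes in each bucket and record the difference.
    static Plan plan(Map<String, Bucket> buckets) {
        Plan plan = new Plan();
        for (Bucket bucket : buckets.values()) {
            int sizeA = bucket.idsOfA.size();
            int sizeB = bucket.idsOfB.size();
            if (sizeA > sizeB) {
                // B lacks (sizeA - sizeB) copies; take them from the tail of idsOfA.
                plan.aIdsToCopyIntoB.addAll(bucket.idsOfA.subList(sizeB, sizeA));
            } else if (sizeB > sizeA) {
                // B has (sizeB - sizeA) copies too many; drop them from the tail of idsOfB.
                plan.bIdsToDelete.addAll(bucket.idsOfB.subList(sizeA, sizeB));
            }
            // Equal sizes: no action.
        }
        return plan;
    }

    public static void main(String[] args) {
        Map<Integer, String> a = new LinkedHashMap<>();
        a.put(100, "capital");
        a.put(201, "bat");
        a.put(202, "bat");
        Map<Integer, String> b = new LinkedHashMap<>();
        b.put(1001, "bat");
        b.put(1002, "bat");
        b.put(1003, "bat");
        b.put(5010, "keyboard");
        Plan plan = plan(group(a, b));
        System.out.println("copy into B: " + plan.aIdsToCopyIntoB); // prints copy into B: [100]
        System.out.println("delete from B: " + plan.bIdsToDelete);  // prints delete from B: [1003, 5010]
    }
}
```

Because only the counts per identifier matter, this version never needs the joined rows, so the Cartesian product does not arise.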

This approach does not need any extra space beyond the map of buckets itself.