Question

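I have two tables, say A and B. My task is to synchronize B with A: add a record to B if it exists in A but not in B, and delete a record from B if it exists in B but not in A.

A and B can contain duplicate records, so if a record is duplicated in A, B should hold the same duplicates. Sample data in A and B: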

      **Table A**                              **Table B**
    id    identifier                      id       identifier
    100   capital                         1001     bat
    201   bat                             1002     bat
    202   bat                             1003     bat
                                          5010     keyboard

To do this, I fetched the records from A and B with an outer join, so my output looks like:

    A.id  B.id   identifier
    100   null    capital
    201   1001    bat
    201   1002    bat   
    201   1003    bat
    202   1001    bat
    202   1002    bat
    202   1003    bat
    null  5010    keyboard

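In the case above, 100 and 5010 are the add and delete candidates respectively, which is easy to work out.

The problem is discovering that 1003 is also a delete candidate, because 201 and 202 are already mapped to 1001 and 1002 respectively.

I could do this in the database, with something like MYSQL: Avoiding cartesian product of repeating records when self-joining, but due to some restrictions I can only load the data in the format above using the outer join. So I need an algorithm in Java to do this. Thanks in advance.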

Answer 1

I eventually came up with this algorithm. It is not very clean or clever, but it seems to do the job:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    class SyncAlgorithm {

        // One row of the outer join; aId or bId is null when there is no match.
        static class JoinResult {
            public final Integer aId;
            public final Integer bId;
            public final String identifier;
            public JoinResult(Integer aId, Integer bId, String identifier) {
                this.aId = aId;
                this.bId = bId;
                this.identifier = identifier;
            }
        }

        public static void main(String[] args) {
            List<JoinResult> table = makeTestTable();
            System.out.println("Initial table:");
            printTable(table);
            System.out.println();

            Iterator<JoinResult> iter = table.iterator();
            // A.id values we have seen, per identifier
            Map<String, Set<Integer>> aSeen = new HashMap<>();
            // A.id values we have used, per identifier
            Map<String, Set<Integer>> aUsed = new HashMap<>();
            // B.id values we have used, per identifier
            Map<String, Set<Integer>> bUsed = new HashMap<>();
            // Loop over the table to remove unnecessary rows
            while (iter.hasNext()) {
                JoinResult row = iter.next();
                // Make sure the sets exist for the current identifier
                if (!aSeen.containsKey(row.identifier)) {
                    aSeen.put(row.identifier, new HashSet<Integer>());
                }
                if (!aUsed.containsKey(row.identifier)) {
                    aUsed.put(row.identifier, new HashSet<Integer>());
                }
                if (!bUsed.containsKey(row.identifier)) {
                    bUsed.put(row.identifier, new HashSet<Integer>());
                }
                // If there is no match in A, remove the row
                if (row.aId == null) {
                    iter.remove();
                // If both A.id and B.id are not null
                } else if (row.bId != null) {
                    // Mark A.id as seen
                    aSeen.get(row.identifier).add(row.aId);
                    // If A.id or B.id was already used, discard the row
                    if (aUsed.get(row.identifier).contains(row.aId) || bUsed.get(row.identifier).contains(row.bId)) {
                        iter.remove();
                    // If both ids are new, mark them as used and keep the row
                    } else {
                        aUsed.get(row.identifier).add(row.aId);
                        bUsed.get(row.identifier).add(row.bId);
                    }
                // If A.id is not null but B.id is null, save A.id and keep the row
                } else {
                    aSeen.get(row.identifier).add(row.aId);
                    aUsed.get(row.identifier).add(row.aId);
                }
            }
            // Add back A.id values that have been seen but not used
            for (Map.Entry<String, Set<Integer>> aSeenEntry : aSeen.entrySet()) {
                Set<Integer> aSeenId = aSeenEntry.getValue();
                aSeenId.removeAll(aUsed.get(aSeenEntry.getKey()));
                for (Integer aId : aSeenId) {
                    table.add(new JoinResult(aId, null, aSeenEntry.getKey()));
                }
            }

            System.out.println("Result table:");
            printTable(table);
        }

        static List<JoinResult> makeTestTable() {
            List<JoinResult> table = new ArrayList<>();
            table.add(new JoinResult(100, null, "capital"));
            table.add(new JoinResult(201, 1001, "bat"));
            table.add(new JoinResult(201, 1002, "bat"));
            table.add(new JoinResult(201, 1003, "bat"));
            table.add(new JoinResult(202, 1001, "bat"));
            table.add(new JoinResult(202, 1002, "bat"));
            table.add(new JoinResult(202, 1003, "bat"));
            table.add(new JoinResult(null, 5010, "keyboard"));
            table.add(new JoinResult(501, 3001, "foo"));
            table.add(new JoinResult(502, 3001, "foo"));
            return table;
        }

        static void printTable(List<JoinResult> table) {
            System.out.println("A.id    B.id    identifier");
            for (JoinResult row : table) {
                // A null id is printed as "null" by %d
                System.out.printf("%-8d%-8d%s%n", row.aId, row.bId, row.identifier);
            }
        }
    }
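
The table produced above keeps the matched pairs and the add candidates (rows whose B.id is null), but the B.id values to delete no longer appear in it. One possible way to recover them, sketched below, is to collect every B.id from a copy of the original join and subtract the ones that survive in the kept rows. The names `DeleteCandidates`, `Row` and `findDeletes` are hypothetical, and the sketch assumes B.id is unique across table B.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class DeleteCandidates {

    // Hypothetical row type mirroring JoinResult; either id may be null.
    static class Row {
        final Integer aId;
        final Integer bId;
        final String identifier;
        Row(Integer aId, Integer bId, String identifier) {
            this.aId = aId;
            this.bId = bId;
            this.identifier = identifier;
        }
    }

    // B.id values present in the original join but absent from the kept rows.
    // Assumes B.id is unique across table B.
    static List<Integer> findDeletes(List<Row> original, List<Row> kept) {
        Set<Integer> candidates = new LinkedHashSet<>();
        for (Row row : original) {
            if (row.bId != null) {
                candidates.add(row.bId);
            }
        }
        for (Row row : kept) {
            if (row.bId != null) {
                candidates.remove(row.bId);
            }
        }
        return new ArrayList<>(candidates);
    }

    public static void main(String[] args) {
        List<Row> original = Arrays.asList(
            new Row(100, null, "capital"),
            new Row(201, 1001, "bat"),
            new Row(201, 1002, "bat"),
            new Row(201, 1003, "bat"),
            new Row(202, 1001, "bat"),
            new Row(202, 1002, "bat"),
            new Row(202, 1003, "bat"),
            new Row(null, 5010, "keyboard"));
        // Rows that the pairing step keeps for this sample
        List<Row> kept = Arrays.asList(
            new Row(100, null, "capital"),
            new Row(201, 1001, "bat"),
            new Row(202, 1002, "bat"));
        System.out.println(findDeletes(original, kept)); // prints [1003, 5010]
    }
}
```

Because the loop in `SyncAlgorithm` removes rows from the list in place, a copy of the join output would have to be taken before running it.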

Answer 2

Here is how I approached the problem:

Fetch the data from table A and table B.
Group the data of table A and table B by identifier, using:

    Map<String, SameBucketObject>

where the key is the identifier and SameBucketObject is:

    class SameBucketObject {
        private List<String> idsOfA;
        private List<String> idsOfB;
        // getters, setters, addToList methods
    }

Basically, I group all the elements of table A and table B by identifier.
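
As a rough sketch of this grouping step, the code below builds the buckets from outer-join rows like those in the question instead of from two separate queries; that input choice is an assumption made here because the asker can only load the joined data. `BucketGrouping`, `Row`, `Bucket` and `group` are hypothetical names, and `Bucket` stands in for `SameBucketObject` with integer ids held in sets so that ids repeated by the join are counted once.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class BucketGrouping {

    // Hypothetical join row; either id may be null.
    static class Row {
        final Integer aId;
        final Integer bId;
        final String identifier;
        Row(Integer aId, Integer bId, String identifier) {
            this.aId = aId;
            this.bId = bId;
            this.identifier = identifier;
        }
    }

    // Hypothetical stand-in for SameBucketObject. Sets drop the repeated ids
    // that the join's cartesian product introduces.
    static class Bucket {
        final Set<Integer> idsOfA = new LinkedHashSet<>();
        final Set<Integer> idsOfB = new LinkedHashSet<>();
    }

    // Groups the ids of both tables by identifier, in first-seen order.
    static Map<String, Bucket> group(List<Row> rows) {
        Map<String, Bucket> buckets = new LinkedHashMap<>();
        for (Row row : rows) {
            Bucket bucket = buckets.computeIfAbsent(row.identifier, key -> new Bucket());
            if (row.aId != null) {
                bucket.idsOfA.add(row.aId);
            }
            if (row.bId != null) {
                bucket.idsOfB.add(row.bId);
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
            new Row(100, null, "capital"),
            new Row(201, 1001, "bat"),
            new Row(201, 1002, "bat"),
            new Row(201, 1003, "bat"),
            new Row(202, 1001, "bat"),
            new Row(202, 1002, "bat"),
            new Row(202, 1003, "bat"),
            new Row(null, 5010, "keyboard"));
        for (Map.Entry<String, Bucket> entry : group(rows).entrySet()) {
            Bucket bucket = entry.getValue();
            System.out.println(entry.getKey() + " A=" + bucket.idsOfA + " B=" + bucket.idsOfB);
        }
        // prints:
        // capital A=[100] B=[]
        // bat A=[201, 202] B=[1001, 1002, 1003]
        // keyboard A=[] B=[5010]
    }
}
```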

In each bucket, compare the number of elements in idsOfA with the number of elements in idsOfB, and:

    sizeOf(idsOfA) > sizeOf(idsOfB) -> add sizeOf(idsOfA) - sizeOf(idsOfB) records with this identifier to Table B.
    sizeOf(idsOfA) < sizeOf(idsOfB) -> delete sizeOf(idsOfB) - sizeOf(idsOfA) records from Table B, starting from the last.
    sizeOf(idsOfA) = sizeOf(idsOfB) -> no action.

This approach needs no extra space beyond the map of buckets.
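
For the question's goal of making B match A, the per-bucket size comparison might be sketched as follows. The names `BucketPlan`, `surplus`, `toAdd` and `toDelete` are hypothetical. Picking the surplus ids from the end of each list is just one reasonable choice, since records sharing an identifier are interchangeable here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class BucketPlan {

    // Ids beyond the first otherCount entries; empty when there is no surplus.
    static List<Integer> surplus(List<Integer> ids, int otherCount) {
        int from = Math.min(otherCount, ids.size());
        return new ArrayList<>(ids.subList(from, ids.size()));
    }

    // A.id values whose records still need a counterpart in B.
    static List<Integer> toAdd(List<Integer> idsOfA, List<Integer> idsOfB) {
        return surplus(idsOfA, idsOfB.size());
    }

    // B.id values to delete because A has fewer records with this identifier.
    static List<Integer> toDelete(List<Integer> idsOfA, List<Integer> idsOfB) {
        return surplus(idsOfB, idsOfA.size());
    }

    public static void main(String[] args) {
        // "bat": two records in A, three in B
        System.out.println(toDelete(Arrays.asList(201, 202), Arrays.asList(1001, 1002, 1003))); // prints [1003]
        // "capital": one record in A, none in B
        System.out.println(toAdd(Arrays.asList(100), Collections.<Integer>emptyList())); // prints [100]
        // "keyboard": none in A, one in B
        System.out.println(toDelete(Collections.<Integer>emptyList(), Arrays.asList(5010))); // prints [5010]
    }
}
```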
