Question

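Given 4 arrays of N strings each, I'm trying to find, in O(N * log(N)) time, whether some string is common to at least 3 of the arrays, and if so return the lexicographically first such string.

I tried creating an array of size 4 * N and adding the items from the 4 arrays to it, removing the duplicates within each array as I went. Then I ran a quicksort on the big array to find the first triplicate, if any.

Does anyone know a better solution?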

Answer 1

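Here we have 4 arrays of N strings, where N = 5. My approach to get all the triplicates is:

1. Take the 1st string of the 1st array and add it to a Map<String, Set<Integer>>, with the array number in the Set (I'm using hashes because insertion and search are O(1));
2. Take the 1st string of the 2nd array and add it to the Map<String, Set<Integer>>, with the array number in the Set;
3. Repeat step 2, but using the 3rd and 4th arrays instead of the 2nd;
4. Repeat steps 1, 2 and 3, but using the 2nd string instead of the 1st;
5. Repeat steps 1, 2 and 3, but using the 3rd string instead of the 1st;
6. Etc.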

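In the worst case we perform N * 4 map operations, which is well within O(N * log(N)); sorting the resulting triplicates afterwards costs at most O(N * log(N)) as well.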

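import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class Main {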

    public static void main(String[] args) {
        String[][] arr = { 
                { "xxx", "xxx", "xxx", "zzz", "aaa" }, 
                { "ttt", "bbb", "ddd", "iii", "aaa" },
                { "sss", "kkk", "uuu", "rrr", "zzz" }, 
                { "iii", "zzz", "lll", "hhh", "aaa" }};

        List<String> triplicates = findTriplicates(arr);

        Collections.sort(triplicates);

        for (String word : triplicates)
            System.out.println(word);
    }

    public static List<String> findTriplicates(String[][] arr) {
        Map<String, Set<Integer>> map = new HashMap<String, Set<Integer>>();
        List<String> triplicates = new ArrayList<String>();
        final int N = arr[0].length; // assumes all arrays have the same length (5 in this example)
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < 4; j++) {
                String str = arr[j][i];
                if (map.containsKey(str)) {
                    map.get(str).add(j);
                    if (map.get(str).size() == 3)
                        triplicates.add(str);
                } else {
                    Set<Integer> set = new HashSet<Integer>();
                    set.add(j);
                    map.put(str, set);
                }
            }
        }
        return triplicates;
    }
}

Output:

aaa
zzz

Answer 2

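OK, if you don't care about the constant factors, it can be done in O(N), where N is the total size of the strings. For practical purposes it is important to distinguish the number of strings from their total size. (At the end I propose an alternative version that is O(N log N), where N is the number of string comparisons.)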

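You need one map string -> int for count, and one temporary already_counted map string -> bool. The latter is basically a set. The important thing is to use the unordered/hash versions of the associative containers, to avoid the log factors.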

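For each array, for each element, check whether the current element is in the already_counted set. If it is not, do count[current_string]++ and add the element to already_counted. Before moving on to the next array, empty the already_counted set.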

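Now you basically need a min search. Go through each element of count, and if an element has a value of 3 or more, compare the key associated with it to your current min. Voila: min is the lowest string that occurs in 3 or more arrays.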
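The counting and min search described above can be sketched in Java roughly as follows. This is only an illustration under my own assumptions: the class name `HashCountSketch` and the method `firstInAtLeastThree` are made up for this example, `HashMap`/`HashSet` stand in for the unordered containers, and the sample data is borrowed from Answer 1.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class HashCountSketch {
    // Returns the smallest string that occurs in at least 3 of the arrays,
    // or null if there is no such string. (Hypothetical helper name.)
    static String firstInAtLeastThree(String[][] arrays) {
        Map<String, Integer> count = new HashMap<>();
        Set<String> alreadyCounted = new HashSet<>();
        for (String[] array : arrays) {
            for (String s : array) {
                // Set.add returns false when s was already seen in this array,
                // so each string is counted at most once per array.
                if (alreadyCounted.add(s)) {
                    count.merge(s, 1, Integer::sum);
                }
            }
            alreadyCounted.clear(); // reset before moving on to the next array
        }
        // Plain min search over the keys whose count is 3 or more.
        String min = null;
        for (Map.Entry<String, Integer> e : count.entrySet()) {
            if (e.getValue() >= 3 && (min == null || e.getKey().compareTo(min) < 0)) {
                min = e.getKey();
            }
        }
        return min;
    }

    public static void main(String[] args) {
        String[][] arr = {
                { "xxx", "xxx", "xxx", "zzz", "aaa" },
                { "ttt", "bbb", "ddd", "iii", "aaa" },
                { "sss", "kkk", "uuu", "rrr", "zzz" },
                { "iii", "zzz", "lll", "hhh", "aaa" }};
        System.out.println(firstInAtLeastThree(arr)); // prints aaa
    }
}
```

Note that "xxx" appears three times in the first array but is counted only once, which is exactly what the already_counted set is for.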

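You don't need the N log N factor, because you don't need all the triplicates, so no sorting or ordered data structure is required. You have O(3*N) (again, N is the total size of all the strings). This is an overestimate; further down I give a more detailed estimate.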

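Now, the caveat is that this method is based on string hashing, which is O(S), where S is the size of the string, and it is done twice in order to deal with per-array repetitions. So, alternatively, it might be faster, at least in a C++ implementation, to actually use the ordered versions of the containers. There are a few reasons for this:

Comparing strings might be faster than hashing them. If the strings differ, you get the result of the comparison relatively quickly, whereas with hashing you always traverse the whole string, and hashing is the more complicated operation.
They are contiguous in memory, hence cache friendly.
Hashing also has its own problems, such as rehashing.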

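If the number of strings is not big, or if their size is very big, I'd place my bet on the ordered versions. Also, if count is ordered you get an edge when finding the least element, because it is the first one with count >= 3, although in the worst case you'll have tons of a* strings with count 1 and a z with count 3.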
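As a rough illustration of the ordered variant, here is a Java sketch that uses `TreeMap` and `TreeSet` as stand-ins for the ordered containers. The answer talks about C++, so this is only an analogy under my own assumptions; the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class OrderedCountSketch {
    // Returns the smallest string that occurs in at least 3 of the arrays,
    // or null if there is none. Only string comparisons are used, no hashing.
    static String firstInAtLeastThree(String[][] arrays) {
        TreeMap<String, Integer> count = new TreeMap<>();
        for (String[] array : arrays) {
            Set<String> alreadyCounted = new TreeSet<>(); // fresh set per array
            for (String s : array) {
                if (alreadyCounted.add(s)) {
                    count.merge(s, 1, Integer::sum);
                }
            }
        }
        // A TreeMap iterates its keys in ascending order, so the first key
        // with a count of 3 or more is the answer and we can stop there.
        for (Map.Entry<String, Integer> e : count.entrySet()) {
            if (e.getValue() >= 3) {
                return e.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String[][] arr = {
                { "xxx", "xxx", "xxx", "zzz", "aaa" },
                { "ttt", "bbb", "ddd", "iii", "aaa" },
                { "sss", "kkk", "uuu", "rrr", "zzz" },
                { "iii", "zzz", "lll", "hhh", "aaa" }};
        System.out.println(firstInAtLeastThree(arr)); // prints aaa
    }
}
```

The early return is the "edge" mentioned above: the scan stops at the first qualifying key, but it can still walk through many keys with a count of 1 before it gets there.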

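So, to sum it all up, let n be the number of string comparisons and N the number of string hashes.

The hash-based approach is O(2 N + n), and with some trickery you can bring the constant factor down by 1, e.g. by reusing the hash value for both count and already_counted, or by combining both data structures, for example via a bitset. So you'd get O(N + n).
The approach based purely on string comparisons would be O(2 n log n + n). Maybe hints can somehow be used to bring the constant down, but I'm not sure.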
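One possible reading of the "combine both data structures via a bitset" trick is sketched below in Java: a single map stores, for each string, a bitmask of the arrays it has appeared in, so no separate already_counted set is needed. This is my own interpretation; the names are hypothetical, and an `int` mask assumes at most 32 arrays.

```java
import java.util.HashMap;
import java.util.Map;

class BitmaskSketch {
    // Returns the smallest string that occurs in at least 3 of the arrays,
    // or null if there is none. Assumes arrays.length <= 32 (int bitmask).
    static String firstInAtLeastThree(String[][] arrays) {
        Map<String, Integer> seenIn = new HashMap<>();
        for (int i = 0; i < arrays.length; i++) {
            int bit = 1 << i; // bit i marks "present in array i"
            for (String s : arrays[i]) {
                // One map operation per string; OR-ing the same bit twice is a
                // no-op, so duplicates inside one array are harmless.
                seenIn.merge(s, bit, (a, b) -> a | b);
            }
        }
        String min = null;
        for (Map.Entry<String, Integer> e : seenIn.entrySet()) {
            // The number of set bits is the number of distinct arrays.
            if (Integer.bitCount(e.getValue()) >= 3
                    && (min == null || e.getKey().compareTo(min) < 0)) {
                min = e.getKey();
            }
        }
        return min;
    }

    public static void main(String[] args) {
        String[][] arr = {
                { "xxx", "xxx", "xxx", "zzz", "aaa" },
                { "ttt", "bbb", "ddd", "iii", "aaa" },
                { "sss", "kkk", "uuu", "rrr", "zzz" },
                { "iii", "zzz", "lll", "hhh", "aaa" }};
        System.out.println(firstInAtLeastThree(arr)); // prints aaa
    }
}
```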

Answer 3

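You can do this in O(n log n), with constant extra space. After sorting the individual lists, it's a standard k-way merge problem. If the individual lists can contain duplicates, then you'll need to remove the duplicates during the sort.

So, assuming you have list1, list2, list3, and list4: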

Sort the individual lists, removing duplicates
Create a priority queue (min-heap) of length 4
Add the first item from each list to the heap
last-key = ""
last-key-count = 0
while the heap is not empty
    remove the smallest item from the min-heap
    add to the heap the next item (if any) from the list that contained the item you just removed.
    if the item matches last-key
        increment last-key-count
        if last-key-count == 3 then
            output last-key
            exit done
    else
        last-key-count = 1
        last-key = item key
end while
// if you get here, there was no triplicate item
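
For reference, the merge above might look roughly like this in Java. It is only a sketch under my own assumptions: the names are hypothetical, and for brevity it deduplicates each list by copying it through a `TreeSet`, so this particular version does not keep the constant-extra-space property.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;
import java.util.TreeSet;

class KWayMergeSketch {
    // Returns the smallest string that occurs in at least 3 of the arrays,
    // or null if there is none.
    static String firstInAtLeastThree(String[][] arrays) {
        // Sort each list and drop duplicates (a TreeSet does both at once).
        List<List<String>> lists = new ArrayList<>();
        for (String[] a : arrays) {
            lists.add(new ArrayList<>(new TreeSet<>(Arrays.asList(a))));
        }
        // Heap entries are {list index, position}, ordered by the string they refer to.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                (x, y) -> lists.get(x[0]).get(x[1]).compareTo(lists.get(y[0]).get(y[1])));
        for (int i = 0; i < lists.size(); i++) {
            if (!lists.get(i).isEmpty()) {
                heap.add(new int[] { i, 0 });
            }
        }
        String lastKey = null;
        int lastKeyCount = 0;
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            String item = lists.get(top[0]).get(top[1]);
            // Refill the heap from the list the removed item came from, if it has more.
            if (top[1] + 1 < lists.get(top[0]).size()) {
                heap.add(new int[] { top[0], top[1] + 1 });
            }
            if (item.equals(lastKey)) {
                if (++lastKeyCount == 3) {
                    return lastKey; // items come out in sorted order, so this is the smallest
                }
            } else {
                lastKey = item;
                lastKeyCount = 1;
            }
        }
        return null; // no triplicate item
    }

    public static void main(String[] args) {
        String[][] arr = {
                { "xxx", "xxx", "xxx", "zzz", "aaa" },
                { "ttt", "bbb", "ddd", "iii", "aaa" },
                { "sss", "kkk", "uuu", "rrr", "zzz" },
                { "iii", "zzz", "lll", "hhh", "aaa" }};
        System.out.println(firstInAtLeastThree(arr)); // prints aaa
    }
}
```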

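Another way to do it is to combine all the lists into a single list, then sort it. You can then go through it sequentially to find the first triplicate. Again, if the individual lists can contain duplicates, you should remove them before you combine the lists.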

combined = list1.concat(list2.concat(list3.concat(list4)))
sort combined
last-key = ""
last-key-count = 0
for i = 0 to combined.length-1
    if combined[i] == last-key
        last-key-count++
        if last-key-count == 3
            output last-key
            exit done
    else
        last-key = combined[i]
        last-key-count = 1
end for
// if you get here, no triplicate was found
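
A minimal Java sketch of this combine-and-sort variant, under the assumption that per-list duplicates are dropped with a `HashSet` before combining (class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;

class SortAndScanSketch {
    // Returns the smallest string that occurs in at least 3 of the arrays,
    // or null if there is none.
    static String firstInAtLeastThree(String[][] arrays) {
        List<String> combined = new ArrayList<>();
        for (String[] a : arrays) {
            // Remove duplicates within each list before combining.
            combined.addAll(new HashSet<>(Arrays.asList(a)));
        }
        Collections.sort(combined);
        String lastKey = null;
        int lastKeyCount = 0;
        for (String s : combined) {
            if (s.equals(lastKey)) {
                if (++lastKeyCount == 3) {
                    return lastKey; // first run of three equal items in sorted order
                }
            } else {
                lastKey = s;
                lastKeyCount = 1;
            }
        }
        return null; // no triplicate was found
    }

    public static void main(String[] args) {
        String[][] arr = {
                { "xxx", "xxx", "xxx", "zzz", "aaa" },
                { "ttt", "bbb", "ddd", "iii", "aaa" },
                { "sss", "kkk", "uuu", "rrr", "zzz" },
                { "iii", "zzz", "lll", "hhh", "aaa" }};
        System.out.println(firstInAtLeastThree(arr)); // prints aaa
    }
}
```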

Answer 4

This can be solved in O(N) using a Trie.

Loop over the 4 lists one by one, and for each list insert its strings into the Trie.

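When you insert a string s from list L, count it at most once per list, so that repeated copies of s inside L do not inflate its counter. If the counter is >= 3 and s is lexicographically smaller than the current answer, update the answer.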

Here is sample C++ code; to test it, you can input 4 lists of strings, each containing 5 strings. http://ideone.com/fTmKgJ

#include<bits/stdc++.h>
using namespace std;

vector<vector<string>> lists;
string ans = "";

struct TrieNode
{
    TrieNode* l[128];
    int n;
    TrieNode()
    {
        memset(l, 0, sizeof(TrieNode*) * 128);
        n = 0;
    }
} *root = new TrieNode();

void add(string s, int listID)
{
    TrieNode* p = root;
    for (auto x: s)
    {
        if (!p->l[x]) p->l[x] = new TrieNode();
        p = p->l[x];
    }
    p->n |= (1<<listID);
    if(__builtin_popcount(p->n) >= 3 && (ans == "" || s < ans)) ans = s;
}

int main() {
    for(int i=0; i<4;i++){
        string s;
        vector<string> v;
        for(int i=0; i<5; i++){
            cin >> s;
            v.push_back(s);
        }
        lists.push_back(v);
    }

    for(int i=0; i<4;i++){
        for(auto s: lists[i]){
            add(s, i);
        }
    }

    if(ans == "") cout << "NO ANSWER" << endl;
    else cout << ans << endl;
    return 0;
}
