Question

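I am facing a challenge. Let me describe the problem that is bothering me:

There are two dictionaries, say D1 and D2. Most of the time they have the same keys, but this is not guaranteed. The two dictionaries can be represented as follows:

D1 = {["R1", 0.7], ["R2", 0.73], ["R3", 1.5], ["R4", 2.5], ["R5", 0.12], ["R6", 1.9], ["R7", 9.8], ["R8", 6.5], ["R9", 7.2], ["R10", 5.6]};

D2 = {["R1", 0.7], ["R2", 0.8], ["R3", 1.5], ["R4", 3.1], ["R5", 0.10], ["R6", 2.0], ["R7", 8.0], ["R8", 1.0], ["R9", 0.0], ["R10", 5.6], ["R11", 6.23]};

In these dictionaries the keys are strings and the values are floats. Physically, they are snapshots of a system at two different times. D1 is older than D2. I need to sort each dictionary independently by value, in ascending order. Once that is done, the dictionaries become:

D1 = {["R5", 0.12], ["R1", 0.7], ["R2", 0.73], ["R3", 1.5], ["R6", 1.9], ["R4", 2.5], ["R10", 5.6], ["R8", 6.5], ["R9", 7.2], ["R7", 9.8]};

and

D2 = {["R9", 0.0], ["R5", 0.10], ["R1", 0.7], ["R2", 0.8], ["R8", 1.0], ["R3", 1.5], ["R6", 2.0], ["R4", 3.1], ["R10", 5.6], ["R11", 6.23], ["R7", 8.0]};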

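Here the order of the elements in the sorted dictionary D1 is taken as the reference. Each element of D1 is linked to the next element in D1. The goal is to identify the elements of D2 that, after sorting, have broken the sequence in which they appear in the reference dictionary D1. While identifying such elements, additions to D2 (keys that are present in D2 but not in D1) and deletions from D1 (keys that are present in D1 but not in D2) are ignored, i.e. they should not be highlighted in the result.

For example, continuing with the data listed above, the elements that break the sequence in D2 with reference to D1 (ignoring additions and deletions) are:

Breakers = {["R9", 0.0], ["R8", 1.0]}, because R9 has jumped from index 8 in the sorted D1 to index 0 in the sorted D2. Similarly, R8 has jumped from index 7 in the sorted D1 to index 4 in the sorted D2 (all indices are 0-based).

Note: ["R11", 6.23] is not expected to appear in the list of breakers, because it is an addition to D2.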
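To make the example concrete, here is a minimal sketch of the sorting step and the index comparison described above. It is written in Python only for brevity, since the language is left open; the helper names `sorted_keys` and `index_shifts` are made up for this illustration.

```python
# Sample data from the question.
D1 = {"R1": 0.7, "R2": 0.73, "R3": 1.5, "R4": 2.5, "R5": 0.12,
      "R6": 1.9, "R7": 9.8, "R8": 6.5, "R9": 7.2, "R10": 5.6}
D2 = {"R1": 0.7, "R2": 0.8, "R3": 1.5, "R4": 3.1, "R5": 0.10,
      "R6": 2.0, "R7": 8.0, "R8": 1.0, "R9": 0.0, "R10": 5.6, "R11": 6.23}


def sorted_keys(d):
    """Return the keys of d ordered by ascending value."""
    return [k for k, _ in sorted(d.items(), key=lambda kv: kv[1])]


def index_shifts(d1, d2):
    """Map every key present in both dictionaries to
    (index in sorted d1, index in sorted d2). Keys that exist in only
    one dictionary (additions and deletions) are left out."""
    pos1 = {k: i for i, k in enumerate(sorted_keys(d1))}
    pos2 = {k: i for i, k in enumerate(sorted_keys(d2))}
    return {k: (pos1[k], pos2[k]) for k in pos1 if k in pos2}


shifts = index_shifts(D1, D2)
print(shifts["R9"])  # (8, 0)
print(shifts["R8"])  # (7, 4)
print(shifts["R5"])  # (0, 1)
```

Note that a changed index alone does not make an element a breaker: R5 moves from index 0 to index 1 only because R9 jumped in front of it, yet only R9 and R8 are expected in the result.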

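Please suggest an algorithm that implements this optimally, because it has to run on data extracted from a database of 3,256,190 records.

The programming language is not a concern; if someone guides me on the logic, I can take on the task of implementing it in any language.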

Answer 1

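I came up with this algorithm in C#. It works well for your sample data. I also tested it with 3,000,000 random values (so a lot of breakers were detected), and it finished in 3.2 seconds on my laptop (Intel Core i3 2.1GHz, 64-bit).

I first put your data into temporary dictionaries so that I could copy and paste your values before moving them into the lists. Your application would of course put them into the lists directly.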

using System;
using System.Collections.Generic;

class Program
{
    struct SingleValue
    {
        public string Key;
        public float Value;
        public SingleValue(string key, float value)
        {
            Key = key;
            Value = value;
        }
        public override string ToString()
        {
            return string.Format("{0}={1}", Key, Value);
        }
    }

    static void Main(string[] args)
    {
        List<SingleValue> D1 = new List<SingleValue>();
        HashSet<string> D1keys = new HashSet<string>();
        List<SingleValue> D2 = new List<SingleValue>();
#if !LARGETEST
        Dictionary<string, double> D1input = new Dictionary<string, double>() { { "R1", 0.7 }, { "R2", 0.73 }, { "R3", 1.5 }, { "R4", 2.5 }, { "R5", 0.12 }, { "R6", 1.9 }, { "R7", 9.8 }, { "R8", 6.5 }, { "R9", 7.2 }, { "R10", 5.6 } };
        Dictionary<string, double> D2input = new Dictionary<string, double>() { { "R1", 0.7 }, { "R2", 0.8 }, { "R3", 1.5 }, { "R4", 3.1 }, { "R5", 0.10 }, { "R6", 2.0 }, { "R7", 8.0 }, { "R8", 1.0 }, { "R9", 0.0 }, { "R10", 5.6 }, { "R11", 6.23 } };

        // You should put your values directly into these lists... I converted them from a Dictionary so I didn't have to retype your input values :)
        foreach (KeyValuePair<string, double> kvp in D1input)
        {
            D1.Add(new SingleValue(kvp.Key, (float)kvp.Value));
            D1keys.Add(kvp.Key);
        }
        foreach (KeyValuePair<string, double> kvp in D2input)
            D2.Add(new SingleValue(kvp.Key, (float)kvp.Value));
#else
        Random ran = new Random();
        for (int i = 0; i < 3000000; i++)
        {
            D1.Add(new SingleValue(i.ToString(), (float)ran.NextDouble()));
            D1keys.Add(i.ToString());
            D2.Add(new SingleValue(i.ToString(), (float)ran.NextDouble()));
        }
#endif

        // Sort the lists
        D1.Sort(delegate(SingleValue x, SingleValue y)
        {
            if (y.Value > x.Value)
                return -1;
            else if (y.Value < x.Value)
                return 1;
            return 0;
        });
        D2.Sort(delegate(SingleValue x, SingleValue y)
        {
            if (y.Value > x.Value)
                return -1;
            else if (y.Value < x.Value)
                return 1;
            return 0;
        });

        int start = Environment.TickCount;

        Dictionary<string, float> breakers = new Dictionary<string, float>();
        List<SingleValue> additions = new List<SingleValue>();

        // Walk through D1
        IEnumerator<SingleValue> i1 = D1.GetEnumerator();
        IEnumerator<SingleValue> i2 = D2.GetEnumerator();

        while (i1.MoveNext() && i2.MoveNext())
        {
            while (breakers.ContainsKey(i1.Current.Key))
            {
                if (!i1.MoveNext())
                    break;
            }

            while (i1.Current.Key != i2.Current.Key)
            {
                if (D1keys.Contains(i2.Current.Key))
                    breakers.Add(i2.Current.Key, i2.Current.Value);
                else
                    additions.Add(i2.Current);
                if (!i2.MoveNext())
                    break;
            }
        }

        int duration = Environment.TickCount - start;
        Console.WriteLine("Lookup took {0}ms", duration);
        Console.ReadKey();
    }
}
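For readers who do not use C#, the walk over the two sorted lists can be sketched in Python roughly as follows. This is a simplified port rather than an exact translation: it uses list indices instead of enumerators and simply stops once the D1 list is exhausted. The function name `find_breakers` is made up for this sketch. Like the walk it mirrors, it does not special-case keys that exist only in D1, so such keys would have to be filtered out beforehand.

```python
def find_breakers(d1_sorted, d2_sorted):
    """d1_sorted and d2_sorted are lists of (key, value) pairs, each already
    sorted by ascending value. Returns (breakers, additions)."""
    d1_keys = {k for k, _ in d1_sorted}
    breakers = {}      # key -> value in D2
    additions = []     # (key, value) pairs that exist only in D2
    i = j = 0
    n1, n2 = len(d1_sorted), len(d2_sorted)
    while i < n1 and j < n2:
        # Skip reference entries that were already flagged as breakers.
        while i < n1 and d1_sorted[i][0] in breakers:
            i += 1
        if i == n1:
            break
        # Advance through D2 until it lines up with the current D1 key.
        while j < n2 and d2_sorted[j][0] != d1_sorted[i][0]:
            key, value = d2_sorted[j]
            if key in d1_keys:
                breakers[key] = value
            else:
                additions.append((key, value))
            j += 1
        i += 1
        j += 1
    return breakers, additions


D1 = {"R1": 0.7, "R2": 0.73, "R3": 1.5, "R4": 2.5, "R5": 0.12,
      "R6": 1.9, "R7": 9.8, "R8": 6.5, "R9": 7.2, "R10": 5.6}
D2 = {"R1": 0.7, "R2": 0.8, "R3": 1.5, "R4": 3.1, "R5": 0.10,
      "R6": 2.0, "R7": 8.0, "R8": 1.0, "R9": 0.0, "R10": 5.6, "R11": 6.23}

by_value = lambda kv: kv[1]
breakers, additions = find_breakers(sorted(D1.items(), key=by_value),
                                    sorted(D2.items(), key=by_value))
print(breakers)   # {'R9': 0.0, 'R8': 1.0}
print(additions)  # [('R11', 6.23)]
```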


Answer 2

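If you could delete those items from D2 before sorting, it would be easy, right? You said you cannot delete data. However, you can create an additional data structure that simulates such a deletion (for example, add a "deleted" bit to the items or, if you cannot change their type, create a set of "deleted" items). Then run the simple algorithm, but make sure it ignores the "deleted" items.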
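A minimal Python sketch of the "set of deleted items" idea, assuming the items to hide are exactly the keys that appear in only one of the two dictionaries. The helper names `ignored_keys` and `visible_order` are hypothetical, and which "simple algorithm" runs afterwards is left open here.

```python
def ignored_keys(d1, d2):
    """Keys present in only one dictionary (additions and deletions).
    This set plays the role of the 'deleted' marker."""
    return set(d1) ^ set(d2)


def visible_order(d, ignored):
    """Keys of d by ascending value, skipping the 'deleted' ones.
    The dictionary itself is never modified."""
    return [k for k in sorted(d, key=d.get) if k not in ignored]


D1 = {"R1": 0.7, "R2": 0.73, "R3": 1.5, "R4": 2.5, "R5": 0.12,
      "R6": 1.9, "R7": 9.8, "R8": 6.5, "R9": 7.2, "R10": 5.6}
D2 = {"R1": 0.7, "R2": 0.8, "R3": 1.5, "R4": 3.1, "R5": 0.10,
      "R6": 2.0, "R7": 8.0, "R8": 1.0, "R9": 0.0, "R10": 5.6, "R11": 6.23}

ignored = ignored_keys(D1, D2)
print(ignored)                     # {'R11'}
print(visible_order(D2, ignored))  # R11 no longer appears in the order
```

Because the marker lives in a separate set, both snapshots stay intact and the same set can be reused for D1 and D2.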

Answer 3

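I've been thinking about this. Since you mentioned Levenshtein distance, I assume you want the set of elements which, when moved from their position in D2 to some other position in D2, turn D2 into D1 with the smallest number of moves (ignoring elements that do not exist in both sequences).

I wrote a greedy algorithm that may be good enough for your needs, but it will not necessarily give the optimal result in all cases. Honestly, I'm not sure, and I may come back later (at the weekend at the earliest) to check whether it is correct. However, if you really need to do this on sequences of 3 million elements, I believe no algorithm that does a good job here will be fast enough, because I cannot see an O(n) algorithm that does a good job and does not fail even on some trivial inputs.

The algorithm tries moving each element to its intended position and, after each move, computes the sum of errors (the distance of every element from its original position). The element whose move results in the lowest error sum is declared a breaker and is moved. This process is repeated until the sequence is back to D1.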
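The greedy step described above can be sketched compactly in Python. This is an illustrative re-statement, not a replacement for the C# listing further down; the name `greedy_breakers` is made up, and both inputs are assumed to contain exactly the same keys (additions and deletions already removed).

```python
def greedy_breakers(ref, seq):
    """Repeatedly move the element whose relocation to its position in ref
    yields the lowest total displacement, recording it as a breaker,
    until seq matches ref."""
    target = {k: i for i, k in enumerate(ref)}
    seq = list(seq)  # work on a copy
    breakers = []

    def total_error(s):
        # Sum of distances between each element's index and its index in ref.
        return sum(abs(i - target[k]) for i, k in enumerate(s))

    while True:
        best_error, best_from = None, None
        for src, key in enumerate(seq):
            dst = target[key]
            if src == dst:
                continue
            # Trial move: take the element out and re-insert it at dst.
            trial = seq[:src] + seq[src + 1:]
            trial.insert(dst, key)
            err = total_error(trial)
            if best_error is None or err < best_error:
                best_error, best_from = err, src
        if best_from is None:  # every element is at its reference index
            break
        key = seq.pop(best_from)
        seq.insert(target[key], key)
        if key not in breakers:
            breakers.append(key)
    return breakers


ref = ["R5", "R1", "R2", "R3", "R6", "R4", "R10", "R8", "R9", "R7"]
seq = ["R9", "R5", "R1", "R2", "R8", "R3", "R6", "R4", "R10", "R7"]
print(greedy_breakers(ref, seq))  # ['R9', 'R8']
```

Each pass of the outer loop rebuilds a trial list and recomputes the error for every candidate, which is where the high polynomial cost comes from.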

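I think it has O(n^3) complexity, although elements sometimes need to be moved more than once, so it might be O(n^4) in the worst case. I'm not sure, but over 1 million random examples with 50 elements each, the maximum number of times the outer loop ran was 51 (n^4 would mean it could be 2500, and I can hardly have been lucky in all million tests). The code works with keys only, not values. That is because the values are irrelevant at this step, so there is no need to store them.

EDIT: I wrote a counterexample generator for this, and indeed the result is not always optimal. The more breakers there are, the less likely it is to find the optimal solution. For example, with 1000 elements moved around at random, where the optimal solution is at most 50, it typically finds a set of 55-60 breakers.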

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace Breakers
{
    class Program
    {
        static void Main(string[] args)
        {
            //test case 1
            //List<string> L1 = new List<string> { "R5", "R1", "R2", "R3", "R6", "R4", "R10", "R8", "R9", "R7" };
            //List<string> L2 = new List<string> { "R9", "R5", "R1", "R2", "R8", "R3", "R6", "R4", "R10", "R11", "R7" };
            //GetBreakers<string>(L1, L2);

            //test case 2
            //List<string> L1 = new List<string> { "R5", "R1", "R2", "R3", "R6", "R4", "R10", "R8", "R9", "R7" };
            //List<string> L2 = new List<string> { "R5", "R9", "R1", "R6", "R2", "R3", "R4", "R10", "R8", "R7" };
            //GetBreakers<string>(L1, L2);

            //test case 3
            List<int> L1 = new List<int>();
            List<int> L2 = new List<int>();
            Random r = new Random();
            int n = 100;
            for (int i = 0; i < n; i++)
            {
                L1.Add(i);
                L2.Add(i);
            }
            for (int i = 0; i < 5; i++) // number of random moves, this is the upper bound of the optimal solution
            {
                int a = r.Next() % n;
                int b = r.Next() % n;
                if (a == b)
                {
                    i--;
                    continue;
                }
                int x = L2[a];
                Console.WriteLine(x);
                L2.RemoveAt(a);
                L2.Insert(b, x);
            }
            for (int i = 0; i < L2.Count; i++) Console.Write(L2[i] + " "); // separate the numbers so the sequence is readable
            Console.WriteLine();
            GetBreakers<int>(L1, L2);
        }

        static void GetBreakers<T>(List<T> L1, List<T> L2)
        {
            Dictionary<T, int> Appearances = new Dictionary<T, int>();
            for (int i = 0; i < L1.Count; i++) Appearances[L1[i]] = 1;
            for (int i = 0; i < L2.Count; i++) if (Appearances.ContainsKey(L2[i])) Appearances[L2[i]] = 2;
            for (int i = L1.Count - 1; i >= 0; i--) if (!(Appearances.ContainsKey(L1[i]) && Appearances[L1[i]] == 2)) L1.RemoveAt(i);
            for (int i = L2.Count - 1; i >= 0; i--) if (!(Appearances.ContainsKey(L2[i]) && Appearances[L2[i]] == 2)) L2.RemoveAt(i);
            Dictionary<T, int> IndInL1 = new Dictionary<T, int>();
            for (int i = 0; i < L1.Count; i++) IndInL1[L1[i]] = i;

            Dictionary<T, int> Breakers = new Dictionary<T, int>();

            int steps = 0;
            int me = 0;
            while (true)
            {
                steps++;
                int minError = int.MaxValue;
                int minErrorIndex = -1;

                for (int from = 0; from < L2.Count; from++)
                {
                    T x = L2[from];
                    int to = IndInL1[x];
                    if (from == to) continue;

                    L2.RemoveAt(from);
                    L2.Insert(to, x);

                    int error = 0;
                    for (int i = 0; i < L2.Count; i++)
                        error += Math.Abs((i - IndInL1[L2[i]]));

                    L2.RemoveAt(to);
                    L2.Insert(from, x);

                    if (error < minError)
                    {
                        minError = error;
                        minErrorIndex = from;
                    }
                }

                if (minErrorIndex == -1) break;

                T breaker = L2[minErrorIndex];
                int breakerOriginalPosition = IndInL1[breaker];

                L2.RemoveAt(minErrorIndex);
                L2.Insert(breakerOriginalPosition, breaker);

                Breakers[breaker] = 1;

                me = minError;
            }
            Console.WriteLine("Breakers: " + Breakers.Count + " Steps: " + steps);
            foreach (KeyValuePair<T, int> p in Breakers)
                Console.WriteLine(p.Key);
            Console.ReadLine();
        }
    }
}
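As a point of comparison, and explicitly not the algorithm used in this answer: if "fewest moves" is taken to mean the fewest elements that must be taken out and re-inserted, a standard result is that the elements to move are those outside a longest increasing subsequence of the reference positions, which can be found in O(n log n). The Python sketch below assumes both lists hold the same keys; `lis_breakers` is a hypothetical name. When several longest subsequences exist, it returns just one valid choice.

```python
from bisect import bisect_left


def lis_breakers(ref, seq):
    """Return the keys of seq that lie outside one longest subsequence
    that is already in ref order (patience-sorting LIS with back-pointers)."""
    target = {k: i for i, k in enumerate(ref)}
    ranks = [target[k] for k in seq]

    tails = []      # tails[l]: smallest rank ending an increasing run of length l + 1
    tails_idx = []  # index in seq where that rank sits
    prev = [-1] * len(seq)  # back-pointers used to rebuild the subsequence

    for i, r in enumerate(ranks):
        pos = bisect_left(tails, r)
        if pos == len(tails):
            tails.append(r)
            tails_idx.append(i)
        else:
            tails[pos] = r
            tails_idx[pos] = i
        prev[i] = tails_idx[pos - 1] if pos > 0 else -1

    # Walk the back-pointers to collect the indices that can stay in place.
    keep = set()
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        keep.add(i)
        i = prev[i]
    return [k for i, k in enumerate(seq) if i not in keep]


ref = ["R5", "R1", "R2", "R3", "R6", "R4", "R10", "R8", "R9", "R7"]
seq = ["R9", "R5", "R1", "R2", "R8", "R3", "R6", "R4", "R10", "R7"]
print(lis_breakers(ref, seq))  # ['R9', 'R8']
```

On the question's sample this gives the same two breakers as the greedy approach, without repeatedly rebuilding the list.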

Sort two dictionaries and find the differences in the index positions of items in the sorted lists
