Question

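I'm trying to build an efficient algorithm that can process thousands of rows of data containing customer zip codes. I then want to cross-check those zip codes against groupings of about 1000 zip codes each, but I have around 100 columns of 1000 zip codes. Many of these zip codes are consecutive numbers, but there are also plenty of random zip codes mixed in. So what I'd like to do is group the consecutive zip codes together, so that I can check whether a zip code falls within a range instead of checking it against every single zip code.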

Sample data -

This should be grouped as follows:

{ 90001-90010, 90012, 90022, 90031-90034, 90041 }

Here is my idea for the algorithm:

public struct gRange {
   public int start, end;

   // a single zip code is stored as a range with start == end
   public gRange(int a, int b) {
      start = a;
      end = b;
   }
}

// only groups correctly if zips is sorted in ascending order
static List<gRange> groupZips(string[] zips){
    List<gRange> zipList = new List<gRange>();
    int currZip, prevZip, startRange, endRange;
    startRange = 0;

    bool inRange = false;

    for(int i = 1; i < zips.Length; i++) {
        currZip = Convert.ToInt32(zips[i]);
        prevZip = Convert.ToInt32(zips[i-1]);

        if(currZip - prevZip == 1 && inRange == false) {
            inRange = true;
            startRange = prevZip;
            continue;
        }
        else if(currZip - prevZip == 1 && inRange == true) continue;
        else if(currZip - prevZip != 1 && inRange == true) {
            inRange = false;
            endRange = prevZip;
            zipList.Add(new gRange(startRange, endRange));
            continue;
        }
        else if(currZip - prevZip != 1 && inRange == false) {
            zipList.Add(new gRange(prevZip, prevZip));
        }
        //not sure how to handle the last case when i == zips.Length-1
    }
    return zipList;
}

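So far I'm not sure how to handle the last case, and looking at this algorithm, it doesn't strike me as efficient. Is there a better/easier way to group a set of numbers like this?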

Answer 1

Here is an O(n) solution that works even if your zip codes are not guaranteed to be in order.

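If you need the output groupings to be sorted, you can't do better than O(n*log(n)), because something has to be sorted. But if grouping the zip codes is your only concern and the groups don't need to be in order, then I'd use an algorithm like this. It makes good use of a HashSet, a Dictionary, and a DoublyLinkedList. To my knowledge this algorithm is O(n), because I believe HashSet.Add() and HashSet.Contains() run in constant time on average.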

Here's a working dotnetfiddle

// I'm assuming zipcodes are ints... convert if desired
// jumbled up your sample data to show that the code would still work
var zipcodes = new List<int>
{
    90012,
    90033,
    90009,
    90001,
    90005,
    90004,
    90041,
    90008,
    90007,
    90031,
    90010,
    90002,
    90003,
    90034,
    90032,
    90006,
    90022,
};

// facilitate constant-time lookups of whether zipcodes are in your set
var zipHashSet = new HashSet<int>();

// lookup zipcode -> linked list node to remove item in constant time from the linked list
var nodeDictionary = new Dictionary<int, DoublyLinkedListNode<int>>();

// linked list for iterating and grouping your zip codes in linear time
// (DoublyLinkedList<T> is a custom type, not part of the .NET base class library)
var zipLinkedList = new DoublyLinkedList<int>();

// initialize our datastructures from the initial list
foreach (int zipcode in zipcodes)
{
    zipLinkedList.Add(zipcode);
    zipHashSet.Add(zipcode);
    nodeDictionary[zipcode] = zipLinkedList.Last;
}

// object to store the groupings (ex: "90001-90010", "90022")
var groupings = new HashSet<string>();

// iterate through the linked list, but skip nodes if we group it with a zip code
// that we found on a previous iteration of the loop
var node = zipLinkedList.First;
while (node != null)
{
    var bottomZipCode = node.Element;
    var topZipCode = bottomZipCode;

    // find the lowest zip code in this group
    while (zipHashSet.Contains(bottomZipCode - 1))
    {
        var nodeToDel = nodeDictionary[bottomZipCode - 1];

        // delete node from linked list so we don't observe any node more than once
        if (nodeToDel.Previous != null)
        {
            nodeToDel.Previous.Next = nodeToDel.Next;
        }
        if (nodeToDel.Next != null)
        {
            nodeToDel.Next.Previous = nodeToDel.Previous;
        }
        // see if previous zip code is in our group, too
        bottomZipCode--;
    }
    // get string version zip code bottom of the range
    var bottom = bottomZipCode.ToString();

    // find the highest zip code in this group
    while (zipHashSet.Contains(topZipCode + 1))
    {
        var nodeToDel = nodeDictionary[topZipCode + 1];

        // delete node from linked list so we don't observe any node more than once
        if (nodeToDel.Previous != null)
        {
            nodeToDel.Previous.Next = nodeToDel.Next;
        }
        if (nodeToDel.Next != null)
        {
            nodeToDel.Next.Previous = nodeToDel.Previous;
        }

        // see if next zip code is in our group, too
        topZipCode++;
    }

    // get string version zip code top of the range
    var top = topZipCode.ToString();

    // add grouping in correct format
    if (top == bottom)
    {
        groupings.Add(bottom);
    }
    else
    {
        groupings.Add(bottom + "-" + top);
    }

    // onward!
    node = node.Next;
}


// print results
foreach (var grouping in groupings)
{
    Console.WriteLine(grouping);
}

Note: a small refactor of the common linked-list node-deletion logic is in order.
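
The same linear-time idea can be written without the linked list and node dictionary. The following is a minimal sketch, not the code from this answer: it keeps only the HashSet and starts a group at a zip code only when its predecessor is absent, so each run is walked once. The name GroupUnsorted and the tuple result type are hypothetical, and the snippet assumes a C# version that supports top-level statements (.NET 6 or later). The groups come out in no particular order.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: groups consecutive zip codes in expected O(n) time
// using only a HashSet. The returned groups are NOT sorted.
static List<(int Start, int End)> GroupUnsorted(IEnumerable<int> zipcodes)
{
    var zipSet = new HashSet<int>(zipcodes);
    var result = new List<(int Start, int End)>();

    foreach (int zip in zipSet)
    {
        // Skip anything that is not the lowest member of its run.
        if (zipSet.Contains(zip - 1)) continue;

        // Walk upward until the run ends.
        int top = zip;
        while (zipSet.Contains(top + 1)) top++;

        result.Add((zip, top));
    }
    return result;
}

var jumbled = new[] { 90012, 90033, 90009, 90001, 90005, 90004, 90041, 90008, 90007,
                      90031, 90010, 90002, 90003, 90034, 90032, 90006, 90022 };

foreach (var (low, high) in GroupUnsorted(jumbled))
{
    Console.WriteLine(low == high ? $"{low}" : $"{low}-{high}");
}
```

Every zip code is visited once by the outer loop, and the inner loop only runs from the bottom of a run, so the total work stays proportional to the number of zip codes.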

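If sorting is required

The O(n*log(n)) algorithm is much simpler, because once you sort the input list you can form the groups in a single pass over the list, with no extra data structures.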
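
As a rough sketch of that sort-then-scan approach (the helper name GroupSorted and the tuple result are assumptions, not code from this answer, and top-level statements from .NET 6 or later are assumed), the final run is simply closed after the loop ends:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: sorts the zip codes, then forms the groups in one pass.
static List<(int Start, int End)> GroupSorted(IEnumerable<int> zipcodes)
{
    var ordered = zipcodes.Distinct().OrderBy(z => z).ToList();
    var result = new List<(int Start, int End)>();
    if (ordered.Count == 0) return result;

    int runStart = ordered[0];
    int previous = ordered[0];

    for (int i = 1; i < ordered.Count; i++)
    {
        if (ordered[i] != previous + 1)
        {
            // Gap found: close the current run and start a new one.
            result.Add((runStart, previous));
            runStart = ordered[i];
        }
        previous = ordered[i];
    }

    // The last run is still open when the loop ends, so close it here.
    result.Add((runStart, previous));
    return result;
}

var sampleZips = new[] { 90012, 90033, 90009, 90001, 90005, 90004, 90041, 90008, 90007,
                         90031, 90010, 90002, 90003, 90034, 90032, 90006, 90022 };

Console.WriteLine(string.Join(", ",
    GroupSorted(sampleZips).Select(r => r.Start == r.End ? $"{r.Start}" : $"{r.Start}-{r.End}")));
// prints: 90001-90010, 90012, 90022, 90031-90034, 90041
```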

Answer 2

I believe you're overthinking this. Just using Linq against an IEnumerable, you can search 80,000+ records in less than 1/10 of a second.

I used the free CSV zip code list: http://federalgovernmentzipcodes.us/free-zipcode-database.csv

using System;
using System.IO;
using System.Collections.Generic;
using System.Data;
using System.Data.OleDb;
using System.Linq;
using System.Text;

namespace ZipCodeSearchTest
{
    struct zipCodeEntry
    {
        public string ZipCode { get; set; }
        public string City { get; set; }
    }
    class Program
    {
        static void Main(string[] args)
        {
            List<zipCodeEntry> zipCodes = new List<zipCodeEntry>();

            string dataFileName = "free-zipcode-database.csv";
            using (FileStream fs = new FileStream(dataFileName, FileMode.Open, FileAccess.Read))
            using (StreamReader sr = new StreamReader(fs))
                while (!sr.EndOfStream)
                {
                    string line = sr.ReadLine();
                    string[] lineVals = line.Split(',');
                    zipCodes.Add(new zipCodeEntry { ZipCode = lineVals[1].Trim(' ', '\"'), City = lineVals[3].Trim(' ', '\"') });
                }

            bool terminate = false;
            while (!terminate)
            {
                Console.WriteLine("Enter zip code:");
                var userEntry = Console.ReadLine();
                if (userEntry.ToLower() == "x" || userEntry.ToLower() == "q")
                    terminate = true;
                else
                {
                    DateTime dtStart = DateTime.Now;
                    foreach (var arrayVal in zipCodes.Where(z => z.ZipCode == userEntry.PadLeft(5, '0')))
                        Console.WriteLine(string.Format("ZipCode: {0}", arrayVal.ZipCode).PadRight(20, ' ') + string.Format("City: {0}", arrayVal.City));
                    DateTime dtStop = DateTime.Now;
                    Console.WriteLine();
                    Console.WriteLine("Lookup time: {0}", dtStop.Subtract(dtStart).ToString());
                    Console.WriteLine("\n\n");
                }
            }
        }
    }
}

Answer 3

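In this particular case, a hash will most likely be faster. However, a range-based solution will use a lot less memory, so it would be appropriate if your lists were very large (and I'm not convinced that there are enough possible zipcodes for any list of zipcodes to be large enough).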
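
For comparison, the hash-based check could look something like this minimal sketch. The variable name column and its contents are made up for illustration; in the question's setting there would be one such set per column of zip codes. Top-level statements (.NET 6 or later) are assumed.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical data: one HashSet per column of zip codes.
var column = new HashSet<int> { 90001, 90002, 90012, 90041 };

// Contains is a hash lookup, so its cost does not grow with the column size on average.
Console.WriteLine(column.Contains(90012)); // True
Console.WriteLine(column.Contains(90011)); // False
```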

Anyway, here's a simpler logic for making the range list and for finding whether a target is in a range:

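Make ranges a simple list of integers (or even zipcodes), and push the first element of zip as its first element.
For each element of zip except the last one, if that element plus one differs from the next element, add both that element plus one and the next element to ranges.
Push one more than the last element of zip at the end of ranges.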

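Now, to find out whether a zip code is in ranges, do a binary search in ranges for the smallest element that is greater than the target zip code. [Note 1] If the index of that element is odd, the target is in one of the ranges; otherwise it isn't.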

Notes:

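AIUI, the BinarySearch method on a C# list returns either the index of the element found or the complement of the index of the first larger element. To get the result needed by the suggested algorithm, you could use something like index >= 0 ? index + 1 : ~index, but it might be simpler to just search for the zipcode one less than the target and then use the complement of the low-order bit of the result.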
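
The steps above can be sketched roughly as follows, using the index >= 0 ? index + 1 : ~index variant from the note. The names BuildRanges and InRanges are hypothetical, the input is assumed to be sorted and free of duplicates, and top-level statements (.NET 6 or later) are assumed.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: builds the flat boundary list from sorted, distinct zip codes.
// Even indexes hold the first zip of a run; odd indexes hold the last zip of a run plus one.
static List<int> BuildRanges(IReadOnlyList<int> sortedZips)
{
    var ranges = new List<int>();
    if (sortedZips.Count == 0) return ranges;

    ranges.Add(sortedZips[0]);
    for (int i = 0; i < sortedZips.Count - 1; i++)
    {
        if (sortedZips[i] + 1 != sortedZips[i + 1])
        {
            ranges.Add(sortedZips[i] + 1);   // end of the current run (exclusive)
            ranges.Add(sortedZips[i + 1]);   // start of the next run
        }
    }
    ranges.Add(sortedZips[sortedZips.Count - 1] + 1);
    return ranges;
}

// Hypothetical helper: true if target lies inside one of the runs.
static bool InRanges(List<int> ranges, int target)
{
    int index = ranges.BinarySearch(target);
    // Index of the smallest element that is greater than the target.
    int firstGreater = index >= 0 ? index + 1 : ~index;
    return firstGreater % 2 == 1;
}

var boundaries = BuildRanges(new[] { 90001, 90002, 90003, 90012, 90031, 90032 });
Console.WriteLine(string.Join(", ", boundaries)); // 90001, 90004, 90012, 90013, 90031, 90033
Console.WriteLine(InRanges(boundaries, 90002));   // True
Console.WriteLine(InRanges(boundaries, 90004));   // False
```

Each run costs two integers no matter how long it is, which is where the memory saving over a hash of every zip code comes from.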
