Group by unknown initial prefix

Date: 2013-05-06 12:20:04

Tags: c# linq grouping

Say I have the following array of strings as input:

foo-139875913
foo-aeuefhaiu
foo-95hw9ghes
barbazabejgoiagjaegioea
barbaz8gs98ghsgh9es8h
9a8efa098fea0
barbaza98fyae9fghaefag
bazfa90eufa0e9u
bazgeajga8ugae89u
bazguea9guae
aifeaufhiuafhe

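There are 3 different prefixes in use here, "foo-", "barbaz" and "baz", but these prefixes are not known ahead of time (they could be something completely different).

How can you establish what the different common prefixes are, so that the strings can then be grouped by them? This is a bit tricky because, in the data I've provided, two strings start with "bazg" and one starts with "bazf", where of course "baz" is the intended prefix.

What I've tried so far is sorting the strings alphabetically, then looping through them in order and counting how many leading characters each one shares with the previous one. If that number changes, or if 0 characters are shared, a new group is started. The problem is that this falls over on the "bazg"/"bazf" case mentioned above and separates those strings into two different groups (one of them containing just a single element).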
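
As a rough illustration of that attempt (a sketch only: `SharedLength` and `OverlapWithPrevious` are made-up names, and ordinal sorting is an assumption), the following computes how many leading characters each sorted string shares with its predecessor. It only shows the overlap counts, not a full grouping pass.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class AdjacentOverlapSketch
{
    // Number of leading characters that a and b have in common.
    public static int SharedLength(string a, string b)
    {
        int limit = Math.Min(a.Length, b.Length);
        int i = 0;
        while (i < limit && a[i] == b[i]) i++;
        return i;
    }

    // Sorts ordinally and pairs each string with its overlap with the
    // previous string in sorted order (0 for the first string).
    public static List<Tuple<string, int>> OverlapWithPrevious(IEnumerable<string> source)
    {
        var sorted = source.OrderBy(s => s, StringComparer.Ordinal).ToList();
        var result = new List<Tuple<string, int>>();
        for (int i = 0; i < sorted.Count; i++)
        {
            int shared = i == 0 ? 0 : SharedLength(sorted[i - 1], sorted[i]);
            result.Add(Tuple.Create(sorted[i], shared));
        }
        return result;
    }
}
```

On the sample data, "bazfa90eufa0e9u" shares 2 characters with the "barbaz" entry before it, "bazgeajga8ugae89u" shares 3 with "bazfa90eufa0e9u", and "bazguea9guae" shares 4 with "bazgeajga8ugae89u". The count changes inside what should be a single "baz" group, which is exactly where the naive pass splits it.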

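Edit: Alright, let's throw in a few more rules:

  • Longer potential groups should generally be preferred over shorter ones, unless there is a closely matching group whose prefix is fewer than X characters shorter. (So where X is 2, baz would be preferred over bazg.)
  • A group must have at least Y elements in it, or it is not a group at all.
  • It's okay to simply throw away elements that don't match any of the "groups" under the rules above.

To clarify the first rule in relation to the second: if X were 0 and Y were 2, the two "bazg" entries would form a group, and the "bazf" entry would be thrown away because it is on its own.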

3 Answers:

Answer 0 (score: 5)

Well, here's a quick hack, probably O(something_bad):

IEnumerable<Tuple<String, IEnumerable<string>>> GuessGroups(IEnumerable<string> source, int minNameLength=0, int minGroupSize=1)
{
    // TODO: error checking
    return InnerGuessGroups(new Stack<string>(source.OrderByDescending(x => x)), minNameLength, minGroupSize);
}

IEnumerable<Tuple<String, IEnumerable<string>>> InnerGuessGroups(Stack<string> source, int minNameLength, int minGroupSize)
{
    if(source.Any())
    {
        var tuple = ExtractTuple(GetBestGroup(source, minNameLength), source);
        if (tuple.Item2.Count() >= minGroupSize)
            yield return tuple;
        foreach (var element in GuessGroups(source, minNameLength, minGroupSize))
            yield return element;   
    }
}

Tuple<String, IEnumerable<string>> ExtractTuple(string prefix, Stack<string> source)
{
    return Tuple.Create(prefix, PopWithPrefix(prefix, source).ToList().AsEnumerable());
}

IEnumerable<string> PopWithPrefix(string prefix, Stack<string> source)
{
    while (source.Any() && source.Peek().StartsWith(prefix))
        yield return source.Pop();
}

string GetBestGroup(IEnumerable<string> source, int minNameLength)
{
    var s = new Stack<string>(source);
    var counter = new DictionaryWithDefault<string, int>(0);
    while(s.Any())
    {
        var g = GetCommonPrefix(s);
        if(!string.IsNullOrEmpty(g) && g.Length >= minNameLength)
            counter[g]++;
        s.Pop();
    }
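    // If no prefix reached minNameLength, fall back to the next string itself
    // so the caller still pops something instead of Last() throwing on an empty sequence.
    if (counter.Count == 0)
        return source.First();
    return counter.OrderBy(c => c.Value).Last().Key;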
}

string GetCommonPrefix(IEnumerable<string> coll)
{
    // + 1 so that the full length of the shortest string is tried as well
    return (from len in Enumerable.Range(0, coll.Min(s => s.Length) + 1).Reverse()
            let possibleMatch = coll.First().Substring(0, len)
            where coll.All(f => f.StartsWith(possibleMatch))
            select possibleMatch).FirstOrDefault();
}

public class DictionaryWithDefault<TKey, TValue> : Dictionary<TKey, TValue>
{
  TValue _default;
  public TValue DefaultValue {
    get { return _default; }
    set { _default = value; }
  }
  public DictionaryWithDefault() : base() { }
  public DictionaryWithDefault(TValue defaultValue) : base() {
    _default = defaultValue;
  }
  public new TValue this[TKey key]
  {
    get { return base.ContainsKey(key) ? base[key] : _default; }
    set { base[key] = value; }
  }
}

Example usage:

string[] input = {
    "foo-139875913",
    "foo-aeuefhaiu",
    "foo-95hw9ghes",
    "barbazabejgoiagjaegioea",
    "barbaz8gs98ghsgh9es8h",
    "barbaza98fyae9fghaefag",
    "bazfa90eufa0e9u",
    "bazgeajga8ugae89u",
    "bazguea9guae",
    "9a8efa098fea0",
    "aifeaufhiuafhe"
};

// Dump() is LINQPad's output helper; outside LINQPad, enumerate the result and print it instead.
GuessGroups(input, 3, 2).Dump();

Answer 1 (score: 1)

OK, as discussed, the problem wasn't well defined initially, but here is how I'd go about it.

Create a tree T
Parse the list, for each element:
    for each letter in that element
        if a branch labeled with that letter exists then 
            Increment the counter on that branch
            Descend that branch
        else 
            Create a branch labelled with that letter
            Set its counter to 1
            Descend that branch
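
The construction above could be sketched in C# roughly as follows. This is one possible reading, not the answerer's code: `TrieNode`, `PrefixTrie`, `Build` and `CountWithPrefix` are illustrative names, and the counter is stored on the child node that a branch leads to.

```csharp
using System.Collections.Generic;

class TrieNode
{
    // One child per branch, keyed by the letter that labels the branch.
    public readonly Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();

    // How many input words pass through this node.
    public int Count;
}

static class PrefixTrie
{
    public static TrieNode Build(IEnumerable<string> words)
    {
        var root = new TrieNode();
        foreach (var word in words)
        {
            var node = root;
            node.Count++; // the root simply counts every word
            foreach (var letter in word)
            {
                TrieNode child;
                if (!node.Children.TryGetValue(letter, out child))
                {
                    // No branch for this letter yet: create it (counter starts at 0).
                    child = new TrieNode();
                    node.Children[letter] = child;
                }
                child.Count++; // increment the counter, then descend
                node = child;
            }
        }
        return root;
    }

    // Number of words that start with the given prefix (0 if the path does not exist).
    public static int CountWithPrefix(TrieNode root, string prefix)
    {
        var node = root;
        foreach (var letter in prefix)
        {
            if (!node.Children.TryGetValue(letter, out node))
                return 0;
        }
        return node.Count;
    }
}
```

With the question's sample data, `CountWithPrefix(root, "baz")` is 3 and `CountWithPrefix(root, "bazg")` is 2, which are the counters the scoring step later weighs against depth.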

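This gives you a tree in which each leaf represents a word from your input. Each non-leaf node has a counter that says how many leaves are (eventually) attached below it. Now you need a formula that weighs the length of the prefix (the depth of the node) against the size of the prefix group. For now: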

S = (a * d) + (b * q) // d = depth, q = quantity, a, b coefficients you'll tweak to get desired behaviour

So now you can iterate over every non-leaf node and assign it a score S. Then, to work out your groups, you would:

For each non-leaf node
    Assign score S
    Insertion sort the node in to a list, so the head is the highest scoring node

Starting at the root of the tree, traverse the nodes
    If the node is the highest scoring node in the list
        Mark it as a prefix 
        Remove all nodes from the list that are a descendant of it
        Pop itself off the front of the list
        Return up the tree

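This should give you a list of prefixes. The last part feels like something a clever data structure or algorithm could speed up (removing all the descendants feels particularly weak, but if your input is small, I guess speed isn't too important).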
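
A simplified sketch of the scoring and selection step, under several assumptions that are not part of the answer: prefixes are counted in a dictionary instead of a tree (each key stands in for a non-leaf node, its length for d and its count for q), candidates need at least `minGroupSize` members (borrowed from rule Y in the question), and a candidate is skipped when it is an ancestor or a descendant of a prefix already chosen. `PrefixScoring`, `CountPrefixes` and `PickPrefixes` are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class PrefixScoring
{
    // Counts how many words start with each proper prefix.
    // Key length plays the role of depth d, the value the role of quantity q.
    public static Dictionary<string, int> CountPrefixes(IEnumerable<string> words)
    {
        var counts = new Dictionary<string, int>(StringComparer.Ordinal);
        foreach (var word in words)
        {
            for (int len = 1; len < word.Length; len++)
            {
                var prefix = word.Substring(0, len);
                int seen;
                counts.TryGetValue(prefix, out seen); // seen stays 0 if absent
                counts[prefix] = seen + 1;
            }
        }
        return counts;
    }

    // Ranks candidates by S = a * d + b * q and greedily keeps the best ones,
    // skipping any candidate that nests with an already chosen prefix.
    public static List<string> PickPrefixes(IEnumerable<string> words, int a, int b, int minGroupSize)
    {
        var ranked = CountPrefixes(words)
            .Where(kv => kv.Value >= minGroupSize)
            .OrderByDescending(kv => a * kv.Key.Length + b * kv.Value)
            .Select(kv => kv.Key);

        var chosen = new List<string>();
        foreach (var candidate in ranked)
        {
            bool nests = chosen.Any(p =>
                p.StartsWith(candidate, StringComparison.Ordinal) ||
                candidate.StartsWith(p, StringComparison.Ordinal));
            if (!nests)
                chosen.Add(candidate);
        }
        return chosen;
    }
}
```

With the hand-picked values a = 4, b = 5 and minGroupSize = 2, the question's sample data yields barbaz, foo- and baz. Other coefficients give different answers (a larger a quickly favours "barbaza" over "barbaz"), which is the tweaking the formula calls for.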

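Answer 2 (score: 0)

I wonder whether your requirements are slightly off. It looks as if you are after a specific group size rather than a specific key length. The program below breaks the strings up, based on a specified group size, into the largest possible groups up to and including that size. So if you specify a group size of 5, it groups items on the shortest key that yields a group of at most 5 items. In your example it would group the foo- strings under "f", because there is no need for a more complex key as an identifier.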

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace ConsoleApplication2
{
    class Program
    {
        /// <remarks><c>true</c> in returned dictionary key are groups over <paramref name="maxGroupSize"/></remarks>
        public static Dictionary<bool,Dictionary<string, List<string>>> Split(int maxGroupSize, int keySize, IEnumerable<string> items)
        {
            var smallItems = from item in items
                             where item.Length < keySize
                             select item;
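            // Caveat: an item whose length equals keySize matches neither filter and is
            // left out of the result. Relaxing this to <= would keep such items, but Group
            // would then need a stop condition for more than maxGroupSize identical strings.
            var largeItems = from item in items
                             where keySize < item.Length
                             select item;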
            var largeItemsq = (from item in largeItems
                               let key = item.Substring(0, keySize)
                               group item by key into x
                               select new { Key = x.Key, Items = x.ToList() } into aGrouping
                               group aGrouping by aGrouping.Items.Count() > maxGroupSize into x2
                               select x2).ToDictionary(a => a.Key, a => a.ToDictionary(a_ => a_.Key, a_ => a_.Items));
            if (smallItems.Any())
            {
                var smallestLength = items.Aggregate(int.MaxValue, (acc, item) => Math.Min(acc, item.Length));
                var smallItemsq = (from item in smallItems
                                   let key = item.Substring(0, smallestLength)
                                   group item by key into x
                                   select new { Key = x.Key, Items = x.ToList() } into aGrouping
                                   group aGrouping by aGrouping.Items.Count() > maxGroupSize into x2
                                   select x2).ToDictionary(a => a.Key, a => a.ToDictionary(a_ => a_.Key, a_ => a_.Items));
                return Combine(smallItemsq, largeItemsq);
            }
            return largeItemsq;
        }

        static Dictionary<bool, Dictionary<string,List<string>>> Combine(Dictionary<bool, Dictionary<string,List<string>>> a, Dictionary<bool, Dictionary<string,List<string>>> b) {
            var x = new Dictionary<bool,Dictionary<string,List<string>>> {
                { true, null },
                { false, null }
            };
            foreach(var condition in new bool[] { true, false }) {
                var hasA = a.ContainsKey(condition);
                var hasB = b.ContainsKey(condition);
                x[condition] = hasA && hasB ? a[condition].Concat(b[condition]).ToDictionary(c => c.Key, c => c.Value)
                    : hasA ? a[condition]
                    : hasB ? b[condition]
                    : new Dictionary<string, List<string>>();
            }
            return x;
        }

        public static Dictionary<string, List<string>> Group(int maxGroupSize, IEnumerable<string> items, int keySize)
        {
            var toReturn = new Dictionary<string, List<string>>();
            var both = Split(maxGroupSize, keySize, items);
            if (both.ContainsKey(false))
                foreach (var key in both[false].Keys)
                    toReturn.Add(key, both[false][key]);
            if (both.ContainsKey(true))
            {
                var keySize_ = keySize + 1;
                var xs = from needsFix in both[true]
                         select needsFix;
                foreach (var x in xs)
                {
                    var fixedGroup = Group(maxGroupSize, x.Value, keySize_);
                    toReturn = toReturn.Concat(fixedGroup).ToDictionary(a => a.Key, a => a.Value);
                }
            }
            return toReturn;
        }

        static Random rand = new Random(unchecked((int)DateTime.Now.Ticks));
        const string allowedChars = "aaabbbbccccc"; // "aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ";
        // Random.Next's upper bound is exclusive, so use Length (not Length - 1) to reach the last character.
        static readonly int maxAllowed = allowedChars.Length;

        static IEnumerable<string> GenerateText()
        {
            var list = new List<string>();
            for (int i = 0; i < 100; i++)
            {
                var stringLength = rand.Next(3,25);
                var chars = new List<char>(stringLength);
                for (int j = stringLength; j > 0; j--)
                    chars.Add(allowedChars[rand.Next(0, maxAllowed)]);
                var newString = chars.Aggregate(new StringBuilder(), (acc, item) => acc.Append(item)).ToString();
                list.Add(newString);
            }
            return list;
        }

        static void Main(string[] args)
        {
            // runs 1000 times over autogenerated groups of sample text.
            for (int i = 0; i < 1000; i++)
            {
                var s = GenerateText();
                Go(s);
            }
            Console.WriteLine();
            Console.WriteLine("DONE");
            Console.ReadLine();
        }

        static void Go(IEnumerable<string> items)
        {
            var dict = Group(3, items, 1);
            foreach (var key in dict.Keys)
            {
                Console.WriteLine(key);
                foreach (var item in dict[key])
                    Console.WriteLine("\t{0}", item);
            }
        }

    }
}