删除字符串中的连续重复项以生成最小的字符串

时间:2018-04-05 01:39:27

标签: algorithm

给定一个字符串以及匹配> = 3个字符的约束,如何确保结果字符串尽可能小?

使用gassa的显式进行编辑:

'AAAABBBAC'

如果我先移除B, AAAA[BBB]AC -- > AAAAAC,然后我可以从结果字符串中删除所有A,并留下:

[AAAAA]C --> C

'C'

如果我先删除可用的内容(A的序列),我会得到:

[AAAA]BBBAC -- > [BBB]AC --> AC

'AC'

5 个答案:

答案 0 :(得分:4)

肯定会为您提供最短的字符串。

树解决方案

  1. 为每个当前State及其所有可移动子字符串定义string Input(节点)' int[] Indexes
  2. 创建树:为每个int index创建另一个State并将其添加到父状态State[] Children
  3. 没有可能的可移动子字符串的State没有子级Children = null
  4. 获取根State[]的所有后代State。按最短的string Input订购。这就是你的答案。
  5. 测试用例

    string result = FindShortest("AAAABBBAC");      // AC
    string result2 = FindShortest("AABBAAAC");      // AABBC
    string result3 = FindShortest("BAABCCCBBA");    // B
    

    守则

    注意:当然,欢迎所有人在性能和/或修复任何错误方面增强以下代码。

    class Program
    {
        static void Main(string[] args)
        {
            string result = FindShortest("AAAABBBAC");      // AC
            string result2 = FindShortest("AABBAAAC");      // AABBC
            string result3 = FindShortest("BAABCCCBBA");    // B
        }
    
        // finds the FIRST shortest string for a given input
        private static string FindShortest(string input)
        {
            // all possible removable strings' indexes
            // for this given input
            int[] indexes = RemovableIndexes(input);
    
            // each input string and its possible removables are a state
            var state = new State { Input = input, Indexes = indexes };
    
            // create the tree
            GetChildren(state);
    
            // get the FIRST shortest
            // i.e. there would be more than one answer sometimes
            // this could be easily changed to get all possible results
            var result = 
                Descendants(state)
                .Where(d => d.Children == null || d.Children.Length == 0)
                .OrderBy(d => d.Input.Length)
                .FirstOrDefault().Input;
    
    
            return result;
        }
    
        // simple get all descendants of a node/state in a tree
        private static IEnumerable<State> Descendants(State root)
        {
            var states = new Stack<State>(new[] { root });
            while (states.Any())
            {
                State node = states.Pop();
                yield return node;
                if (node.Children != null)
                    foreach (var n in node.Children) states.Push(n);
            }
        }
    
        // creates the tree
        private static void GetChildren(State state)
        {
            // for each an index there is a child
            state.Children = state.Indexes.Select(
                    i =>
                    {
                        var input = RemoveAllAt(state.Input, i);
                        return input.Length < state.Input.Length && input.Length > 0
                        ? new State
                        {
                            Input = input,
                            Indexes = RemovableIndexes(input)
                        }
                        : null;
                    }).ToArray();
    
            foreach (var c in state.Children)
                GetChildren(c);
        }
    
        // find all possible removable strings' indexes
        private static int[] RemovableIndexes(string input)
        {
            var indexes = new List<int>();
    
            char d = input[0];
            int count = 1;
            for (int i = 1; i < input.Length; i++)
            {
                if (d == input[i])
                    count++;
                else
                {
                    if (count >= 3)
                        indexes.Add(i - count);
    
                    // reset
                    d = input[i];
                    count = 1;
                }
            }
            if (count >= 3)
                indexes.Add(input.Length - count);
    
    
            return indexes.ToArray();
        }
    
        // remove all duplicate chars starting from an index
        private static string RemoveAllAt(string input, int startIndex)
        {
            string part1, part2;
    
            int endIndex = startIndex + 1;
            int i = endIndex;
            for (; i < input.Length; i++)
                if (input[i] != input[startIndex])
                {
                    endIndex = i;
                    break;
                }
    
            if (i == input.Length && input[i - 1] == input[startIndex])
                endIndex = input.Length;
    
            part1 = startIndex > 0 ? input.Substring(0, startIndex) : string.Empty;
            part2 = endIndex <= (input.Length - 1) ? input.Substring(endIndex) : string.Empty;
    
            return part1 + part2;
        }
    
        // our node, which is 
        // an input string & 
        // all possible removable strings' indexes
        // & its children
        public class State
        {
            public string Input;
            public int[] Indexes;
    
            public State[] Children;
        }
    }
    

答案 1 :(得分:3)

显然,我们不关心任何超过2个字符的重复字符块。并且只有一种方法可以组合两个相同字符的块,其中至少一个块的长度小于3可以组合 - 即,如果它们之间的序列可以被移除。

所以(1)查看相同字符的块对,其中至少一个长度小于3,(2)确定它们之间的序列是否可以被删除。

我们想要决定加入哪些对,以便最小化长度小于3个字符的块的总长度。 (请注意,对的数量受字母表大小(和分布)的约束。)

f(b)表示剩余的块b中长度小于3个字符的相同字符块的最小总长度。然后:

f(b):
  p1 <- previous block of the same character

  if b and p1 can combine:
    if b.length + p1.length > 2:
      f(b) = min(
        // don't combine
        (0 if b.length > 2 else b.length) +
          f(block before b),
        // combine
        f(block before p1)
      )

    // b.length + p1.length < 3
    else:
      p2 <- block previous to p1 of the same character

      if p1 and p2 can combine:
        f(b) = min(
          // don't combine
          b.length + f(block before b),
          // combine
          f(block before p2)
        )
      else:
        f(b) = b.length + f(block before b)

  // b and p1 cannot combine
  else:
    f(b) = b.length + f(block before b)

  for all p1 before b

问题是我们如何才能有效地确定一个块是否可以与同一个字符的前一个块组合(除了明显的递归到两个块之间的子块列表中)。

Python代码:

import random
import time

def parse(length):
  return length if length < 3 else 0

def f(string):
  chars = {}
  blocks = [[string[0], 1, 0]]

  chars[string[0]] = {'indexes': [0]}
  chars[string[0]][0] = {'prev': -1}

  p = 0 # pointer to current block

  for i in xrange(1, len(string)):
    if blocks[len(blocks) - 1][0] == string[i]:
      blocks[len(blocks) - 1][1] += 1
    else:
      p += 1
      # [char, length, index, f(i), temp] 
      blocks.append([string[i], 1, p])

      if string[i] in chars:
        chars[string[i]][p] = {'prev': chars[string[i]]['indexes'][ len(chars[string[i]]['indexes']) - 1 ]}
        chars[string[i]]['indexes'].append(p)
      else:
        chars[string[i]] = {'indexes': [p]}
        chars[string[i]][p] = {'prev': -1}

  #print blocks
  #print
  #print chars
  #print

  memo = [[None for j in xrange(len(blocks))] for i in xrange(len(blocks))]

  def g(l, r, top_level=False):
    ####
    ####
    #print "(l, r): (%s, %s)" % (l,r)

    if l == r:
      return parse(blocks[l][1])
    if memo[l][r]:
      return memo[l][r]

    result = [parse(blocks[l][1])] + [None for k in xrange(r - l)]

    if l < r:
      for i in xrange(l + 1, r + 1):
        result[i - l] = parse(blocks[i][1]) + result[i - l - 1]

    for i in xrange(l, r + 1):
      ####
      ####
      #print "\ni: %s" % i

      [char, length, index] = blocks[i]
      #p1 <- previous block of the same character
      p1_idx = chars[char][index]['prev']

      ####
      ####
      #print "(p1_idx, l, p1_idx >= l): (%s, %s, %s)" % (p1_idx, l, p1_idx >= l)

      if p1_idx < l and index > l:
        result[index - l] = parse(length) + result[index - l - 1]

      while p1_idx >= l:
        p1 = blocks[p1_idx]

        ####
        ####
        #print "(b, p1, p1_idx, l): (%s, %s, %s, %s)\n" % (blocks[i], p1, p1_idx, l)

        between = g(p1[2] + 1, index - 1)

        ####
        ####
        #print "between: %s" % between

        #if b and p1 can combine:
        if between == 0:
          if length + p1[1] > 2:
            result[index - l] = min(
              result[index - l],
              # don't combine
              parse(length) + (result[index - l - 1] if index - l > 0 else 0),
              # combine: f(block before p1)
              result[p1[2] - l - 1] if p1[2] > l else 0
            )

          # b.length + p1.length < 3
          else:
            #p2 <- block previous to p1 of the same character
            p2_idx = chars[char][p1[2]]['prev']

            if p2_idx < l:
              p1_idx = chars[char][p1_idx]['prev']
              continue

            between2 = g(p2_idx + 1, p1[2] - 1)
            #if p1 and p2 can combine:
            if between2 == 0:
              result[index - l] = min(
                result[index - l],
                # don't combine
                parse(length) + (result[index - l - 1] if index - l > 0 else 0),
                # combine the block, p1 and p2
                result[p2_idx - l - 1] if p2_idx - l > 0 else 0
              )
            else:
              #f(b) = b.length + f(block before b)
              result[index - l] = min(
                result[index - l],
                parse(length) + (result[index - l - 1] if index - l > 0 else 0)
              )

        # b and p1 cannot combine
        else:
          #f(b) = b.length + f(block before b)
          result[index - l] = min(
            result[index - l],
            parse(length) + (result[index - l - 1] if index - l > 0 else 0)
          )

        p1_idx = chars[char][p1_idx]['prev']

    #print l,r,result
    memo[l][r] = result[r - l]

    """if top_level:
      return (result, blocks)
    else:"""
    return result[r - l]

  if len(blocks) == 1:
    return ([parse(blocks[0][1])], blocks)
  else:
    return g(0, len(blocks) - 1, True)

"""s = ""

for i in xrange(300):
  s = s + ['A','B','C'][random.randint(0,2)]"""

print f("abcccbcccbacccab") # b
print
print f("AAAABBBAC");      # C
print
print f("CAAAABBBA");      # C
print
print f("AABBAAAC");      # AABBC
print
print f("BAABCCCBBA");    # B
print
print f("aaaa")
print

使用jdehesa&#39; s answer计算这些较长示例的字符串答案:

t0 = time.time()
print f("BCBCCBCCBCABBACCBABAABBBABBBACCBBBAABBACBCCCACABBCAABACBBBBCCCBBAACBAABACCBBCBBAABCCCCCAABBBBACBBAAACACCBCCBBBCCCCCCCACBABACCABBCBBBBBCBABABBACCAACBCBBAACBBBBBCCBABACBBABABAAABCCBBBAACBCACBAABAAAABABB")
# BCBCCBCCBCABBACCBABCCAABBACBACABBCAABACAACBAABACCBBCBBCACCBACBABACCABBCCBABABBACCAACBCBBAABABACBBABABBCCAACBCACBAABBABB
t1 = time.time()
total = t1-t0
print total

t0 = time.time()
print f("CBBACAAAAABBBBCAABBCBAABBBCBCBCACACBAABCBACBBABCABACCCCBACBCBBCBACBBACCCBAAAACACCABAACCACCBCBCABAACAABACBABACBCBAACACCBCBCCCABACABBCABBAAAAABBBBAABAABBCACACABBCBCBCACCCBABCAACBCAAAABCBCABACBABCABCBBBBABCBACABABABCCCBBCCBBCCBAAABCABBAAABBCAAABCCBAABAABCAACCCABBCAABCBCBCBBAACCBBBACBBBCABAABCABABABABCA")
# CBBACCAABBCBAACBCBCACACBAABCBACBBABCABABACBCBBCBACBBABCACCABAACCACCBCBCABAACAABACBABACBCBAACACCBCBABACABBCBBCACACABBCBCBCABABCAACBCBCBCABACBABCABCABCBACABABACCBBCCBBCACBCCBAABAABCBBCAABCBCBCBBAACCACCABAABCABABABABCA
t1 = time.time()
total = t1-t0
print total

t0 = time.time()
print f("AADBDBEBBBBCABCEBCDBBBBABABDCCBCEBABADDCABEEECCECCCADDACCEEAAACCABBECBAEDCEEBDDDBAAAECCBBCEECBAEBEEEECBEEBDACDDABEEABEEEECBABEDDABCDECDAABDAEADEECECEBCBDDAEEECCEEACCBBEACDDDDBDBCCAAECBEDAAAADBEADBAAECBDEACDEABABEBCABDCEEAABABABECDECADCEDAEEEBBBCEDECBCABDEDEBBBABABEEBDAEADBEDABCAEABCCBCCEDCBBEBCECCCA")
# AADBDBECABCEBCDABABDCCBCEBABADDCABCCEADDACCEECCABBECBAEDCEEBBECCBBCEECBAEBCBEEBDACDDABEEABCBABEDDABCDECDAABDAEADEECECEBCBDDACCEEACCBBEACBDBCCAAECBEDDBEADBAAECBDEACDEABABEBCABDCEEAABABABECDECADCEDACEDECBCABDEDEABABEEBDAEADBEDABCAEABCCBCCEDCBBEBCEA
t1 = time.time()
total = t1-t0
print total

答案 2 :(得分:2)

我建议使用动态编程的O(n ^ 2)解决方案。

让我们介绍一下符号。由P [1]和S [l]表示的字符串A的长度l的前缀和后缀。我们称之为程序Rcd。

  1. Rcd(A)= Rcd(Rcd(P [n-1])+ S [1])
  2. Rcd(A)= Rcd(P [1] + Rcd(S [n-1]))
  3. 请注意,RHS中的外部Rcd是微不足道的。所以,这是我们的最佳子结构。基于此,我想出了以下实现:

    #include <iostream>
    #include <string>
    #include <vector>
    #include <cassert>
    using namespace std;
    
    string remdupright(string s, bool allowEmpty) {
        if (s.size() >= 3) {
            auto pos = s.find_last_not_of(s.back());
            if (pos == string::npos && allowEmpty) s = "";
            else if (pos != string::npos && s.size() - pos > 3) s = s.substr(0, pos + 1);
        }
        return s;
    }
    
    string remdupleft(string s, bool allowEmpty) {
        if (s.size() >= 3) {
            auto pos = s.find_first_not_of(s.front());
            if (pos == string::npos && allowEmpty) s = "";
            else if (pos != string::npos && pos >= 3) s = s.substr(pos);
        }
        return s;
    }
    
    string remdup(string s, bool allowEmpty) {
        return remdupleft(remdupright(s, allowEmpty), allowEmpty);
    }
    
    string run(const string in) {
        vector<vector<string>> table(in.size());
        for (int i = 0; i < (int)table.size(); ++i) {
            table[i].resize(in.size() - i);
        }
        for (int i = 0; i < (int)table[0].size(); ++i) {
            table[0][i] = in.substr(i,1);
        }
    
        for (int len = 2; len <= (int)table.size(); ++len) {
            for (int pos = 0; pos < (int)in.size() - len + 1; ++pos) {
                string base(table[len - 2][pos]);
                const char suffix = in[pos + len - 1];
                if (base.size() && suffix != base.back()) {
                    base = remdupright(base, false);
                }
                const string opt1 = base + suffix;
    
                base = table[len - 2][pos+1];
                const char prefix = in[pos];
                if (base.size() && prefix != base.front()) {
                    base = remdupleft(base, false);
                }
                const string opt2 = prefix + base;
    
                const string nodupopt1 = remdup(opt1, true);
                const string nodupopt2 = remdup(opt2, true);
    
                table[len - 1][pos] = nodupopt1.size() > nodupopt2.size() ? opt2 : opt1;
                assert(nodupopt1.size() != nodupopt2.size() || nodupopt1 == nodupopt2);
            }
        }
        string& res = table[in.size() - 1][0];
        return remdup(res, true);
    }
    
    void testRcd(string s, string expected) {
        cout << s << " : " << run(s) << ", expected: " << expected << endl;
    }
    
    int main()
    {
        testRcd("BAABCCCBBA", "B");
        testRcd("AABBAAAC", "AABBC");
        testRcd("AAAA", "");
        testRcd("AAAABBBAC", "C");
    }
    

    您可以检查默认值并运行测试here

答案 3 :(得分:1)

这是一个Python解决方案(函数reduce_min),不是特别聪明,但我认为相当容易理解(为了清晰答案而添加了过多的注释):

def reductions(s, min_len):
    """
    Yields every possible reduction of s by eliminating contiguous blocks
    of l or more repeated characters.
    For example, reductions('AAABBCCCCBAAC', 3) yields
    'BBCCCCBAAC' and 'AAABBBAAC'.
    """
    # Current character
    curr = ''
    # Length of current block
    n = 0
    # Start position of current block
    idx = 0
    # For each character
    for i, c in enumerate(s):
        if c != curr:
            # New block begins
            if n >= min_len:
                # If previous block was long enough
                # yield reduced string without it
                yield s[:idx] + s[i:]
            # Start new block
            curr = c
            n = 1
            idx = i
        else:
            # Still in the same block
            n += 1
    # Yield reduction without last block if it was long enough
    if n >= min_len:
        yield s[:idx]

def reduce_min(s, min_len):
    """
    Finds the smallest possible reduction of s by successive
    elimination of contiguous blocks of min_len or more repeated
    characters.
    """
    # Current set of possible reductions
    rs = set([s])
    # Current best solution
    result = s
    # While there are strings to reduce
    while rs:
        # Get one element
        r = rs.pop()
        # Find reductions
        r_red = list(reductions(r, min_len))
        # If no reductions are found it is irreducible
        if len(r_red) == 0 and len(r) < len(result):
            # Replace if shorter than current best
            result = r
        else:
            # Save reductions for next iterations
            rs.update(r_red)
    return result

assert reduce_min("BAABCCCBBA", 3) == "B"
assert reduce_min("AABBAAAC", 3) == "AABBC"
assert reduce_min("AAAA", 3) == ""
assert reduce_min("AAAABBBAC", 3) == "C"

编辑:由于人们似乎在发布C ++解决方案,所以这是我的C ++(同样,函数reduce_min):

#include <string>
#include <vector>
#include <unordered_set>
#include <iterator>
#include <utility>
#include <cassert>

using namespace std;

void reductions(const string &s, unsigned int min_len, vector<string> &rs)
{
    char curr = '\0';
    unsigned int n = 0;
    unsigned int idx = 0;
    for (auto it = s.begin(); it != s.end(); ++it)
    {
        if (curr != *it)
        {
            auto i = distance(s.begin(), it);
            if (n >= min_len)
            {
                rs.push_back(s.substr(0, idx) + s.substr(i));
            }
            curr = *it;
            n = 1;
            idx = i;
        }
        else
        {
            n += 1;
        }
    }
    if (n >= min_len)
    {
        rs.push_back(s.substr(0, idx));
    }
}

string reduce_min(const string &s, unsigned int min_len)
{
    unordered_set<string> rs { s };
    string result = s;
    vector<string> rs_new;
    while (!rs.empty())
    {
        auto it = rs.begin();
        auto r = *it;
        rs.erase(it);
        rs_new.clear();
        reductions(r, min_len, rs_new);
        if (rs_new.empty() && r.size() < result.size())
        {
            result = move(r);
        }
        else
        {
            rs.insert(rs_new.begin(), rs_new.end());
        }
    }
    return result;
}

int main(int argc, char **argv)
{
    assert(reduce_min("BAABCCCBBA", 3) == "B");
    assert(reduce_min("AABBAAAC", 3) == "AABBC");
    assert(reduce_min("AAAA", 3) == "");
    assert(reduce_min("AAAABBBAC", 3) == "C");
    return 0;
}

如果您可以使用C ++ 17,则可以使用string views来保存内存。

编辑2:关于算法的复杂性。弄清楚并不是直截了当的,正如我所说的那样算法比任何事情都简单,但让我们看看。最后,它与广度优先搜索大致相同。假设字符串长度为n,并且,为了一般性,假设最小块长度(问题中的值3)为m。在第一级,我们可以在最坏的情况下减少n / m次减少。对于其中的每一项,我们最多可以生成(n - m) / m次减少,依此类推。所以基本上,在“级别”i(循环迭代i),我们为每个字符串创建最多(n - i * m) / m个减少,并且每个字符串都需要O(n - i * m)时间来处理。我们可以拥有的最大级别数是n / m。所以算法的复杂性(如果我没有犯错)应该有以下形式:

O( sum {i = 0 .. n / m} ( O(n - i * m) * prod {j = 0 .. i} ((n - i * m) / m) ))
       |-Outer iters--|   |---Cost---|        |-Prev lvl-| |---Branching---|

呼。所以这应该是这样的:

O( sum {i = 0 .. n / m} (n - i * m) * O(n^i / m^i) )

反过来会崩溃到:

O((n / m)^(n / m))

所以是的,算法或多或少都很简单,但它可以遇到指数成本的情况(坏的情况是完全由m制成的字符串 - 长块,如AAABBBCCCAAACCC... for { {1}} = 3)。

答案 4 :(得分:1)

另一个scala答案,使用memoization和tailcall优化(部分)(更新)

import scala.collection.mutable.HashSet
import scala.annotation._

object StringCondense extends App {

  @tailrec
  def groupConsecutive (s: String, sofar: List[String]): List[String] = s.toList match {
  // def groupConsecutive (s: String): List[String] = s.toList match {
      case Nil => sofar
      // case Nil => Nil
      case c :: str => {
          val (prefix, rest) = (c :: str).span (_ == c)
          // Strings of equal characters, longer than 3, don't make a difference to just 3
          groupConsecutive (rest.mkString(""), (prefix.take (3)).mkString ("") :: sofar)
          // (prefix.take (3)).mkString ("") :: groupConsecutive (rest.mkString(""))
      }
  }
  // to count the effect of memoization
  var count = 0

// recursively try to eliminate every group of 3 or more, brute forcing
// but for "aabbaabbaaabbbaabb", many reductions will lead sooner or 
// later to the same result, so we try to detect these and avoid duplicate 
// work 

  def moreThan2consecutive (s: String, seenbefore: HashSet [String]): String = {
      if (seenbefore.contains (s)) s
      else
      {
        count += 1
        seenbefore += s
        val sublists = groupConsecutive (s, Nil)
        // val sublists = groupConsecutive (s)
        val atLeast3 = sublists.filter (_.size > 2)
        atLeast3.length match {
            case 0 => s
            case 1 => {
                val res = sublists.filter (_.size < 3)
                moreThan2consecutive (res.mkString (""), seenbefore)
            }
            case _ => {
                val shrinked = (
                    for {idx <- (0 until sublists.size)
                    if (sublists (idx).length >= 3)
                    pre = (sublists.take (idx)).mkString ("")
                    post= (sublists.drop (idx+1)).mkString ("")
                    } yield {
                        moreThan2consecutive (pre + post, seenbefore)
                    }
                )
                (shrinked.head /: shrinked.tail) ((a, b) => if (a.length <= b.length) a else b)
            }
        }
      }
  }

    // don't know what Rcd means, adopted from other solution but modified 
    // kind of a unit test **update**: forgot to reset count 
    testRcd (s: String, expected: String) : Boolean = {
        count = 0
        val seenbefore = HashSet [String] ()
        val result = moreThan2consecutive (s, seenbefore)
        val hit = result.equals (expected)
        println (s"Input: $s\t result: ${result}\t expected ${expected}\t $hit\t count: $count");
        hit
    }

    // some test values from other users with expected result
    // **upd:** more testcases
    def testgroup () : Unit = {
        testRcd ("baabcccbba", "b")
        testRcd ("aabbaaac", "aabbc")
        testRcd ("aaaa", "")
        testRcd ("aaaabbbac", "c")

        testRcd ("abcccbcccbacccab", "b")
        testRcd ("AAAABBBAC", "C")
        testRcd ("CAAAABBBA", "C")
        testRcd ("AABBAAAC", "AABBC")
        testRcd ("BAABCCCBBA", "B")

        testRcd ("AAABBBAAABBBAAABBBC", "C")    // 377 subcalls reported by Yola,
        testRcd ("AAABBBAAABBBAAABBBAAABBBC", "C")  // 4913 when preceeded with AAABBB
    }
    testgroup

    def testBigs () : Unit = {
    /*
      testRcd ("BCBCCBCCBCABBACCBABAABBBABBBACCBBBAABBACBCCCACABBCAABACBBBBCCCBBAACBAABACCBBCBBAABCCCCCAABBBBACBBAAACACCBCCBBBCCCCCCCACBABACCABBCBBBBBCBABABBACCAACBCBBAACBBBBBCCBABACBBABABAAABCCBBBAACBCACBAABAAAABABB",
               "BCBCCBCCBCABBACCBABCCAABBACBACABBCAABACAACBAABACCBBCBBCACCBACBABACCABBCCBABABBACCAACBCBBAABABACBBABABBCCAACBCACBAABBABB")
    */
    testRcd ("CBBACAAAAABBBBCAABBCBAABBBCBCBCACACBAABCBACBBABCABACCCCBACBCBBCBACBBACCCBAAAACACCABAACCACCBCBCABAACAABACBABACBCBAACACCBCBCCCABACABBCABBAAAAABBBBAABAABBCACACABBCBCBCACCCBABCAACBCAAAABCBCABACBABCABCBBBBABCBACABABABCCCBBCCBBCCBAAABCABBAAABBCAAABCCBAABAABCAACCCABBCAABCBCBCBBAACCBBBACBBBCABAABCABABABABCA",
             "CBBACCAABBCBAACBCBCACACBAABCBACBBABCABABACBCBBCBACBBABCACCABAACCACCBCBCABAACAABACBABACBCBAACACCBCBABACABBCBBCACACABBCBCBCABABCAACBCBCBCABACBABCABCABCBACABABACCBBCCBBCACBCCBAABAABCBBCAABCBCBCBBAACCACCABAABCABABABABCA")
    /*testRcd ("AADBDBEBBBBCABCEBCDBBBBABABDCCBCEBABADDCABEEECCECCCADDACCEEAAACCABBECBAEDCEEBDDDBAAAECCBBCEECBAEBEEEECBEEBDACDDABEEABEEEECBABEDDABCDECDAABDAEADEECECEBCBDDAEEECCEEACCBBEACDDDDBDBCCAAECBEDAAAADBEADBAAECBDEACDEABABEBCABDCEEAABABABECDECADCEDAEEEBBBCEDECBCABDEDEBBBABABEEBDAEADBEDABCAEABCCBCCEDCBBEBCECCCA",
               "AADBDBECABCEBCDABABDCCBCEBABADDCABCCEADDACCEECCABBECBAEDCEEBBECCBBCEECBAEBCBEEBDACDDABEEABCBABEDDABCDECDAABDAEADEECECEBCBDDACCEEACCBBEACBDBCCAAECBEDDBEADBAAECBDEACDEABABEBCABDCEEAABABABECDECADCEDACEDECBCABDEDEABABEEBDAEADBEDABCAEABCCBCCEDCBBEBCEA")
    */
  }

  // for generated input, but with fixed seed, to compare the count with
    // and without memoization
    import util.Random
    val r = new Random (31415)

    // generate Strings but with high chances to produce some triples and 
    // longer sequences of char clones    
    def genRandomString () : String = {
      (1 to 20).map (_ => r.nextInt (6) match {
        case 0 => "t"
        case 1 => "r"
        case 2 => "-"
        case 3 => "tt"
        case 4 => "rr"
        case 5 => "--"
      }).mkString ("")
    }

    def testRandom () : Unit = {
      (1 to 10).map (i=> testRcd (genRandomString, "random mode - false might be true"))
    }

    testRandom

  testgroup
  testRandom
  // testBigs
}

比较记忆的效果导致有趣的结果:

更新衡量标准。在旧的价值观中,我忘了重置计数器,这导致更高的结果。现在结果传播 更令人印象深刻,总的来说,价值更小。

No seenbefore:
Input: baabcccbba    result: b   expected b  true    count: 4
Input: aabbaaac      result: aabbc   expected aabbc  true    count: 2
Input: aaaa      result:     expected    true    count: 2
Input: aaaabbbac     result: c   expected c  true    count: 5
Input: abcccbcccbacccab  result: b   expected b  true    count: 34
Input: AAAABBBAC     result: C   expected C  true    count: 5
Input: CAAAABBBA     result: C   expected C  true    count: 5
Input: AABBAAAC      result: AABBC   expected AABBC  true    count: 2
Input: BAABCCCBBA    result: B   expected B  true    count: 4
Input: AAABBBAAABBBAAABBBC  res: C   expected C  true    count: 377
Input: AAABBBAAABBBAAABBBAAABBBC r: C    expected C  true    count: 4913
Input: r--t----ttrrrrrr--tttrtttt--rr----result: rr--rr     expected ? unknown ?     false   count: 1959
Input: ttrtt----tr---rrrtttttttrtr--rr   result: r--rr      expected ? unknown ?     false   count: 213
Input: tt----r-----ttrr----ttrr-rr--rr-- result: ttrttrrttrr-rr--rr-- ex ? unknown ?     false   count: 16
Input: --rr---rrrrrrr-r--rr-r--tt--rrrrr result: rr-r--tt-- expected ? unknown ?     false   count: 32
Input: tt-rrrrr--r--tt--rrtrrr-------    result: ttr--tt--rrt   expected ? unknown ?     false   count: 35
Input: --t-ttt-ttt--rrrrrt-rrtrttrr  result: --tt-rrtrttrr  expected ? unknown ?     false   count: 35
Input: rrt--rrrr----trrr-rttttrrtttrr    result: rrtt-      expected ? unknown ?     false   count: 1310
Input: ---tttrrrrrttrrttrr---tt-----tt   result: rrttrr     expected ? unknown ?     false   count: 1011
Input: -rrtt--rrtt---t-r--r---rttr--     result: -rrtt--rr-r--rrttr-- ex ? unknown ?     false   count: 9
Input: rtttt--rrrrrrrt-rrttt--tt--t  result: r--t-rr--tt--t  expectd ? unknown ?     false   count: 16
real    0m0.607s    (without testBigs)
user    0m1.276s
sys 0m0.056s

With seenbefore:
Input: baabcccbba    result: b   expected b  true    count: 4
Input: aabbaaac      result: aabbc   expected aabbc  true    count: 2
Input: aaaa      result:     expected    true    count: 2
Input: aaaabbbac     result: c   expected c  true    count: 5
Input: abcccbcccbacccab  result: b   expected b  true    count: 11
Input: AAAABBBAC     result: C   expected C  true    count: 5
Input: CAAAABBBA     result: C   expected C  true    count: 5
Input: AABBAAAC      result: AABBC   expected AABBC  true    count: 2
Input: BAABCCCBBA    result: B   expected B  true    count: 4
Input: AAABBBAAABBBAAABBBC rest: C   expected C  true    count: 28
Input: AAABBBAAABBBAAABBBAAABBBC C   expected C  true    count: 52
Input: r--t----ttrrrrrr--tttrtttt--rr----result: rr--rr     expected ? unknown ?     false   count: 63
Input: ttrtt----tr---rrrtttttttrtr--rr   result: r--rr      expected ? unknown ?     false   count: 48
Input: tt----r-----ttrr----ttrr-rr--rr-- result: ttrttrrttrr-rr--rr-- xpe? unknown ?     false   count: 8
Input: --rr---rrrrrrr-r--rr-r--tt--rrrrr result: rr-r--tt-- expected ? unknown ?     false   count: 19
Input: tt-rrrrr--r--tt--rrtrrr-------    result: ttr--tt--rrt   expected ? unknown ?     false   count: 12
Input: --t-ttt-ttt--rrrrrt-rrtrttrr  result: --tt-rrtrttrr  expected ? unknown ?     false   count: 16
Input: rrt--rrrr----trrr-rttttrrtttrr    result: rrtt-      expected ? unknown ?     false   count: 133
Input: ---tttrrrrrttrrttrr---tt-----tt   result: rrttrr     expected ? unknown ?     false   count: 89
Input: -rrtt--rrtt---t-r--r---rttr--     result: -rrtt--rr-r--rrttr-- ex ? unknown ?     false   count: 6
Input: rtttt--rrrrrrrt-rrttt--tt--t  result: r--t-rr--tt--t expected ? unknown ?     false   count: 8
real    0m0.474s    (without testBigs)
user    0m0.852s
sys 0m0.060s

With tailcall:
real    0m0.478s    (without testBigs)
user    0m0.860s
sys 0m0.060s

对于一些随机字符串,差异大于10倍。

对于包含许多组的长字符串,作为改进,可以消除所有组中唯一的组,例如:

aa bbb aa ccc xx ddd aa eee aa fff xx 

bbb,ccc,ddd,eee和fff这些组在字符串中是唯一的,所以它们不适合其他东西并且都可以被删除,并且删除顺序无关紧要。这将导致中间结果

aaaa xx aaaa xx 

快速解决方案。也许我也尝试实现它。但是,我猜,有可能产生随机字符串,这会产生很大的影响,并且会产生一种不同形式的随机生成字符串,而影响很小的分布。