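I want to summarize, rather than compress, in a manner similar to run-length encoding, but in a nested sense.
For instance, I would like ABCBCABCBCDEEF to become (2A(2BC))D(2E)F.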
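I am not concerned about which option is picked between two equally good nestings. For example, ABBABBABBABA could be (3ABB)ABA or A(3BBA)BA, which have the same compressed length despite having different structures. However, I do want the choice to be the greediest one. For example:
ABCDABCDCDCDCD would pick (2ABCD)(3CD), which has a length of 6 in original symbols, less than ABCDAB(4CD), which has a length of 8 in original symbols.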
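To make the "length in original symbols" measure concrete, here is a minimal sketch that counts the raw symbols left in an encoded string, ignoring parentheses and repeat counts. The class and method names are hypothetical, and it assumes that symbols are letters and counts are digits, as in the examples above.

```csharp
using System;
using System.Linq;

public static class EncodedLength
{
    // Counts the original symbols that remain in an encoded string.
    // Assumption: every symbol is a letter; digits and parentheses are markup.
    public static int RawSymbolCount(string encoded)
    {
        return encoded.Count(c => char.IsLetter(c));
    }
}
```

Under those assumptions, `EncodedLength.RawSymbolCount("(2ABCD)(3CD)")` gives 6 and `EncodedLength.RawSymbolCount("ABCDAB(4CD)")` gives 8, matching the figures quoted above.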
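For background, I have some repeating patterns that I want to summarize so that the data is easier to digest. I don't want to disturb the logical order of the data, because it is important. But I do want to summarize it by saying that symbol A occurs 3 times, followed by the symbols XYZ occurring 20 times, and so on, and this can be displayed visually in a nested way.
Your thoughts are welcome.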
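Answer 0 (score: 3)
I'm pretty sure this isn't the best approach, and depending on the length of the patterns its running time and memory usage might make it unworkable, but here's some code.
You can paste the following code into LINQPad and run it, and it should produce the following output: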
ABCBCABCBCDEEF = (2A(2BC))D(2E)F
ABBABBABBABA = (3A(2B))ABA
ABCDABCDCDCDCD = (2ABCD)(3CD)
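As you can see, the middle example encodes ABB as A(2B) instead of ABB. You'll have to judge for yourself whether single-symbol runs like that should be encoded as a repeated symbol or not, or whether a specific threshold (such as 3 or higher) should be used.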
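One way such a threshold might look is sketched below. This is not the code from this answer: it is a simplified, string-only variant that applies the same "shortest repeating unit first" search, and `ThresholdRle` and `minSingleSymbolCount` are hypothetical names introduced for illustration. Runs of a single symbol shorter than the threshold are left expanded.

```csharp
using System;
using System.Text;

public static class ThresholdRle
{
    // minSingleSymbolCount (hypothetical parameter): a run of one repeated
    // symbol is collapsed only if it repeats at least this many times.
    public static string Encode(string s, int minSingleSymbolCount = 3)
    {
        var sb = new StringBuilder();
        int i = 0;
        while (i < s.Length)
        {
            bool found = false;
            // Try unit lengths 1, 2, 3, ... and take the first one that repeats.
            for (int len = 1; i + 2 * len <= s.Length; len++)
            {
                int count = CountRepeats(s, i, len);
                if (count < 2) continue;
                if (len == 1 && count < minSingleSymbolCount) continue;

                // Encode one copy of the unit recursively, then skip all copies.
                sb.Append('(').Append(count)
                  .Append(Encode(s.Substring(i, len), minSingleSymbolCount))
                  .Append(')');
                i += len * count;
                found = true;
                break;
            }
            if (!found)
            {
                sb.Append(s[i]); // nothing repeats here: emit the symbol as-is
                i++;
            }
        }
        return sb.ToString();
    }

    // Number of consecutive copies of s[start .. start + len) beginning at start.
    private static int CountRepeats(string s, int start, int len)
    {
        int count = 1;
        while (start + len * (count + 1) <= s.Length &&
               string.CompareOrdinal(s, start, s, start + len * count, len) == 0)
        {
            count++;
        }
        return count;
    }
}
```

With a threshold of 3, `ThresholdRle.Encode("ABBABBABBABA", 3)` gives (3ABB)ABA, while a threshold of 2 gives (3A(2B))ABA as in the output above. One caveat of this sketch: with thresholds above 3, a short run such as BBBB can still be picked up as a two-symbol unit, (2BB), so a real implementation may want to handle that case explicitly.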
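Basically, the code runs as follows: starting at the current position, it tries unit lengths of 1, 2, 3 and so on, and takes the first unit that is immediately followed by at least one copy of itself. It then counts how many consecutive copies follow, recursively encodes one copy of the unit, and continues after the last copy. If nothing repeats at that position, the symbol is emitted on its own.
Anyway, here's the code: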
void Main()
{
    string[] examples = new[]
    {
        "ABCBCABCBCDEEF",
        "ABBABBABBABA",
        "ABCDABCDCDCDCD",
    };
    foreach (string example in examples)
    {
        StringBuilder sb = new StringBuilder();
        foreach (var r in Encode(example))
            sb.Append(r.ToString());
        Debug.WriteLine(example + " = " + sb.ToString());
    }
}

public static IEnumerable<Repeat<T>> Encode<T>(IEnumerable<T> values)
{
    return Encode<T>(values, EqualityComparer<T>.Default);
}

public static IEnumerable<Repeat<T>> Encode<T>(IEnumerable<T> values, IEqualityComparer<T> comparer)
{
    List<T> sequence = new List<T>(values);
    int index = 0;
    while (index < sequence.Count)
    {
        var bestSequence = FindBestSequence<T>(sequence, index, comparer);
        if (bestSequence == null || bestSequence.Length < 1)
            throw new InvalidOperationException("Unable to find sequence at position " + index);
        yield return bestSequence;
        index += bestSequence.Length;
    }
}

private static Repeat<T> FindBestSequence<T>(IList<T> sequence, int startIndex, IEqualityComparer<T> comparer)
{
    // Try unit lengths 1, 2, 3, ... while two copies still fit in the input.
    int sequenceLength = 1;
    while (startIndex + sequenceLength * 2 <= sequence.Count)
    {
        // Quick check on the first symbol before comparing the whole unit.
        if (comparer.Equals(sequence[startIndex], sequence[startIndex + sequenceLength]))
        {
            bool atLeast2Repeats = true;
            for (int index = 0; index < sequenceLength; index++)
            {
                if (!comparer.Equals(sequence[startIndex + index], sequence[startIndex + sequenceLength + index]))
                {
                    atLeast2Repeats = false;
                    break;
                }
            }
            if (atLeast2Repeats)
            {
                // Count how many further consecutive copies of the unit follow.
                int count = 2;
                while (startIndex + sequenceLength * (count + 1) <= sequence.Count)
                {
                    bool anotherRepeat = true;
                    for (int index = 0; index < sequenceLength; index++)
                    {
                        if (!comparer.Equals(sequence[startIndex + index], sequence[startIndex + sequenceLength * count + index]))
                        {
                            anotherRepeat = false;
                            break;
                        }
                    }
                    if (anotherRepeat)
                        count++;
                    else
                        break;
                }
                // Recursively encode one copy of the unit to get the nesting.
                List<T> oneSequence = Enumerable.Range(0, sequenceLength).Select(i => sequence[startIndex + i]).ToList();
                var repeatedSequence = Encode<T>(oneSequence, comparer).ToArray();
                return new SequenceRepeat<T>(count, repeatedSequence);
            }
        }
        sequenceLength++;
    }
    // fall back, we could not find anything that repeated at all
    return new SingleSymbol<T>(sequence[startIndex]);
}

public abstract class Repeat<T>
{
    public int Count { get; private set; }

    protected Repeat(int count)
    {
        Count = count;
    }

    public abstract int Length
    {
        get;
    }
}

public class SingleSymbol<T> : Repeat<T>
{
    public T Value { get; private set; }

    public SingleSymbol(T value)
        : base(1)
    {
        Value = value;
    }

    public override string ToString()
    {
        return string.Format("{0}", Value);
    }

    public override int Length
    {
        get
        {
            return Count;
        }
    }
}

public class SequenceRepeat<T> : Repeat<T>
{
    public Repeat<T>[] Values { get; private set; }

    public SequenceRepeat(int count, Repeat<T>[] values)
        : base(count)
    {
        Values = values;
    }

    public override string ToString()
    {
        return string.Format("({0}{1})", Count, string.Join("", Values.Select(v => v.ToString())));
    }

    // Number of original symbols covered: Count copies of the unit.
    public override int Length
    {
        get
        {
            int oneLength = 0;
            foreach (var value in Values)
                oneLength += value.Length;
            return Count * oneLength;
        }
    }
}

public class GroupRepeat<T> : Repeat<T>
{
    public Repeat<T> Group { get; private set; }

    public GroupRepeat(int count, Repeat<T> group)
        : base(count)
    {
        Group = group;
    }

    public override string ToString()
    {
        return string.Format("({0}{1})", Count, Group);
    }

    public override int Length
    {
        get
        {
            return Count * Group.Length;
        }
    }
}
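Answer 1 (score: 1)
Looking at the problem theoretically, it seems similar to the problem of finding the smallest context-free grammar that generates (only) the string, except that in this case a non-terminal can only be repeated in direct sequence, one copy right after another, so for example: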
ABCBCABCBCDEEF
s->ttDuuF
t->Avv
v->BC
u->E

ABABCDABABCD
s->ABtt
t->ABCD
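Of course, this depends on how you define "smallest", but if you count the terminals on the right-hand sides of the rules, it should be the same as the "length in original symbols" after doing the nested run-length encoding.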
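As a small illustration of that count, the sketch below expands a grammar back into its string and counts the terminals on the right-hand sides. The class and method names are hypothetical, and it assumes the notation used above: lowercase letters are non-terminals, uppercase letters are terminals, and each rule is stored as a map entry from the non-terminal to its right-hand side.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class GrammarSize
{
    // Expands a non-terminal into the string it generates.
    // Assumption: lowercase = non-terminal, uppercase = terminal.
    public static string Expand(IDictionary<char, string> rules, char symbol)
    {
        var sb = new StringBuilder();
        foreach (char c in rules[symbol])
            sb.Append(char.IsLower(c) ? Expand(rules, c) : c.ToString());
        return sb.ToString();
    }

    // Counts the terminals that appear on the right-hand sides of all rules.
    public static int TerminalCount(IDictionary<char, string> rules)
    {
        return rules.Values.Sum(rhs => rhs.Count(c => char.IsUpper(c)));
    }
}
```

For the first grammar above (s->ttDuuF, t->Avv, v->BC, u->E), `Expand(rules, 's')` gives ABCBCABCBCDEEF and `TerminalCount(rules)` gives 6, the same as the number of original symbols in (2A(2BC))D(2E)F.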
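The smallest grammar problem is known to be hard and is well studied. I don't know how much the "direct sequence" restriction adds to or takes away from that complexity.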