Question

我有一个由字符A，B，C和D组成的字符串，我正在尝试计算最长子字符串的长度，这些子字符串的任何顺序中每个字符的数量相等。例如，ABCDB将返回4，ABCC 0和ADDBCCBA 8。

目前我的代码：

    public int longestSubstring(String word) {
        HashMap<Integer, String> map = new HashMap<Integer, String>();

        for (int i = 0; i<word.length()-3; i++) {
            map.put(i, word.substring(i, i+4));
        }

        StringBuilder sb;   

        int longest = 0;

        for (int i = 0; i<map.size(); i++) {
            sb = new StringBuilder();
            sb.append(map.get(i));

            int a = 4;

            while (i<map.size()-a) {
                sb.append(map.get(i+a));
                a+= 4;
            }

            String substring = sb.toString();

            if (equalAmountOfCharacters(substring)) {
                int length = substring.length();
                if (length > longest)
                    longest = length;
            }

        }

    return longest;
}

如果字符串长度为10 ^ 4，那么目前效果非常好，但是我试图将它设为10 ^ 5。任何提示或建议将不胜感激。

Answer 1

如果只有A和B，那么你可以这样做。

def longest_balanced(word):
    length = 0
    cumulative_difference = 0
    first_index = {0: -1}
    for index, letter in enumerate(word):
        if letter == 'A':
            cumulative_difference += 1
        elif letter == 'B':
            cumulative_difference -= 1
        else:
            raise ValueError(letter)
        if cumulative_difference in first_index:
            length = max(length, index - first_index[cumulative_difference])
        else:
            first_index[cumulative_difference] = index
    return length

所有四个字母的生活都比较复杂，但这个想法大致相同。对于A和B，我们不是仅保留一个累积差异，而是保留三个，对于A＆B，A＆C，A和C，和A＆D＆＃39; s。

Answer 2

好吧，首先，不要构造任何字符串如果你不生产任何（或几乎没有）垃圾，就不需要收集它，这是一个主要的优点。

接下来，使用不同的数据结构：

我建议使用4个字节数组，从相应的字符串索引开始，在4个跨度中存储各自符号的计数。
这应该会加速它。

Answer 3

假设cnt(c, i)是长度c前缀中字符i的出现次数。
子字符串(low, high]具有等量的两个字符a和b iff cnt(a, high) - cnt(a, low) = cnt(b, high) - cnt(b, low)，或换句话说，cnt(b, high) - cnt(a, high) = cnt(b, low) - cnt(a, low) 。因此，每个位置由值cnt(b, i) - cnt(a, i)描述。现在我们可以将它概括为两个以上的字符：每个位置由元组(cnt(a_2, i) - cnt(a_1, i), ..., cnt(a_k, i) - cnt(a_1, i))描述，其中a_1 ... a_k是字母表。
我们可以迭代给定的字符串并维护当前元组。在每一步，我们都应该通过检查i - first_occurrence(current_tuple)的值来更新答案，其中first_occurrence是一个哈希表，用于存储到目前为止看到的每个元组的第一次出现。不要忘记在迭代之前将一个零元组放入哈希映射（它对应于一个空前缀）。

Answer 4

您可能希望缓存每个String索引的累积字符数 - 这就是真正的瓶颈所在。没有经过彻底的测试，但下面的内容应该有效。

public class Test {
    static final int LEN = 4;

    static class RandomCharSequence implements CharSequence {
        private final Random mRandom = new Random();
        private final int mAlphabetLen;
        private final int mLen;
        private final int mOffset;

        RandomCharSequence(int pLen, int pOffset, int pAlphabetLen) {
            mAlphabetLen = pAlphabetLen;
            mLen = pLen;
            mOffset = pOffset;
        }

        public int length() {return mLen;}

        public char charAt(int pIdx) {
            mRandom.setSeed(mOffset + pIdx);
            return (char) (
                'A' +
                 (mRandom.nextInt() % mAlphabetLen + mAlphabetLen) % mAlphabetLen
            );
        }

        public CharSequence subSequence(int pStart, int pEnd) {
            return new RandomCharSequence(pEnd - pStart, pStart, mAlphabetLen);
        }

        @Override public String toString() {
            return (new StringBuilder(this)).toString();
        }
    }

    public static void main(String[] pArgs) {
        Stream.of("ABCDB", "ABCC", "ADDBCCBA", "DADDBCCBA").forEach(
            pWord -> System.out.println(longestSubstring(pWord))
        );

        for (int i = 0; ; i++) {
            final double len = Math.pow(10, i);
            if (len >= Integer.MAX_VALUE) break;

            System.out.println("Str len 10^" + i);
            for (int alphabetLen = 1; alphabetLen <= LEN; alphabetLen++) {
                final Instant start = Instant.now();
                final int val = longestSubstring(
                    new RandomCharSequence((int) len, 0, alphabetLen)
                );

                System.out.println(
                    String.format(
                        "  alphabet len %d; result %08d; time %s",
                        alphabetLen,
                        val,
                        formatMillis(ChronoUnit.MILLIS.between(start, Instant.now()))
                    )
                );
            }
        }
    }

    static String formatMillis(long millis) {
        return String.format(
            "%d:%02d:%02d.%03d",
            TimeUnit.MILLISECONDS.toHours(millis),
            TimeUnit.MILLISECONDS.toMinutes(millis) -
             TimeUnit.HOURS.toMinutes(TimeUnit.MILLISECONDS.toHours(millis)),
            TimeUnit.MILLISECONDS.toSeconds(millis) -
             TimeUnit.MINUTES.toSeconds(TimeUnit.MILLISECONDS.toMinutes(millis)),
            TimeUnit.MILLISECONDS.toMillis(millis) -
             TimeUnit.SECONDS.toMillis(TimeUnit.MILLISECONDS.toSeconds(millis))
        );
    }

    static int longestSubstring(CharSequence pWord) {
        // create array that stores cumulative char counts at each index of string
        // idx 0 = char (A-D); idx 1 = offset
        final int[][] cumulativeCnts = new int[LEN][];
        for (int i = 0; i < LEN; i++) {
            cumulativeCnts[i] = new int[pWord.length() + 1];
        }

        final int[] cumulativeCnt = new int[LEN];

        for (int i = 0; i < pWord.length(); i++) {
            cumulativeCnt[pWord.charAt(i) - 'A']++;
            for (int j = 0; j < LEN; j++) {
                cumulativeCnts[j][i + 1] = cumulativeCnt[j];
            }
        }

        final int maxResult = Arrays.stream(cumulativeCnt).min().orElse(0) * LEN;
        if (maxResult == 0) return 0;

        int result = 0;
        for (int initialOffset = 0; initialOffset < LEN; initialOffset++) {
            for (
                int start = initialOffset;
                start < pWord.length() - result;
                start += LEN
            ) {
                endLoop:
                for (
                    int end = start + result + LEN;
                    end <= pWord.length() && end - start <= maxResult;
                    end += LEN
                ) {
                    final int substrLen = end - start;
                    final int expectedCharCnt = substrLen / LEN;
                    for (int i = 0; i < LEN; i++) {
                        if (
                            cumulativeCnts[i][end] - cumulativeCnts[i][start] !=
                             expectedCharCnt
                        ) {
                            continue endLoop;
                        }
                    }
                    if (substrLen > result) result = substrLen;
                }
            }
        }
        return result;
    }
}

Answer 5

您可以计算word中字符的出现次数。然后，一个可能的解决方案可能是：

如果min是word中任何字符的最小出现次数，则min也是我们要查找的子字符串中每个字符的最大出现次数。在下面的代码中，min为maxCount。
我们迭代maxCount的递减值。在每一步中，我们要搜索的字符串的长度为maxCount * alphabetSize。我们可以将其视为滑动窗口的大小，我们可以在word上滑动。
我们将窗口滑过word，计算窗口中字符的出现次数。如果窗口是我们要搜索的子字符串，则返回结果。否则，我们会继续搜索。

[已修复] 代码：

private static final int ALPHABET_SIZE = 4;

public int longestSubstring(String word) {
    // count
    int[] count = new int[ALPHABET_SIZE];
    for (int i = 0; i < word.length(); i++) {
        char c = word.charAt(i);
        count[c - 'A']++;
    }
    int maxCount = word.length();
    for (int i = 0; i < count.length; i++) {
        int cnt = count[i];
        if (cnt < maxCount) {
            maxCount = cnt;
        }
    }
    // iterate over maxCount until found
    boolean found = false;
    while (maxCount > 0 && !found) {
        int substringLength = maxCount * ALPHABET_SIZE;
        found = findSubstring(substringLength, word, maxCount);
        if (!found) {
            maxCount--;
        }
    }
    return found ? maxCount * ALPHABET_SIZE : 0;
}

private boolean findSubstring(int length, String word, int maxCount) {
    int startIndex = 0;
    boolean found = false;
    while (startIndex + length <= word.length()) {
        int[] count = new int[ALPHABET_SIZE];
        for (int i = startIndex; i < startIndex + length; i++) {
            char c = word.charAt(i);
            int cnt = ++count[c - 'A'];
            if (cnt > maxCount) {
                break;
            }
        }
        if (equalValues(count, maxCount)) {
            found = true;
            break;
        } else {
            startIndex++;
        }
    }
    return found;
}

// Returns true if all values in c are equal to value
private boolean equalValues(int[] count, int value) {
    boolean result = true;
    for (int i : count) {
        if (i != value) {
            result = false;
            break;
        }
    }
    return result;
}

[已合并] 这是Hollis Waite使用累计计数的解决方案，但考虑了我在第1点和第2点的观察结果。这可能会提高某些输入的性能：

private static final int ALPHABET_SIZE = 4;

public int longestSubstring(String word) {
    // count
    int[][] cumulativeCount = new int[ALPHABET_SIZE][];
    for (int i = 0; i < ALPHABET_SIZE; i++) {
        cumulativeCount[i] = new int[word.length() + 1];
    }
    int[] count = new int[ALPHABET_SIZE];
    for (int i = 0; i < word.length(); i++) {
        char c = word.charAt(i);
        count[c - 'A']++;
        for (int j = 0; j < ALPHABET_SIZE; j++) {
            cumulativeCount[j][i + 1] = count[j];
        }
    }
    int maxCount = word.length();
    for (int i = 0; i < count.length; i++) {
        int cnt = count[i];
        if (cnt < maxCount) {
            maxCount = cnt;
        }
    }
    // iterate over maxCount until found
    boolean found = false;
    while (maxCount > 0 && !found) {
        int substringLength = maxCount * ALPHABET_SIZE;
        found = findSubstring(substringLength, word, maxCount, cumulativeCount);
        if (!found) {
            maxCount--;
        }
    }
    return found ? maxCount * ALPHABET_SIZE : 0;
}

private boolean findSubstring(int length, String word, int maxCount, int[][] cumulativeCount) {
    int startIndex = 0;
    int endIndex = (startIndex + length) - 1;
    boolean found = true;
    while (endIndex < word.length()) {
        for (int i = 0; i < ALPHABET_SIZE; i++) {
            if (cumulativeCount[i][endIndex] - cumulativeCount[i][startIndex] != maxCount) {
                found = false;
                break;
            }
        }
        if (found) {
            break;
        } else {
            startIndex++;
            endIndex++;
        }
    }
    return found;
}

Answer 6

假设长度为N的字符串中有K个可能的字母。我们可以跟踪用长度为K的向量 pos 看到的字母平衡，其更新如下：

如果看到字母1，请添加（K-1，-1，-1，...）
如果看到字母2，请添加（-1，K-1，-1，...）
如果看到字母3，请添加（-1，-1，K-1，...）

维护一个散列，将pos映射到到达pos的第一个字符串位置。只要hash [pos]已经存在且子串值为s [hash [pos]：pos]，就会出现平衡的子串。

维护散列的成本是O（log N），因此处理字符串需要O（N log N）。到目前为止，这与解决方案相比如何？这些类型的问题往往有线性解决方案，但我还没有遇到过。

这里有一些代码演示了3个字母和使用偏置随机字符串运行的想法。（统一的随机字符串允许大约一半字符串长度的解决方案，这对于打印来说是笨重的。）

#!/usr/bin/python
import random
from time import time

alphabet = "abc"
DIM = len(alphabet)


def random_string(n):
    # return a random string over choices[] of length n
    # distribution of letters is non-uniform to make matches harder to find
    choices = "aabbc"
    s = ''
    for i in range(n):
        r = random.randint(0, len(choices) - 1)
        s += choices[r]
    return s


def validate(s):
    # verify frequencies of each letter are the same
    f = [0, 0, 0]
    a2f = {alphabet[i] : i for i in range(DIM)}
    for c in s:
        f[a2f[c]] += 1
    assert f[0] == f[1] and f[1] == f[2]


def longest_balanced(s):
    """return length of longest substring of s containing equal
       populations of each letter in alphabet"""

    slen = len(s)
    p = [0 for i in range(DIM)]
    vec = {alphabet[0] : [2, -1, -1],
           alphabet[1] : [-1, 2, -1],
           alphabet[2] : [-1, -1, 2]}
    x = -1
    best = -1
    hist = {str([0, 0, 0]) : -1}

    for c in s:
        x += 1
        p = [p[i] + vec[c][i] for i in range(DIM)]
        pkey = str(p)

        if pkey not in hist:
            hist[pkey] = x
        else:
            span = x - hist[pkey]
            assert span % DIM == 0
            if span > best:
                best = span

                cand = s[hist[pkey] + 1: x + 1]
                print("best so far %d = [%d,%d]: %s" % (best,
                                                        hist[pkey] + 1,
                                                        x + 1,
                                                        cand))
                validate(cand)

    return best if best > -1 else 0


def main():
    #print longest_balanced( "aaabcabcbbcc" )

    t0 = time()
    s = random_string(1000000)
    print "generate time:", time() - t0
    t1 = time()
    best = longest_balanced( s )
    print "best:", best
    print "elapsed:", time() - t1


main()

示例在10 ^ 6个字母的输入上运行，字母为3个字母：

$ ./bal.py 
...
best so far 189 = [847894,848083]: aacacbcbabbbcabaabbbaabbbaaaacbcaaaccccbcbcbababaabbccccbbabbacabbbbbcaacacccbbaacbabcbccaabaccabbbbbababbacbaaaacabcbabcbccbabbccaccaabbcabaabccccaacccccbaacaaaccbbcbcabcbcacaabccbacccacca
best: 189
elapsed: 1.43609690666

如何有效地找到具有相同字符数量的最长子字符串

6 个答案: