Question

I'm working through Cracking the Coding Interview (4th edition), and one of the questions is as follows:

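Design an algorithm and write code to remove the duplicate characters in a string without using any additional buffer. NOTE: One or two additional variables are fine. An extra copy of the array is not.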

I wrote the following solution, which satisfies all of the test cases specified by the author:

def remove_duplicate(s):
    return ''.join(sorted(set(s)))

print(remove_duplicate("abcd"))  # output "abcd"
print(remove_duplicate("aaaa"))  # output "a"
print(remove_duplicate(""))  # output ""
print(remove_duplicate("aabb"))  # output "ab"

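Does using a set in my solution count as using an additional buffer, or is my solution adequate? If my solution is not adequate, what would be a better way to go about it?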

Thank you very much!

Answer 1

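Only the person administering the question or evaluating the answer could say for sure, but I would say that a set does count as a buffer.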

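If there are no repeated characters in the string, the length of the set will equal the length of the string. In fact, since a set carries significant overhead because it is built on a hash table, the set will probably take more memory than the string. If the string contains Unicode, the number of unique characters could be very large.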

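If you don't know how many unique characters are in the string, you cannot predict the length of the set. That possibly long and probably unpredictable length is what makes the set count as a buffer, or worse, given that it may take more memory than the string itself.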
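
To make the memory argument concrete, here is a minimal sketch that compares the two sizes. It assumes CPython, where `sys.getsizeof` reports the size of the container itself (for a set, that excludes the elements it refers to); the exact byte counts vary between versions and platforms, so none are quoted here.

```python
import string
import sys

# A string with no repeated characters: 52 distinct ASCII letters.
s = string.ascii_letters
unique = set(s)

# With no duplicates, the set has exactly as many elements as the string.
print(len(unique) == len(s))

# Size of the string object versus size of the set's hash table.
print("string bytes:", sys.getsizeof(s))
print("set bytes:   ", sys.getsizeof(unique))
```

On the CPython builds I would expect most people to be running, the set comes out several times larger than the string.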

Answer 2

To follow up on v.coder's comment, I rewrote the code he (or she) was referring to in Python, and added some comments to try to explain what is going on.

def removeduplicates(s):
    """Original Java implementation by
    Dhruv Gairola (http://stackoverflow.com/users/495545/dhruv-gairola)
    in his/her answer
    http://stackoverflow.com/questions/2598129/function-to-remove-duplicate-characters-in-a-string/10473835#10473835
    """
    # python strings are immutable, so first converting the string to a list of integers,
    # each integer representing the ascii value of the letter
    # (hint: look up "ascii table" on the web)
    L = [ord(char) for char in s]

    # easiest solution is to use a set, but to use Dhruv Gairola's method...
    # (hint, look up "bitmaps" on the web to learn more!)
    bitmap = 0
    #seen = set()

    for index, char in enumerate(L):
        # first check for duplicates:
        # number of bits to shift left. The space is the "lowest"
        # printable character in the ascii table, and 'char' here is the
        # position of the current character in that table. So if 'char'
        # is a space, the shift length will be 0; if 'char' is '!', the
        # shift length will be 1, and so on. This requires the integer
        # to have as many bit positions as there are characters from
        # the space to the ~. Python ints are arbitrary-precision, so
        # that is fine. (Characters below the space would give a
        # negative shift, which raises ValueError.)
        shift_length = char - ord(' ')

        # make a new integer where only one bit is set;
        # the bit position the character corresponds to
        bit_position = 1 << shift_length

        # if the same bit is already set [to 1] in the bitmap,
        # AND'ing the two integers together gives an integer in which
        # exactly that bit is set, so the result is greater than zero.
        # (Python ints are arbitrary-precision and have no fixed-width
        # sign bit, so the shift cannot overflow into a sign bit. In
        # Python, & binds tighter than >; the parentheses are only for
        # readability.)
        bit_position_already_occupied = (bitmap & bit_position) > 0

        if bit_position_already_occupied:
        #if char in seen:
            L[index] = 0
        else:
            # update the bitmap to indicate that this character
            # is now seen. 'bit_position' already has a single bit set:
            # the bit that corresponds to this character. OR'ing it
            # into the bitmap sets that bit, because every bit that is
            # 1 in *either* operand is 1 in the result.
            bitmap = bitmap | bit_position
            #seen.add(char)

    # finally, turn the list back to a string to be able to return it
    # (again, just kind of a way to "get around" immutable python strings)
    return ''.join(chr(i) for i in L if i != 0)


if __name__ == "__main__":
    print(removeduplicates('aaaa'))
    print(removeduplicates('aabcdee'))
    print(removeduplicates('aabbccddeeefffff'))
    print(removeduplicates('&%!%)(FNAFNZEFafaei515151iaaogh6161626)([][][   ao8faeo~~~````%!)"%fakfzzqqfaklnz'))
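
For comparison, here is a rough sketch of an approach that uses neither a set nor a bitmap, only two index variables, at the cost of O(n^2) time. The function name `remove_duplicates_in_place` is my own, and it assumes the input is already a mutable list of characters. Python strings are immutable, so the `list(...)` conversion in the usage lines is itself a copy that the original exercise (written with C-style arrays in mind) would not need.

```python
def remove_duplicates_in_place(chars):
    """Remove duplicate characters from a list in place, keeping first occurrences.

    Uses only index variables, no auxiliary collection. Runs in O(n^2) time.
    """
    tail = 0  # chars[:tail] always holds the unique characters found so far
    for i in range(len(chars)):
        # Scan the unique prefix for the current character.
        j = 0
        while j < tail and chars[j] != chars[i]:
            j += 1
        if j == tail:
            # Not found in the prefix: keep it by moving it to the tail.
            chars[tail] = chars[i]
            tail += 1
    # Drop the leftover slots after the unique prefix.
    del chars[tail:]
    return chars


chars = list("aabcdee")
remove_duplicates_in_place(chars)
print("".join(chars))  # abcde
```

Unlike `sorted(set(s))`, this keeps the characters in their original order of first appearance.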

Does a set count as a buffer in Python?
