Question

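The built-in chr() function is defined for integers in the range 0 through 1,114,111 (0x10FFFF), but I need something very similar that works for a much larger range of integers.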

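Background: I'm using Google's diff_match_patch library in line mode to compare two large CSV files, but I'm limited by the number of Unicode code points because of how the library implements its string-to-Unicode hashing: chars.append(chr(len(lineArray) - 1)).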

I'm trying to build a wrapper function so that I can hash a much larger number of unique lines at once. How can I do that?

  def diff_linesToChars(self, text1, text2):
    """Split two texts into an array of strings.  Reduce the texts to a string
    of hashes where each Unicode character represents one line.
    Args:
      text1: First string.
      text2: Second string.
    Returns:
      Three element tuple, containing the encoded text1, the encoded text2 and
      the array of unique strings.  The zeroth element of the array of unique
      strings is intentionally blank.
    """
    lineArray = []  # e.g. lineArray[4] == "Hello\n"
    lineHash = {}   # e.g. lineHash["Hello\n"] == 4

    # "\x00" is a valid character, but various debuggers don't like it.
    # So we'll insert a junk entry to avoid generating a null character.
    lineArray.append('')

    def diff_linesToCharsMunge(text):
      """Split a text into an array of strings.  Reduce the texts to a string
      of hashes where each Unicode character represents one line.
      Modifies lineArray and lineHash through being a closure.
      Args:
        text: String to encode.
      Returns:
        Encoded string.
      """
      chars = []
      # Walk the text, pulling out a substring for each line.
      # text.split('\n') would temporarily double our memory footprint.
      # Modifying text would create many large strings to garbage collect.
      lineStart = 0
      lineEnd = -1
      while lineEnd < len(text) - 1:
        lineEnd = text.find('\n', lineStart)
        if lineEnd == -1:
          lineEnd = len(text) - 1
        line = text[lineStart:lineEnd + 1]
        lineStart = lineEnd + 1

        if line in lineHash:
          chars.append(chr(lineHash[line]))
        else:
          lineArray.append(line)
          lineHash[line] = len(lineArray) - 1
          chars.append(chr(len(lineArray) - 1))
      return "".join(chars)

    chars1 = diff_linesToCharsMunge(text1)
    chars2 = diff_linesToCharsMunge(text2)
    return (chars1, chars2, lineArray)
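
To make the idea concrete, here is a minimal sketch of the kind of chr()-like helper I have in mind. The names big_chr, big_ord, BASE and the width parameter are hypothetical and not part of diff_match_patch. It encodes an integer as a fixed-width token of several code points instead of a single one, skipping the null character and the surrogate range. One caveat I'm unsure how to handle: the diff itself works character by character, so it could split a multi-character token in the middle.

```python
# Hypothetical helpers, not part of diff_match_patch.
# Each "digit" maps to a code point in 1..0xD7FF, which avoids
# the null character and the surrogate range (0xD800-0xDFFF).
BASE = 0xD7FF


def big_chr(n, width=2):
    """Encode a non-negative integer as a fixed-width string of code points."""
    if not 0 <= n < BASE ** width:
        raise ValueError("integer out of range for the given width")
    digits = []
    for _ in range(width):
        n, d = divmod(n, BASE)
        digits.append(chr(d + 1))  # +1 so that no digit is "\x00"
    # Most significant digit first.
    return "".join(reversed(digits))


def big_ord(token):
    """Decode a token produced by big_chr back into its integer."""
    n = 0
    for ch in token:
        n = n * BASE + (ord(ch) - 1)
    return n
```

With width=2 this covers BASE ** 2 distinct values (about 3 billion), and every token has the same length, so an encoded string can be cut back into tokens by position.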

Here is the full source: diff_match_patch

Converting large integer values to a hex string - beyond chr()

0 answers: