内置的chr()函数是为范围0 through 1,114,111(0x10FFFF)
中的整数定义的,但是我需要非常相似的东西,但是对于更大范围的整数。
背景:我在线路模式下使用Google的diff_match_patch
库来比较两个大型CSV文件,但我受这个数字的限制Unicode代码点由于在库中实现字符串到unicode散列的方式 - chars.append(chr(len(lineArray) - 1))
。
我正在尝试构建一个包装函数,以便我能够一次散列大量的唯一行。我该怎么做呢?
def diff_linesToChars(self, text1, text2):
"""Split two texts into an array of strings. Reduce the texts to a string
of hashes where each Unicode character represents one line.
Args:
text1: First string.
text2: Second string.
Returns:
Three element tuple, containing the encoded text1, the encoded text2 and
the array of unique strings. The zeroth element of the array of unique
strings is intentionally blank.
"""
lineArray = [] # e.g. lineArray[4] == "Hello\n"
lineHash = {} # e.g. lineHash["Hello\n"] == 4
# "\x00" is a valid character, but various debuggers don't like it.
# So we'll insert a junk entry to avoid generating a null character.
lineArray.append('')
def diff_linesToCharsMunge(text):
"""Split a text into an array of strings. Reduce the texts to a string
of hashes where each Unicode character represents one line.
Modifies linearray and linehash through being a closure.
Args:
text: String to encode.
Returns:
Encoded string.
"""
chars = []
# Walk the text, pulling out a substring for each line.
# text.split('\n') would would temporarily double our memory footprint.
# Modifying text would create many large strings to garbage collect.
lineStart = 0
lineEnd = -1
while lineEnd < len(text) - 1:
lineEnd = text.find('\n', lineStart)
if lineEnd == -1:
lineEnd = len(text) - 1
line = text[lineStart:lineEnd + 1]
lineStart = lineEnd + 1
if line in lineHash:
chars.append(chr(lineHash[line]))
else:
lineArray.append(line)
lineHash[line] = len(lineArray) - 1
chars.append(chr(len(lineArray) - 1))
return "".join(chars)
chars1 = diff_linesToCharsMunge(text1)
chars2 = diff_linesToCharsMunge(text2)
return (chars1, chars2, lineArray)
以下是完整的来源:diff_match_patch