Question

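With the following code I get different index values on Mac and Ubuntu. Both are 64-bit machines running Python 2.7.8. The string in the messages.json file has some UTF-8 characters at the beginning. The content of the file is: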

  #Bangalore fine dinning table bookings in best price ⚡⚡⚡⚡⚡⚡⚡⚡⚡

The Python code is as follows:

import re

f = open('messages.json', 'r')
text = f.read().decode('UTF-8')
f.close()

print type(text)

for m in re.finditer('#Bangalore', text): 
    s = m.start()
    e = m.end()
    print s, e
    print text[s:e]

On Ubuntu:

<type 'unicode'>
11 21
#Bangalore

On Mac:

<type 'unicode'>
20 30
#Bangalore

Answer 1

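The problem is that your string contains code points greater than 0xFFFF ("astral" characters). Python (before version 3.3) comes in two flavors: "narrow" and "wide". Narrow builds store strings as 16-bit code units and need two units (a surrogate pair) for each astral character: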

Python 2.7.5 (default, Mar  9 2014, 22:15:05) 
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
65535
>>> s = u'#Bangalore'
>>> s.index('#')
2

A "wide" build uses 32 bits per unit and represents every Unicode character with a single unit:

Python 2.7.2+ (default, Jul 20 2012, 22:15:08) 
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
1114111
>>> s = u'#Bangalore'
>>> s.index('#')
1
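
To see where the two sets of indices come from without having both builds at hand, the sketch below counts UTF-16 code units the way a narrow build does. It assumes a wide build or Python 3.3+; the helper name utf16_index and the sample character U+1F37D are illustrative assumptions, since the actual characters at the start of the file are not shown.

```python
from __future__ import print_function


def utf16_index(text, codepoint_index):
    """Return the offset a narrow (UTF-16) build would report for the
    character at `codepoint_index`. Hypothetical helper; assumes `text`
    is indexed by code point (wide build or Python 3.3+)."""
    prefix = text[:codepoint_index]
    # A UTF-16 code unit is 2 bytes; an astral character encodes to 4 bytes.
    return len(prefix.encode('utf-16-le')) // 2


# Assumed sample: one astral character (U+1F37D) in front of the hashtag.
sample = u'\U0001F37D#Bangalore'
wide = sample.index(u'#')           # index counted in code points
narrow = utf16_index(sample, wide)  # index counted in UTF-16 code units
print(wide, narrow)                 # 1 2 on Python 3.3+ or a wide build
```

Each astral character before the match shifts the narrow-build index by one extra position, which would be consistent with the gap between 11 and 20 in the question if nine such characters precede the hashtag.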

Possible workarounds are:

use a modern Python
install a wide Python on OS X
rewrite your code so that it doesn't need absolute positions
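
If positions do have to be compared across machines, one further option (not part of the answer above, only a sketch) is to normalise every index to a code-point offset. The helper codepoint_offset and the sample text are assumptions for illustration; the idea is to count each surrogate pair once by subtracting the high surrogates found before the index.

```python
from __future__ import print_function
import re


def codepoint_offset(text, index):
    """Convert a string index into a count of Unicode code points.
    Hypothetical helper: on a narrow build each astral character is a
    surrogate pair, so every high (lead) surrogate is subtracted once."""
    prefix = text[:index]
    high_surrogates = sum(1 for ch in prefix if u'\ud800' <= ch <= u'\udbff')
    return len(prefix) - high_surrogates


# Assumed sample text: two astral characters and a space before the hashtag.
text = u'\U0001F37D\U0001F37D #Bangalore fine dining'
spans = []
for m in re.finditer(u'#Bangalore', text):
    spans.append((codepoint_offset(text, m.start()),
                  codepoint_offset(text, m.end())))
print(spans)
```

On a wide build or Python 3.3+ the prefix contains no surrogates, so the helper returns the index unchanged; on a narrow build it removes the extra unit contributed by each astral character, so both kinds of build should report the same offsets.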

Index of a UTF-8 word in a string
