How can I slice a substring from a unicode string with Python?

Time: 2015-08-07 01:55:08

Tags: python string unicode substring

I have a unicode string as a result: u'splunk>\xae\uf001'

How can I get the substring 'uf001' as a simple string in Python?

3 answers:

Answer 0: (score: 2)

The characters uf001 are not actually present in the string (\uf001 is an escape sequence for a single code point), so you can't just slice them off. Instead, you can do

repr(s)[-6:-1]

or

'u' + hex(ord(s[-1]))[2:]
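
One caveat with the second expression: hex() does not zero-pad, so for code points below U+1000 it produces fewer than four digits (for example 'uae' rather than 'u00ae'). A minimal sketch of a padded variant, assuming Python 3; the helper name escape_name is hypothetical and not part of the original answer:

```python
def escape_name(ch):
    """Return the \\uXXXX escape of a single character without the backslash.

    Hypothetical helper for illustration; assumes ch is one code point
    in the Basic Multilingual Plane (at most U+FFFF).
    """
    return 'u' + format(ord(ch), '04x')


s = 'splunk>\xae\uf001'
last = escape_name(s[-1])    # escape text for the final character
padded = escape_name(s[-2])  # zero-padded, unlike hex(ord(...))[2:]
print(last, padded)
```

For characters outside the BMP the escape has eight digits (\UXXXXXXXX), so this sketch would need a wider format there.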

Answer 1: (score: 2)

Since you want the actual string (as seen from the comments), just take the last character with index [-1]. Example:

>>> a = u'splunk>\xae\uf001'
>>> print(a)
splunk>®
>>> a[-1]
'\uf001'
>>> print(a[-1])


If you want the escaped representation (\uf001), take repr(a[-1]). Example:

>>> repr(a[-1])
"'\\uf001'"

\uf001 is a single Unicode character (not several characters), so you can get it directly as shown above.

You see \uf001 because you are looking at the result of repr() on the string. If you print it or use it somewhere else (for example, write it to a file), it will be the actual U+F001 character.
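
The difference between the character and its escaped form can be checked directly. A small sketch, assuming Python 3 (where the u prefix is optional); the unicode_escape codec is used here as an alternative to slicing repr() and is not part of the original answer:

```python
a = 'splunk>\xae\uf001'

ch = a[-1]                    # the real character, one code point
assert len(ch) == 1
assert ord(ch) == 0xF001

# unicode_escape yields the ASCII escape text as bytes: b'\\uf001'
escaped = ch.encode('unicode_escape').decode('ascii')
assert len(escaped) == 6      # backslash, 'u', four hex digits

simple = escaped[1:]          # drop the leading backslash
print(simple)
```

Here simple is an ordinary ASCII string holding the five characters the question asks for.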

Answer 2: (score: 1)

u'' is how Unicode strings are written in Python source code. The REPL uses this representation by default to display unicode objects:

>>> u'splunk>\xae\uf001'
u'splunk>\xae\uf001'
>>> print(u'splunk>\xae\uf001')
splunk>®
>>> print(u'splunk>\xae\uf001'[-1])


If your terminal is not configured to display Unicode, or if you are using a narrow build (which is likely for Python 2 on Windows), the result may differ.

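A Unicode string is an immutable sequence of Unicode code points in Python. len(u'\uf001') == 1: it does not contain uf001 (5 characters). You could also write it as u'' with the raw U+F001 character between the quotes (on Python 2 you must declare the source file's character encoding if you use non-ASCII characters):

>>> u'\uf001' == u''  # the second literal holds the raw U+F001 character
True

It is just another way of writing exactly the same Unicode character (a single code point in this case).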
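
A quick check of these points, as a sketch assuming Python 3: the four-digit escape, chr(), and the eight-digit \U form all spell the same one-character string, and the text uf001 is not a substring of the data.

```python
s = 'splunk>\xae\uf001'

# Three spellings of the same single code point
same = '\uf001' == chr(0xF001) == '\U0000f001'

length = len(s)              # 'splunk>' is 7 characters, plus 2 escapes
has_text = 'uf001' in s      # the escape's letters are not in the data
print(same, length, has_text)
```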

Note: some user-perceived characters may span several Unicode code points, for example:

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'ё')
u'\u0435\u0308'
>>> print(unicodedata.normalize('NFKD', u'ё'))
ё
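
This matters when slicing: an index cuts between code points, not between user-perceived characters. A minimal sketch, assuming Python 3 and canonical (NFD/NFC) normalization rather than the NFKD form used above:

```python
import unicodedata

composed = '\u0451'                                   # 'ё' as one code point
decomposed = unicodedata.normalize('NFD', composed)   # 'е' + combining diaeresis

assert len(composed) == 1
assert len(decomposed) == 2

# Slicing the decomposed form drops the combining mark
base = decomposed[:1]
assert base == '\u0435'

# NFC recomposes the pair, so indexing sees one character again
recomposed = unicodedata.normalize('NFC', decomposed)
assert recomposed == composed
```

Normalizing to NFC before slicing is one way to reduce such splits, though it cannot help with sequences that have no precomposed form.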