Question

test_str = "Question: The cryptocurrency Bitcoin Cash (BCH/USD) settled at 1368 USD at 07:00 AM UTC at the Bitfinex exchange on Monday, April 23. In your opinion, will BCH/USD trade above 1500 USD (+9.65%) at anу timе bеfore Арril 28? Indicаtоr: 60.76%"

print(test_str)
print(test_str.split('before '))

This is the output I get after splitting:

"['Question: The cryptocurrency Bitcoin Cash (BCH/USD) settled at 1368 USD at 07:00 AM UTC at the Bitfinex exchange on Monday, April 23. In your opinion, will BCH/USD trade above 1500 USD (+9.65%) at an\xd1\x83 tim\xd0\xb5 b\xd0\xb5fore \xd0\x90\xd1\x80ril 28? Indic\xd0\xb0t\xd0\xber: 60.76%']"

Demo: https://repl.it/repls/VitalOrganicBackups

Answer 1

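The problem is caused by a mix of Latin and Cyrillic letters. They look exactly the same in most fonts, but they are still different characters with different code points.

The output in the question comes from Python 2.7 (which the original asker used), but equivalent behavior is easy to reproduce in Python 3: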

>>> print(test_str.encode('UTF8'))
b'Question: The cryptocurrency Bitcoin Cash (BCH/USD) settled at 1368 USD at 07:00 AM UTC at the Bitfinex exchange on Monday, April 23. In your opinion, will BCH/USD trade above 1500 USD (+9.65%) at an\xd1\x83 tim\xd0\xb5 b\xd0\xb5fore \xd0\x90\xd1\x80ril 28? Indic\xd0\xb0t\xd0\xber: 60.76%'

The unicodedata module helps show what is actually going on:

>>> import unicodedata
>>> for i in b'\xd1\x83\xd0\xb5\xd0\x90\xd1\x80\xd0\xbe'.decode('utf8'):
...     print(i, hex(ord(i)), i.encode('utf8'), unicodedata.name(i))

у 0x443 b'\xd1\x83' CYRILLIC SMALL LETTER U
е 0x435 b'\xd0\xb5' CYRILLIC SMALL LETTER IE
А 0x410 b'\xd0\x90' CYRILLIC CAPITAL LETTER A
р 0x440 b'\xd1\x80' CYRILLIC SMALL LETTER ER
о 0x43e b'\xd0\xbe' CYRILLIC SMALL LETTER O

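So the original text contains Cyrillic letters, and in a comparison they are not equal to their Latin look-alikes, even though they print the same. The all-Latin separator 'before ' therefore never matches the 'bеfore ' in the string, whose second letter is Cyrillic. The problem has nothing to do with split; the original string itself is bad.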
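
To find such look-alikes in an arbitrary string, one option is to scan for non-ASCII characters and report their Unicode names. This is a minimal Python 3 sketch: the helper name find_non_ascii and the shortened sample string are illustrative and not part of the question.

```python
import unicodedata

def find_non_ascii(text):
    """Return (index, char, code point, Unicode name) for each non-ASCII char."""
    return [
        (i, ch, hex(ord(ch)), unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]

# Shortened sample; the Cyrillic letters are written as escapes so they are visible.
sample = "at an\u0443 tim\u0435 b\u0435fore \u0410\u0440ril 28"

for hit in find_non_ascii(sample):
    print(hit)

# The all-Latin separator does not occur in the string, so split() returns one piece.
print("before " in sample)
print(len(sample.split("before ")))
```

Every tuple it prints points at a character that merely looks like ASCII, which is usually enough to decide whether the input needs cleaning before it is split.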

Answer 2

Decode the string using UTF-8 (Python 2):

test_str.decode("utf-8")
u'Question: The cryptocurrency Bitcoin Cash (BCH/USD) settled at 1368 USD at 07:00 AM UTC at the Bitfinex exchange on Monday, April 23. In your opinion, will BCH/USD trade above 1500 USD (+9.65%) at an\u0443 tim\u0435 b\u0435fore \u0410\u0440ril 28? Indic\u0430t\u043er: 60.76%'

Since it still contains some non-ASCII characters (for example CYRILLIC SMALL LETTER U), we can transliterate it further. Full list: Cyrillic Script Wiki

Using unidecode:

import unidecode
unidecode.unidecode(test_str.decode("utf-8"))
'Question: The cryptocurrency Bitcoin Cash (BCH/USD) settled at 1368 USD at 07:00 AM UTC at the Bitfinex exchange on Monday, April 23. In your opinion, will BCH/USD trade above 1500 USD (+9.65%) at anu time before Arril 28? Indicator: 60.76%'
unidecode.unidecode(test_str.decode("utf-8")).split("before ")
['Question: The cryptocurrency Bitcoin Cash (BCH/USD) settled at 1368 USD at 07:00 AM UTC at the Bitfinex exchange on Monday, April 23. In your opinion, will BCH/USD trade above 1500 USD (+9.65%) at anu time ',
 'Arril 28? Indicator: 60.76%']

Note: if you don't want to use unidecode, I found this article, which explains another approach in detail: Transliterating non-ASCII characters with Python
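
If the goal is to recover the text as it visually appears rather than to transliterate it phonetically, a small hand-built mapping with str.translate is another option in Python 3. The sketch below is only an illustration under stated assumptions: the HOMOGLYPHS table is hypothetical and covers just the six Cyrillic letters that occur in this question's string.

```python
# Hypothetical look-alike table: Cyrillic letter -> visually similar Latin letter.
# It only covers the characters that occur in the question's string.
HOMOGLYPHS = str.maketrans({
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
})

# Tail of the question's string, with the Cyrillic letters written as escapes.
tail = "at an\u0443 tim\u0435 b\u0435fore \u0410\u0440ril 28? Indic\u0430t\u043er: 60.76%"

cleaned = tail.translate(HOMOGLYPHS)
print(cleaned)
print(cleaned.split("before "))
```

Unlike unidecode, which maps у to u and р to r (hence "anu" and "Arril" in the output above), this table maps by appearance, so the cleaned text reads "any time before April 28" and the split yields two pieces.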

Why does the string change when using Python split?
