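I grabbed this paragraph from a web page:
It doesn’t look lake a controversial new case management system is going anywhere. So the city plans to spend the next few months helping local social assistance workers learn to live with it.
In the HTML data I downloaded, held as a Python unicode string, it looks like this: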
mystr = u'It doesn\u2019t look lake a controversial new case management system is going anywhere. So\xa0the city plans to spend the next few months helping local social assistance workers learn to live with it.'
My plan is to use something like mystr.find("doesn't") to find the position of a word. Currently, mystr.find("doesn't") returns -1, because mystr contains doesn\u2019t.
Is there a quick way to convert mystr into the paragraph above, so that all Unicode 'characters' are replaced with 'normal' characters and I can use str.find()?
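The best post I have found so far suggests replacing u'\u2019' with "'" and then u'\xa0' with ' '. Is there a more convenient way, so that I don't have to write a method and build a conversion dictionary myself?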
PS: I also tried unicodedata.normalize and similar things, which did not seem to work.
Edit: I forgot to mention that the Python version is 2.7.
Answer 0 (score: 2)
You already have exactly what the web page contains. \u2019 is U+2019 RIGHT SINGLE QUOTATION MARK, a fancy single quote, whereas you are searching for the plain ASCII single quote, U+0027 APOSTROPHE.
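To check which characters a string really contains, the standard library's unicodedata module can report each code point's official name. The following is a minimal sketch; the `chars` and `names` variables are illustrative only, and the code is written so that it should run on Python 2.7 as well as Python 3.

```python
import unicodedata

# The two characters from the downloaded text, each next to the
# ASCII character it is easily mistaken for.
chars = [u'\u2019', u"'", u'\xa0', u' ']
names = [unicodedata.name(ch) for ch in chars]

for ch, name in zip(chars, names):
    print('U+%04X %s' % (ord(ch), name))
```

This prints RIGHT SINGLE QUOTATION MARK and APOSTROPHE for the two quotes, and NO-BREAK SPACE and SPACE for the two spaces, which confirms that the pairs are distinct code points even though they look alike.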
If you print the value, you will see that the output looks very much like it has a single quote, only slightly curved:
>>> mystr = u'It doesn\u2019t look lake a controversial new case management system is going anywhere. So\xa0the city plans to spend the next few months helping local social assistance workers learn to live with it.'
>>> print mystr
It doesn’t look lake a controversial new case management system is going anywhere. So the city plans to spend the next few months helping local social assistance workers learn to live with it.
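All Python did was echo the representation of the string, which replaces anything non-printable or non-ASCII with escape sequences so that the value is reproducible; you can copy and paste it into any Python interpreter or script and it will produce the same value. Because Python 2's default source encoding is ASCII, only ASCII characters are used to describe the value.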
You could search for that literal instead:
>>> u'doesn\u2019t' in mystr
True
Or you could use a library like unidecode to replace non-ASCII code points with ASCII 'lookalikes'; it will replace the fancy quote with a plain ASCII quote:
>>> from unidecode import unidecode
>>> unidecode(mystr)
"It doesn't look lake a controversial new case management system is going anywhere. So the city plans to spend the next few months helping local social assistance workers learn to live with it."
>>> "doesn't" in unidecode(mystr)
True
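If installing a third-party library is not an option, a standard-library sketch along the following lines may be enough for text like this. It is not a replacement for unidecode: `QUOTE_MAP` and `to_plain` are hypothetical names, and the mapping covers only the four most common curly quotes. It also shows why unicodedata.normalize alone does not solve the problem: NFKC folds U+00A0 into an ordinary space, but U+2019 has no decomposition and passes through unchanged.

```python
import unicodedata

# Hypothetical, deliberately small table; extend it for your own data.
QUOTE_MAP = {
    0x2018: u"'",   # LEFT SINGLE QUOTATION MARK
    0x2019: u"'",   # RIGHT SINGLE QUOTATION MARK
    0x201C: u'"',   # LEFT DOUBLE QUOTATION MARK
    0x201D: u'"',   # RIGHT DOUBLE QUOTATION MARK
}

def to_plain(text):
    # NFKC replaces compatibility characters such as U+00A0 with their
    # plain equivalents, but leaves the curly quotes untouched.
    folded = unicodedata.normalize('NFKC', text)
    # translate() then swaps the remaining quotes via the table above.
    return folded.translate(QUOTE_MAP)

sample = (u'It doesn\u2019t look lake a controversial new case management '
          u'system is going anywhere. So\xa0the city plans to spend the '
          u'next few months')
plain = to_plain(sample)
print(plain.find(u"doesn't"))
print(u'So the city' in plain)
```

Be aware that NFKC also rewrites other compatibility characters (ligatures, full-width forms, superscripts), so it is only appropriate when the result is used for searching rather than for display.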
Answer 1 (score: 0)
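Because the text is not doesn't but doesn’t. The quote is a non-ASCII Unicode character, so if you pass doesn’t as a plain byte string, Python raises UnicodeDecodeError. You therefore need to add a u prefix to the doesn’t literal: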
>>> mystr.find("doesn’t")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 5: ordinal not in range(128)
>>> mystr.find(u"doesn’t")
3
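The traceback above arises because Python 2 tries to decode the byte string implicitly with the ASCII codec before comparing it with the unicode string. When the search term arrives as bytes (from a terminal or a file, for example), decoding it explicitly avoids the error. The sketch below assumes the bytes are UTF-8; `needle_bytes`, `needle`, and `position` are illustrative names.

```python
mystr = (u'It doesn\u2019t look lake a controversial new case management '
         u'system is going anywhere.')

# The UTF-8 encoding of the search term: U+2019 becomes the three
# bytes 0xE2 0x80 0x99 (the 0xe2 named in the traceback).
needle_bytes = b'doesn\xe2\x80\x99t'

# Decode explicitly instead of relying on the implicit ASCII decoding.
needle = needle_bytes.decode('utf-8')

position = mystr.find(needle)
print(position)
```

The result is the same index, 3, that the u-prefixed literal produces.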