Question

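During web scraping, and after stripping all HTML tags, I ended up with the black telephone character \u260e (☎) in my unicode text. But unlike this response, I want to get rid of it as well.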

I use the following regular expression in Scrapy to remove the HTML tags:

pattern = re.compile("<.*?>|&nbsp;|&amp;",re.DOTALL|re.M)

Then I tried to match \u260e as well, and I think I got caught by the backslash plague. I tried these patterns, without success:

pattern = re.compile("<.*?>|&nbsp;|&amp;|\u260e",re.DOTALL|re.M)
pattern = re.compile("<.*?>|&nbsp;|&amp;|\\u260e",re.DOTALL|re.M)
pattern = re.compile("<.*?>|&nbsp;|&amp;|\\\\u260e",re.DOTALL|re.M)

None of these worked, and I still get ☎ in the output. How can I make it disappear?

Answer 1

With Python 2.7.3, the following works fine:

import re

pattern = re.compile(u"<.*?>|&nbsp;|&amp;|\u260e",re.DOTALL|re.M)
s = u"bla ble \u260e blo"
re.sub(pattern, "", s)

Output:

u'bla ble  blo'

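As @Zack pointed out, this works because the pattern is now a unicode string: the escape has already been converted, so \u260e is no longer a sequence of six characters but the single character ☎ itself (code point U+260E, stored as however many bytes the encoding needs) (: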

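Once both the string being searched and the regular expression contain the black telephone itself, rather than the character sequence \u260e, they match.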
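The difference between the character and the escape text can be checked directly. The following is only a minimal sketch, assuming Python 3 (where every str is already unicode, unlike the Python 2.7.3 session above); the names TAGS, text, cleaned and escaped are made up for illustration.

```python
import re

# Hypothetical helper name: the tag/entity part of the pattern from the question.
TAGS = r"<.*?>|&nbsp;|&amp;"

# In a Python 3 str literal, "\u260e" is already the single character U+260E.
pattern = re.compile(TAGS + "|\u260e", re.DOTALL)

text = "<p>call us&nbsp;\u260e now</p>"
cleaned = pattern.sub("", text)
print(len("\u260e"))  # 1: one character, not six
print(cleaned)        # call us now

# A doubled backslash makes the regex look for the literal text
# backslash-u-2-6-0-e, which never occurs in text holding the real character.
escaped = re.compile(r"\\u260e")
print(escaped.search("\u260e"))   # None
print(escaped.search(r"\u260e"))  # a match object: here the input is the six-character text
```

If the scraped text is a Python 2 byte string instead, it would presumably need to be decoded to unicode first so that both sides of the match hold the same kind of data.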

Answer 2

If your string is already unicode, there are two simple ways. Obviously, the second one affects more than just ☎.

>>> import string                                   
>>> foo = u"Lorum ☎ Ipsum"                          
>>> foo.replace(u'☎', '')                           
u'Lorum  Ipsum'                                     
>>> "".join(s for s in foo if s in string.printable)
u'Lorum  Ipsum'

See "Remove non-ascii characters but leave periods and spaces" to learn more about string.printable.
See "The SHORTEST way to remove multiple spaces in a string in Python" if you don't want multiple spaces.
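Both steps can be combined into small helpers. This is a rough sketch, assuming Python 3; the function names strip_non_printable and collapse_spaces are invented here, and the split/join idiom is just one common way to squeeze whitespace, not necessarily the one from the linked question.

```python
import string

def strip_non_printable(text):
    """Keep only characters listed in string.printable (ASCII only)."""
    return "".join(ch for ch in text if ch in string.printable)

def collapse_spaces(text):
    """Replace every run of whitespace with a single space and trim the ends."""
    return " ".join(text.split())

foo = "Lorum \u260e Ipsum"
stripped = strip_non_printable(foo)
print(repr(stripped))                   # 'Lorum  Ipsum' (two spaces remain)
print(repr(collapse_spaces(stripped)))  # 'Lorum Ipsum'

# string.printable covers ASCII only, so accented letters are dropped too.
print(repr(strip_non_printable("caf\u00e9 \u260e")))  # 'caf '
```

Because the filter removes every non-ASCII character, it is only a reasonable choice when the text is expected to be plain ASCII; otherwise the targeted replace is safer.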

Answer 3

You could try BeautifulSoup, as described here, using something like:

soup = BeautifulSoup(html.decode('utf-8', 'ignore'))
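It may help to see what the 'ignore' error handler does on its own, independent of BeautifulSoup. The sketch below assumes Python 3 and uses made-up sample bytes: when decoding, 'ignore' only drops bytes that are not valid UTF-8, so a correctly encoded ☎ survives; applying 'ignore' while encoding to ASCII is what actually discards it.

```python
# Hypothetical scraped bytes: a valid UTF-8 ☎ (E2 98 8E) followed by an invalid byte (FF).
raw = b"bla \xe2\x98\x8e\xff blo"

# Decoding with 'ignore' silently drops the invalid byte but keeps the phone.
decoded = raw.decode("utf-8", "ignore")
print(decoded == "bla \u260e blo")  # True

# Encoding to ASCII with 'ignore' drops every non-ASCII character, ☎ included.
ascii_only = decoded.encode("ascii", "ignore").decode("ascii")
print(repr(ascii_only))  # 'bla  blo'
```

Like the string.printable filter, the ASCII round trip removes all non-ASCII characters, not just the telephone.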

How to eliminate the ☎ unicode character?
