Question

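I'm trying to scrape a story from the website I work for, given its URL, and then post it to the various news partners we have. The problem is that special characters seem to trip it up. I'm trying to run .replace on the string, but it doesn't work particularly well.

Is there any way to force the output to be completely plain text that can be posted anywhere? As in, no special characters?

My current code is: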

from __future__ import division
#from __future__ import unicode_literals
from __future__ import print_function
import spynner
from mechanize import Browser
import SendKeys
from BeautifulSoup import BeautifulSoup

br = Browser()
url = "http://www.benzinga.com/trading-ideas/long-ideas/11/07/1815251/bargain-hunting-for-mid-caps-five-stocks-worth-taking-a-look-"
page = br.open(url)
html = page.read()
soup = BeautifulSoup(html)

artcontent = soup.find('div', {'class': 'article-content'})

title = artcontent.find('h1', {'id': 'title'})

title = title.string

try:
    title = title.replace("&#039;", "'")
except:
    pass

authorname = artcontent.find('div', {'class': 'node full'})
authorname = authorname.find('div', {'class': 'article-submitted'})
authorname = authorname.find('div', {'class': 'info'})
authorname = authorname.find('a')
authorname = authorname.string

story = artcontent.find('div', {'class': 'node full'})
story = story.find('div', {'class': 'content clear-block'})
story = story.findAll('p', {'class': None})

#story = [str(x).replace("<p>","\n\n").replace("</p>","") for x in story]

story = [str(x) for x in story]

storyunified = ''.join(story)

#try:
#    storyunified = storyunified.strip("\n")
#except:
#    pass
#try:
#    storyunified = storyunified.strip("\n")
#except:
#    pass

#print(storyunified)

try:
    storyunified = storyunified.replace("Â", "")
except:
    pass

try:
    storyunified = storyunified.replace("â€", "\'")
except:
    pass

try:
    storyunified = storyunified.replace('“', '\"')
except:
    pass

try:
    storyunified = storyunified.replace('"', '\"')
except:
    pass

try:
    storyunified = storyunified.replace('”', '\"')
except:
    pass

try:
    storyunified = storyunified.replace("âﾀ", "")
except:
    pass

try:
    storyunified = storyunified.replace("â€", "")
except:
    pass

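As you can see, I'm trying to get rid of them manually, but it doesn't always seem to work.

I then try to post it using Spynner, but I don't think that code is relevant. I'm posting to a Forbes blog.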

Answer 1

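Please take a look at this article and see whether you're already familiar with the principles it discusses: http://www.joelonsoftware.com/articles/Unicode.html

My hunch is that your news partners are able to accept text outside the ASCII range. You just need to make sure your application handles strings and byte strings correctly, and everything should work naturally.

In Python 2.x, 'this text' is a byte string and u'this text' is a (Unicode) string. In Python 3.x, 'this text' is a string and b'this text' is a byte string. Byte strings have a .decode(encoding) method, and strings have an .encode(encoding) method.

Good luck!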
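To make the bytes/string distinction concrete, here is a minimal Python 3 sketch (not the asker's Python 2 setup). It assumes the page is served as UTF-8, which the question does not confirm, and the sample sentence is invented for illustration. It shows how decoding with the wrong codec yields the "â€" debris seen in the question, and how decoding with the right one avoids it.

```python
# Python 3: str is text, bytes is raw data.
text = "It\u2019s a bargain"      # contains a curly apostrophe (U+2019)
raw = text.encode("utf-8")        # roughly what a web server sends over the wire

# Decoding with the wrong codec (assumed here: cp1252) produces mojibake.
garbled = raw.decode("cp1252")    # the apostrophe turns into three junk characters

# Decoding with the codec the bytes were actually written in gives the text back.
clean = raw.decode("utf-8")

# Mojibake of this kind can often be reversed by undoing the wrong step.
repaired = garbled.encode("cp1252").decode("utf-8")

print(repr(garbled))
print(clean == text, repaired == text)
```

In Python 2 the same idea would presumably apply to the question's code as html.decode('utf-8') right after page.read(), so that everything downstream works on Unicode strings rather than bytes.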

Answer 2

I was wrestling with character encoding in Python just the other day.

Try this:

import unicodedata

storyunified = unicodedata.normalize('NFKD', storyunified).encode('ascii','ignore').decode("ascii")

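One caveat is that it deletes the offending characters rather than replacing them. To change this behavior, you can change ignore to replace, but I haven't tested that.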
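As a rough illustration of the difference, the Python 3 sketch below compares the two error handlers. The PUNCTUATION mapping and the to_ascii helper are hypothetical names introduced here, not part of the answer above. They are added because NFKD does not decompose curly quotes or dashes, so those would otherwise be silently dropped by ignore.

```python
import unicodedata

# Hypothetical mapping of common typographic characters to ASCII stand-ins.
PUNCTUATION = {
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
}
TABLE = str.maketrans(PUNCTUATION)


def to_ascii(text, errors="ignore"):
    """Map known punctuation, decompose accents, then force ASCII."""
    text = text.translate(TABLE)
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors).decode("ascii")


sample = "Caf\u00e9 \u201cdeal\u201d"

# Without the mapping, the quotes vanish entirely.
plain = unicodedata.normalize("NFKD", sample).encode("ascii", "ignore").decode("ascii")

print(plain)                        # quotes dropped
print(to_ascii(sample))             # quotes kept as straight quotes
print(to_ascii(sample, "replace"))  # leftover combining accent becomes '?'
```

Note that with replace, the accent that NFKD splits off from the "e" becomes a stray question mark, so ignore is probably the better choice once the punctuation has been mapped.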

Need to convert all text to plain text / ASCII (I think?)
