Need to convert all text to plain text / ASCII (I think?)

Time: 2011-08-08 14:55:52

Tags: python encoding mechanize

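I'm trying to grab a story from the site I work for when you enter its URL, and then post it to the various news partners we have. The problem is that special characters seem to be giving it hiccups. I'm trying to run .replace on the strings, but it doesn't work particularly well.

Is there any way to force the output to be completely plain text that can be posted anywhere? Like, no special characters?

My current code is: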

from __future__ import division
#from __future__ import unicode_literals
from __future__ import print_function
import spynner
from mechanize import Browser
import SendKeys
from BeautifulSoup import BeautifulSoup

br = Browser()
url = "http://www.benzinga.com/trading-ideas/long-ideas/11/07/1815251/bargain-hunting-for-mid-caps-five-stocks-worth-taking-a-look-"
page = br.open(url)
html = page.read()
soup = BeautifulSoup(html)

artcontent = soup.find('div', {'class': 'article-content'})

title = artcontent.find('h1', {'id': 'title'})

title = title.string

try:
    title = title.replace("'", "'")
except:
    pass

authorname = artcontent.find('div', {'class': 'node full'})
authorname = authorname.find('div', {'class': 'article-submitted'})
authorname = authorname.find('div', {'class': 'info'})
authorname = authorname.find('a')
authorname = authorname.string

story = artcontent.find('div', {'class': 'node full'})
story = story.find('div', {'class': 'content clear-block'})
story = story.findAll('p', {'class': None})

#story = [str(x).replace("<p>","\n\n").replace("</p>","") for x in story]

story = [str(x) for x in story]

storyunified = ''.join(story)

#try:
#    storyunified = storyunified.strip("\n")
#except:
#    pass
#try:
#    storyunified = storyunified.strip("\n")
#except:
#    pass

#print(storyunified)

try:
    storyunified = storyunified.replace("Â", "")
except:
    pass

try:
    storyunified = storyunified.replace("â€", "\'")
except:
    pass

try:
    storyunified = storyunified.replace('“', '\"')
except:
    pass

try:
    storyunified = storyunified.replace('"', '\"')
except:
    pass

try:
    storyunified = storyunified.replace('”', '\"')
except:
    pass

try:
    storyunified = storyunified.replace("âタ", "")
except:
    pass

try:
    storyunified = storyunified.replace("â€", "")
except:
    pass

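As you can see, I'm trying to get rid of them manually, but it doesn't always seem to work.

I then try to post using Spynner, but I don't think that code is critical. I'm posting to a Forbes blog.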

2 answers:

Answer 0: (score: 2)

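Take a look at this article and see whether you're already familiar with the principles it discusses: http://www.joelonsoftware.com/articles/Unicode.html

My hunch is that your news partners can accept text outside the ASCII range. You just need to make sure your application handles strings and byte strings correctly, and everything should work naturally.

In Python 2.x, 'this text' is a byte string and u'this text' is a Unicode string. In Python 3.x, 'this text' is a string and b'this text' is a byte string. Byte strings have a .decode(encoding) method, and strings have an .encode(encoding) method.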
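As a rough illustration of the decode/encode distinction, here is a minimal sketch. The sample bytes are made up, and it assumes the page is served as UTF-8, which you should check against the response headers or the page's meta tag. It also shows that decoding UTF-8 bytes with the wrong codec (cp1252 here, chosen as an assumption) yields debris like the "â€" sequences in the question.

```python
# UTF-8 bytes as they might arrive from page.read(); the sample text is made up.
raw = b"it\xe2\x80\x99s"

# Decode once, at the boundary, using the encoding the page actually declares.
text = raw.decode("utf-8")        # a Unicode string containing U+2019

# Decoding the same bytes with the wrong codec produces the familiar debris.
garbled = raw.decode("cp1252")    # a-circumflex, euro sign, trademark sign

# Encode again only at the point where the text is written out or posted.
round_trip = text.encode("utf-8")
```

If the partner really does accept UTF-8, keeping everything as Unicode strings in between and encoding only at the end may make the chain of .replace calls unnecessary.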

Good luck!

Answer 1: (score: 1)

I was wrestling with character encoding in Python just the other day.

Try this:

import unicodedata

storyunified = unicodedata.normalize('NFKD', storyunified).encode('ascii','ignore').decode("ascii")

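One caveat is that it removes the problem characters instead of replacing them. To change this behavior, you can change ignore to replace, but I haven't tested that at all.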
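To make the difference between ignore and replace concrete, here is a small sketch. It assumes the text has already been decoded to a Unicode string; to_ascii and the PUNCTUATION table are hypothetical names introduced for illustration, not part of any library. Curly quotes and dashes have no NFKD decomposition, so the sketch maps them to ASCII look-alikes first instead of letting them be dropped.

```python
import unicodedata

# Hypothetical mapping of common typographic punctuation to ASCII look-alikes.
PUNCTUATION = {
    0x2018: u"'", 0x2019: u"'",   # single curly quotes
    0x201C: u'"', 0x201D: u'"',   # double curly quotes
    0x2013: u"-", 0x2014: u"-",   # en dash and em dash
}

def to_ascii(text, errors="ignore"):
    """Return an ASCII-only version of a Unicode string."""
    # Swap typographic punctuation for plain equivalents first.
    text = text.translate(PUNCTUATION)
    # Split accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # 'ignore' drops what is left over; 'replace' turns each leftover into '?'.
    return decomposed.encode("ascii", errors).decode("ascii")

sample = u"\u201cCaf\xe9\u201d \u2013 it\u2019s"
plain = to_ascii(sample)                # "Cafe" - it's
marked = to_ascii(sample, "replace")    # "Cafe?" - it's
```

Note that with replace, the combining accent left over after normalization also becomes a question mark, so ignore is probably the better fit when combined with NFKD.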