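I'm trying to scrape a story from the website I work for, given its URL, and then post it to the various news partners we have. The problem is that special characters seem to be tripping it up. I'm trying to run .replace on the string, but it doesn't work particularly well.
Is there any way to force the output to be completely plain text that can be posted anywhere? That is, with no special characters?
My current code is: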
from __future__ import division
#from __future__ import unicode_literals
from __future__ import print_function
import spynner
from mechanize import Browser
import SendKeys
from BeautifulSoup import BeautifulSoup
br = Browser()
url = "http://www.benzinga.com/trading-ideas/long-ideas/11/07/1815251/bargain-hunting-for-mid-caps-five-stocks-worth-taking-a-look-"
page = br.open(url)
html = page.read()
soup = BeautifulSoup(html)
artcontent = soup.find('div', {'class': 'article-content'})
title = artcontent.find('h1', {'id': 'title'})
title = title.string
try:
    title = title.replace("'", "'")
except:
    pass
authorname = artcontent.find('div', {'class': 'node full'})
authorname = authorname.find('div', {'class': 'article-submitted'})
authorname = authorname.find('div', {'class': 'info'})
authorname = authorname.find('a')
authorname = authorname.string
story = artcontent.find('div', {'class': 'node full'})
story = story.find('div', {'class': 'content clear-block'})
story = story.findAll('p', {'class': None})
#story = [str(x).replace("<p>","\n\n").replace("</p>","") for x in story]
story = [str(x) for x in story]
storyunified = ''.join(story)
#try:
# storyunified = storyunified.strip("\n")
#except:
# pass
#try:
# storyunified = storyunified.strip("\n")
#except:
# pass
#print(storyunified)
try:
    storyunified = storyunified.replace("Â", "")
except:
    pass
try:
    storyunified = storyunified.replace("â€", "\'")
except:
    pass
try:
    storyunified = storyunified.replace('“', '\"')
except:
    pass
try:
    storyunified = storyunified.replace('"', '\"')
except:
    pass
try:
    storyunified = storyunified.replace('”', '\"')
except:
    pass
try:
    storyunified = storyunified.replace("âタ", "")
except:
    pass
try:
    storyunified = storyunified.replace("â€", "")
except:
    pass
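As you can see, I'm trying to strip them out by hand, but that doesn't always work.
I then try to post using Spynner, but I don't think that code is crucial here. I'm posting to a Forbes blog.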
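Answer 0 (score: 2)
Please take a look at this article and see whether you're already familiar with the principles it discusses: http://www.joelonsoftware.com/articles/Unicode.html
My hunch is that your news partners are able to accept text outside the ASCII range. You just need to make sure your application handles character strings and byte strings correctly, and everything should work naturally.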
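In Python 2.x, 'this text' is a byte string and u'this text' is a character (Unicode) string. In Python 3.x, 'this text' is a character string and b'this text' is a byte string. Byte strings have a .decode(encoding) method, and character strings have an .encode(encoding) method.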
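To make the decode/encode distinction concrete, here is a minimal sketch in Python 3 syntax (the question's code is Python 2, so treat this as an illustration rather than a drop-in fix). The sample sentence and the assumption that the page is served as UTF-8 are mine, not taken from the question. It shows that decoding UTF-8 bytes with the wrong codec produces exactly the kind of "â€™" debris the question is trying to strip by hand:

```python
# Bytes come off the network; text lives inside the program.
# `raw` stands in for what page.read() returns (assumed to be UTF-8).
raw = "It\u2019s cheap".encode("utf-8")   # U+2019 is a curly apostrophe

text = raw.decode("utf-8")       # right codec: bytes -> text
garbled = raw.decode("cp1252")   # wrong codec: same bytes, mangled text

# The mangling is reversible as long as nothing was stripped afterwards.
repaired = garbled.encode("cp1252").decode("utf-8")

# Encode back to bytes only at the boundary, when sending the story out.
outgoing = text.encode("utf-8")
```

If the partner really does accept UTF-8, decoding once on the way in and encoding once on the way out should make the chain of .replace calls unnecessary.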
Answer 1 (score: 1)
I was wrestling with character encodings in Python just the other day.
Try this:
import unicodedata
storyunified = unicodedata.normalize('NFKD', storyunified).encode('ascii','ignore').decode("ascii")
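One caveat is that it removes the offending characters rather than replacing them. To change that behavior you can change ignore to replace, but I haven't tested that.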
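For what it's worth, here is a small sketch in Python 3 syntax of how the two error handlers differ. It assumes the text is already a decoded character string; the helper name to_ascii, the PUNCT table, and the sample sentence are hypothetical, made up for illustration. With replace, each character that cannot be encoded becomes a "?", including the combining accent that NFKD splits off, so it may not be what you want either. Mapping common typographic punctuation to ASCII first keeps the quotes readable:

```python
import unicodedata

def to_ascii(text, errors="ignore"):
    """Decompose accented characters, then drop or mark anything non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors).decode("ascii")

# Hypothetical table of typographic punctuation and its ASCII stand-ins.
PUNCT = {
    0x2018: "'", 0x2019: "'",   # curly single quotes
    0x201C: '"', 0x201D: '"',   # curly double quotes
    0x2013: "-", 0x2014: "-",   # en and em dashes
}

sample = "Caf\u00e9\u2019s \u201cdeal\u201d"

dropped = to_ascii(sample)                    # curly quotes vanish
marked = to_ascii(sample, errors="replace")   # each one becomes "?"
mapped = to_ascii(sample.translate(PUNCT))    # quotes survive as ASCII
```

Here dropped comes out as `Cafes deal`, marked as `Cafe??s ?deal?`, and mapped as `Cafe's "deal"`.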