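I have a simple Python script that pulls posts from reddit and posts them on Twitter. Unfortunately, tonight it started failing, and I assume the cause is a reddit post whose title has a formatting problem. The error I get is: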
  File "redditbot.py", line 82, in <module>
    main()
  File "redditbot.py", line 64, in main
    tweeter(post_dict, post_ids)
  File "redditbot.py", line 74, in tweeter
    print post+" "+post_dict[post]+" #python"
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 34: ordinal not in range(128)
Here is my script:
# encoding=utf8
import praw
import json
import requests
import tweepy
import time
import urllib2
import sys
reload(sys)
sys.setdefaultencoding('utf8')

access_token = 'hidden'
access_token_secret = 'hidden'
consumer_key = 'hidden'
consumer_secret = 'hidden'

def strip_title(title):
    if len(title) < 75:
        return title
    else:
        return title[:74] + "..."

def tweet_creator(subreddit_info):
    post_dict = {}
    post_ids = []
    print "[bot] Getting posts from Reddit"
    for submission in subreddit_info.get_hot(limit=2000):
        post_dict[strip_title(submission.title)] = submission.url
        post_ids.append(submission.id)
    print "[bot] Generating short link using goo.gl"
    mini_post_dict = {}
    for post in post_dict:
        post_title = post
        post_link = post_dict[post]
        mini_post_dict[post_title] = post_link
    return mini_post_dict, post_ids

def setup_connection_reddit(subreddit):
    print "[bot] setting up connection with Reddit"
    r = praw.Reddit('PythonReddit PyReTw'
                    'monitoring %s' %(subreddit))
    subreddit = r.get_subreddit('python')
    return subreddit

def duplicate_check(id):
    found = 0
    with open('posted_posts.txt', 'r') as file:
        for line in file:
            if id in line:
                found = 1
    return found

def add_id_to_file(id):
    with open('posted_posts.txt', 'a') as file:
        file.write(str(id) + "\n")

def main():
    subreddit = setup_connection_reddit('python')
    post_dict, post_ids = tweet_creator(subreddit)
    tweeter(post_dict, post_ids)

def tweeter(post_dict, post_ids):
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)
    for post, post_id in zip(post_dict, post_ids):
        found = duplicate_check(post_id)
        if found == 0:
            print "[bot] Posting this link on twitter"
            print post+" "+post_dict[post]+" #python"
            api.update_status(post+" "+post_dict[post]+" #python")
            add_id_to_file(post_id)
            time.sleep(3000)
        else:
            print "[bot] Already posted"

if __name__ == '__main__':
    main()
Any help is much appreciated. Thanks in advance!
Answer 0 (score: 3)
Consider this simple program:
print(u'\u201c' + "python")
If you print to a terminal (one with an appropriate character encoding), you get
“python
However, if you redirect the output to a file, you get a UnicodeEncodeError:
script.py > /tmp/out
Traceback (most recent call last):
  File "/home/unutbu/pybin/script.py", line 4, in <module>
    print(u'\u201c' + "python")
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128)
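When you print to a terminal, Python uses the terminal's character encoding to encode the unicode. (A terminal can only print bytes, so unicode must be encoded before it can be printed.)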
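When you redirect the output to a file, Python cannot determine the character encoding, because a file has no declared encoding. So, by default, Python2 implicitly encodes all unicode with the ascii codec before writing it to the file. Since u'\u201c' cannot be ascii-encoded, a UnicodeEncodeError is raised. (Only the first 128 unicode code points, U+0000 through U+007F, can be encoded with ascii.)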
The Why Print Fails wiki explains this problem in detail.
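The implicit step described above can be reproduced by encoding by hand. This is a minimal sketch rather than part of the original answer; the variable names are illustrative, and it should behave the same on Python 2 and Python 3:

```python
# Reproduce, explicitly, the encoding step that Python 2 performs
# implicitly when unicode is written to a stream with no known encoding.
text = u'\u201cpython'  # starts with LEFT DOUBLE QUOTATION MARK

# ASCII only covers code points 0..127, so encoding U+201C raises.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 can represent every code point; U+201C becomes three bytes.
utf8_bytes = text.encode('utf-8')

print(ascii_ok)         # False
print(len(utf8_bytes))  # 9: three bytes for the quote, six for "python"
```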
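To fix the problem, first avoid adding unicode and byte strings together. Doing so triggers an implicit conversion with the ascii codec in Python2, and raises an exception in Python3. To future-proof your code, it is better to be explicit. For example, explicitly encode post before formatting and printing the bytes: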
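# Look up the URL with the original unicode key before encoding anything;
# the encoded bytes would not match the unicode key for non-ASCII titles.
title = post.encode('utf-8')
url = post_dict[post].encode('utf-8')
print('{} {} #python'.format(title, url))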
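One way to apply the "be explicit" advice is to build the whole message as unicode and encode it once, at the point where it leaves the program. The sketch below assumes two hypothetical helpers, `format_tweet` and `to_output_bytes`, and an example URL; none of these appear in the original script:

```python
def format_tweet(title, url, hashtag=u'#python'):
    """Build the tweet text as unicode (hypothetical helper)."""
    return u'{0} {1} {2}'.format(title, url, hashtag)


def to_output_bytes(text, encoding='utf-8'):
    """Encode explicitly, once, where the text leaves the program."""
    return text.encode(encoding)


# Example values are made up for illustration.
tweet = format_tweet(u'\u201cQuoted\u201d title', u'http://example.com/post')
encoded = to_output_bytes(tweet)
print(repr(encoded))
```

In Python 2, printing `encoded` writes the UTF-8 bytes as they are, so no implicit ascii conversion takes place whether or not the output is redirected.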
Answer 1 (score: 2)
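You are trying to print a unicode string to your terminal (or possibly to a file, via IO redirection), but the encoding used by the terminal (or file system) is ASCII. Because of this, Python tries to convert the string from its unicode representation to ASCII, and fails because the code point u'\u201c' (“) cannot be represented in ASCII. Your code is effectively doing this: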
>>> print u'\u201c'.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128)
You could try converting to UTF-8:
print (post + " " + post_dict[post] + " #python").encode('utf8')
or converting to ASCII like this:
print (post + " " + post_dict[post] + " #python").encode('ascii', 'replace')
which replaces every character that cannot be represented in ASCII with ?.
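For comparison, here is a small sketch of what the 'replace' handler produces next to two other standard codec error handlers, 'ignore' and 'backslashreplace'. The latter two are not mentioned in the answer and are shown only for contrast; the sample title is made up:

```python
title = u'\u201cHello\u201d'

# 'replace' substitutes a question mark for each unencodable character.
replaced = title.encode('ascii', 'replace')

# 'ignore' silently drops unencodable characters.
ignored = title.encode('ascii', 'ignore')

# 'backslashreplace' keeps the information as an escape sequence.
escaped = title.encode('ascii', 'backslashreplace')

print(repr(replaced))
print(repr(ignored))
print(repr(escaped))
```

Both 'replace' and 'ignore' lose information, so they suit log output better than text that will actually be posted.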
Alternatively, if you are printing for debugging purposes, you can print the repr of the string:
print repr(post + " " + post_dict[post] + " #python")
which outputs something like this:
>>> s = u'string with \u201cLEFT DOUBLE QUOTATION MARK\u201c'
>>> print repr(s)
u'string with \u201cLEFT DOUBLE QUOTATION MARK\u201c'
Answer 2 (score: 1)
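The problem is probably caused by mixing byte strings and unicode strings when concatenating. As an alternative to prefixing every string literal with u, perhaps
from __future__ import unicode_literals
solves the problem for you. See here for a more in-depth explanation and to decide whether it is an option for you.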
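However the literals are made unicode, any value that arrives as bytes still has to be decoded before it is concatenated with them. The following is a sketch only, assuming a hypothetical helper `to_text` and made-up sample values, none of which come from the original answers:

```python
def to_text(value, encoding='utf-8'):
    """Return value as a unicode string, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# A byte string (here UTF-8 encoded) and a unicode string are normalised
# to the same type first, so the concatenation never mixes the two.
title = to_text(b'\xe2\x80\x9cquoted\xe2\x80\x9d title')
url = to_text(u'http://example.com/post')
tweet = title + u' ' + url + u' #python'
print(repr(tweet))
```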