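I have a Reddit bot that tries to convert ASCII text into images. As described in {{3}}, I am running into problems with the encoding of special characters.
There is a page dedicated to this project (this issue), but for brevity I will post only the relevant code. I tried switching to Python 3 (because I heard it handles Unicode more gracefully than Python 2), but that did not solve the problem.
This function fetches comments from Reddit. As you can see, I encode to utf-8 as soon as I pull them in, which is why I am confused.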
def comments_by_keyword(r, keyword, subreddit='all', print_comments=False):
    """Fetches comments from a subreddit containing a given keyword or phrase
    Args:
        r: The praw.Reddit class, which is required to access the Reddit API
        keyword: Keep only the comments that contain the keyword or phrase
        subreddit: A string denoting the subreddit(s) to look through, default is 'all' for r/all
        limit: The maximum number of posts to fetch, increase for more thoroughness at the cost of increased redundancy/running time
        print_comments: (Debug option) If True, comments_by_keyword will print every comment it fetches, instead of just returning filtered ones
    Returns:
        An array of comment objects whose body text contains the given keyword or phrase
    """
    output = []
    comments = r.get_comments(subreddit, limit=1000)
    for comment in comments:
        # ignore the case of the keyword and comments being fetched
        # Example: for keyword='RIP mobile users', comments_by_keyword would keep 'rip Mobile Users', 'rip MOBILE USERS', etc.
        if keyword.lower() in comment.body.lower():
            print(comment.body.encode('utf-8'))
            print("=====\n")
            output.append(comment)
        elif print_comments:
            print(comment.body.encode('utf-8'))
            print("=====\n")
    return output
The text is then converted to an image:
def str_to_img(str, debug=False):
    """Converts a given string to a PNG image, and saves it to the return variable"""
    # use 12pt Courier New for ASCII art
    font = ImageFont.truetype("cour.ttf", 12)
    # do some string preprocessing
    str = str.replace("\n\n", "\n") # Reddit requires double newline for new line, don't let the bot do this
    str = html.unescape(str)
    img = Image.new('RGB', (1,1))
    d = ImageDraw.Draw(img)
    str_by_line = str.split("\n")
    num_of_lines = len(str_by_line)
    line_widths = []
    for i, line in enumerate(str_by_line):
        line_widths.append(d.textsize(str_by_line[i], font=font)[0])
    line_height = d.textsize(str, font=font)[1] # the height of a line of text should be unchanging
    img_width = max(line_widths) # the image width is the largest of the individual line widths
    img_height = num_of_lines * line_height # the image height is the # of lines * line height
    # creating the output image
    # add 5 pixels to account for lowercase letters that might otherwise get truncated
    img = Image.new('RGB', (img_width, img_height + 5), 'white')
    d = ImageDraw.Draw(img)
    for i, line in enumerate(str_by_line):
        d.text((0,i*line_height), line, font=font, fill='black')
    output = BytesIO()
    if (debug):
        img.save('test.png', 'PNG')
    else:
        img.save(output, 'PNG')
    return output
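Like I said, I encode everything in utf-8, yet special characters do not display correctly. I am also using Courier New from the official .ttf file, which is supposed to support a wide range of characters and symbols, so I am not sure what the problem is.
I feel like this is something obvious. Can anyone enlighten me? It's not ImageDraw, is it? On top of that, text encoding as a whole seems rather murky to me, so even after reading other StackOverflow posts (and blog posts about encoding) I have not gotten close to a real solution.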
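Answer 0 (score: 0)
I can't run any tests myself right now, and because of low rep I can't comment, so I am leaving a partial answer in the hope that it gives you some ideas. I am also a bit rusty on Python 2, but here goes.
So, two things. First:
"I encode everything in utf-8 as soon as I pull it"
Do you?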
print(comment.body.encode('utf-8'))
print("=====\n")
output.append(comment)
You are encoding what you print, but you append the original comment to the output list exactly as praw returned it. Does praw return unicode objects?
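A minimal Python 3 sketch of this point. `FakeComment` is a hypothetical stand-in for a praw comment, introduced only for illustration and not part of praw. It shows that calling `.encode('utf-8')` inside `print` builds a separate bytes object and leaves `body` itself unchanged:

```python
class FakeComment:
    """Hypothetical stand-in for a praw comment; only `body` is modelled."""

    def __init__(self, body):
        self.body = body


comment = FakeComment("caf\u00e9 \u2588\u2588")  # text with non-ASCII characters

encoded = comment.body.encode("utf-8")  # new bytes object, used only for printing
output = [comment]                      # the list still holds the original object

print(type(encoded).__name__)           # bytes
print(type(output[0].body).__name__)    # str
```

So whatever later reads `comment.body` from the returned list sees the text exactly as it was fetched, not the encoded bytes.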
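I ask because I think unicode objects are what the ImageDraw module wants. Looking at its source code, it does not seem to have any notion of the encoding of the text you are trying to render. That means a Python 2 byte string would probably be rendered as single-byte characters, which produces garbage in the output when the string is UTF-8 encoded.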
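The kind of garbage this produces can be reproduced without any imaging library. The sketch below is Python 3 and only imitates the suspected behaviour: it uses Latin-1 decoding as a stand-in for "treat every byte as its own character", which is an assumption about the effect, not a description of what ImageDraw does internally.

```python
text = "\u00e9"                    # 'é', a single character
utf8_bytes = text.encode("utf-8")  # two bytes: 0xC3 0xA9

# Latin-1 maps each byte to the code point with the same number,
# so this mimics reading the UTF-8 bytes one character per byte.
garbled = utf8_bytes.decode("latin-1")

print(len(text), len(utf8_bytes), len(garbled))  # 1 2 2
print(garbled == "\u00c3\u00a9")                 # True
```

One character turns into two unrelated ones, which is the typical look of UTF-8 text that was handled as raw bytes.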
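http://pillow.readthedocs.org/en/latest/reference/ImageFont.html#PIL.ImageFont.truetype mentions an "encoding" parameter, which defaults to Unicode. It may be worth setting it explicitly just in case; it might raise an error if the font is not Unicode-compatible.
Encoding in Python 2 is no fun. One thing I would still try, though, is making sure that a unicode object is what gets passed to ImageDraw (try unicode(str) or str.decode("utf8")).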
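For Python 3, where `unicode()` no longer exists and `str` is already the text type, the same idea could be sketched as a small guard. `to_text` is a hypothetical helper name, not something from praw or Pillow, and UTF-8 is assumed as the byte encoding:

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as text (str), decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


print(to_text(b"caf\xc3\xa9") == "caf\u00e9")  # True: bytes are decoded
print(to_text("caf\u00e9") == "caf\u00e9")     # True: text passes through unchanged
```

In `str_to_img`, the string would go through such a helper before being split into lines and handed to `d.text(...)`, so the drawing code only ever sees text objects.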