Python Reddit bot not encoding special characters correctly

Time: 2015-08-05 05:47:28

Tags: python bots reddit

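I have a Reddit bot that tries to convert ASCII text into images. As described in a linked issue, I am running into problems encoding special characters.

There is a whole project behind this, but for brevity I will post only the relevant code. I tried switching to Python 3 (since I heard it handles Unicode more gracefully than Python 2), but that did not solve the problem.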

This function pulls comments from Reddit. As you can see, I encode everything to utf-8 as soon as I pull it, which is why I am confused.

def comments_by_keyword(r, keyword, subreddit='all', print_comments=False):
    """Fetches comments from a subreddit containing a given keyword or phrase
    Args:
        r: The praw.Reddit class, which is required to access the Reddit API
        keyword: Keep only the comments that contain the keyword or phrase
        subreddit: A string denoting the subreddit(s) to look through, default is 'all' for r/all
        limit: The maximum number of posts to fetch, increase for more thoroughness at the cost of increased redundancy/running time
        print_comments: (Debug option) If True, comments_by_keyword will print every comment it fetches, instead of just returning filtered ones
    Returns:
        An array of comment objects whose body text contains the given keyword or phrase
    """

    output = []
    comments = r.get_comments(subreddit, limit=1000)

    for comment in comments:
        # ignore the case of the keyword and comments being fetched
        # Example: for keyword='RIP mobile users', comments_by_keyword would keep 'rip Mobile Users', 'rip MOBILE USERS', etc.
        if keyword.lower() in comment.body.lower():
            print(comment.body.encode('utf-8'))
            print("=====\n")
            output.append(comment)
        elif print_comments:
            print(comment.body.encode('utf-8'))
            print("=====\n")
    return output

The text is then converted to an image:

def str_to_img(str, debug=False):
    """Converts a given string to a PNG image, and saves it to the return variable"""
    # use 12pt Courier New for ASCII art
    font = ImageFont.truetype("cour.ttf", 12)

    # do some string preprocessing
    str = str.replace("\n\n", "\n") # Reddit requires double newline for new line, don't let the bot do this
    str = html.unescape(str)

    img = Image.new('RGB', (1,1))
    d = ImageDraw.Draw(img)

    str_by_line = str.split("\n")
    num_of_lines = len(str_by_line)

    line_widths = []
    for i, line in enumerate(str_by_line):
        line_widths.append(d.textsize(str_by_line[i], font=font)[0])
    line_height = d.textsize(str, font=font)[1]  # the height of a line of text should be unchanging

    img_width = max(line_widths)  # the image width is the largest of the individual line widths
    img_height = num_of_lines * line_height  # the image height is the # of lines * line height

    # creating the output image
    # add 5 pixels to account for lowercase letters that might otherwise get truncated
    img = Image.new('RGB', (img_width, img_height + 5), 'white')
    d = ImageDraw.Draw(img)

    for i, line in enumerate(str_by_line):
        d.text((0,i*line_height), line, font=font, fill='black')
    output = BytesIO()

    if (debug):
        img.save('test.png', 'PNG')
    else:
        img.save(output, 'PNG')

    return output

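Like I said, I am encoding everything in utf-8, but special characters are not displayed correctly. I am also using Courier New from the official .ttf file, which is supposed to support a wide range of characters and symbols, so I am not sure what the problem is.

I feel like this is something obvious. Can anyone enlighten me? It isn't ImageDraw, is it? On top of that, text encoding as a whole seems a bit nebulous to me, so even after reading other StackOverflow posts (and blog posts on encoding) I am struggling to get closer to a real solution.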

1 Answer:

Answer 0 (score: 0)

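I can't run any tests myself at the moment, and my low rep means I can't comment, so I am posting a partial answer in the hope that it gives you some ideas. I am also a bit rusty with Python 2, but here goes.

So, two things. First: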

  

"I encode everything to utf-8 as soon as I pull it"

Do you?

print(comment.body.encode('utf-8'))
print("=====\n")
output.append(comment)

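You are encoding what you print, but the comment you append to the output list is the original object, exactly as praw produced it. Does praw output unicode objects?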
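To illustrate that point, here is a minimal sketch, assuming Python 3. `FakeComment` is a made-up stand-in with only a `body` attribute; it is not part of praw. It only shows that `encode()` returns a new bytes object and leaves the original text untouched, so whatever is appended afterwards has not been encoded.

```python
class FakeComment:
    """Hypothetical stand-in for a praw comment (assumption, not praw's class)."""

    def __init__(self, body):
        self.body = body


comment = FakeComment("caf\u00e9 \u2588\u2588")

encoded = comment.body.encode("utf-8")  # a new bytes object, used only for printing
output = [comment]                      # the list still holds the original object

print(type(encoded).__name__)           # bytes
print(type(output[0].body).__name__)    # str
```

So the `encode('utf-8')` call in the question affects only the printed value, not what later reaches the image code.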

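I ask because I think unicode objects are what the ImageDraw module wants. Looking at its source code, it does not seem to have any clue about the encoding of the text you are trying to render. That means a Python 2 byte string would probably be rendered as single-byte characters, which produces garbage in the output when the bytes are utf8-encoded text.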
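The "single-byte characters" effect can be reproduced without any imaging library. This is only a sketch, assuming Python 3, and it uses Latin-1 as a stand-in for "one character per byte"; I have not checked what ImageDraw actually does internally.

```python
text = "caf\u00e9"                  # 'café', with one non-ASCII character
utf8_bytes = text.encode("utf-8")   # the é becomes two bytes: 0xC3 0xA9

# Treat every byte as its own character (Latin-1 maps bytes 0-255 one to one).
garbled = utf8_bytes.decode("latin-1")

print(len(text), len(utf8_bytes))           # 4 5
print(garbled == text)                      # False
print(utf8_bytes.decode("utf-8") == text)   # True
```

The garbled string has five characters instead of four, because the two bytes of the encoded é are each drawn as a separate character.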

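Second, the documentation at http://pillow.readthedocs.org/en/latest/reference/ImageFont.html#PIL.ImageFont.truetype mentions an "encoding" parameter, which defaults to Unicode. It may be worth setting it explicitly just in case. If the font is not unicode-compatible, that might raise an error.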

Encoding in Python 2 is no fun. But the one thing I would still try is making sure that a unicode object is passed to ImageDraw (try unicode(str) or str.decode("utf8")).
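On Python 3 the same idea looks a little different, since `unicode` no longer exists and `str` has no `decode` method. A rough sketch, where `ensure_text` is a hypothetical helper name of my own and utf-8 is an assumed encoding:

```python
def ensure_text(value, encoding="utf-8"):
    """Return text (str), decoding only when given bytes.

    Hypothetical helper for illustration; the utf-8 default is an assumption.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


print(ensure_text(b"caf\xc3\xa9") == "caf\u00e9")  # True
print(ensure_text("caf\u00e9") == "caf\u00e9")     # True
```

Calling something like this on the string just before it reaches the drawing code would guarantee that text, not encoded bytes, is what gets rendered.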