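I'm trying to do some text analysis on Reddit comments. At the moment I print the body and upvote count of every comment with more than 5 upvotes in the "hot" posts of a given subreddit: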
import praw

reddit = praw.Reddit(client_id=ID,
                     client_secret=SECRET, password=PWORD,
                     user_agent=UAGENT, username=UNAME)
subreddit = reddit.subreddit('cryptocurrency')
for submission in subreddit.hot(limit=10):
    # The sort order must be set before the comments are first accessed.
    submission.comment_sort = 'top'
    submission.comments.replace_more(limit=10)
    for comment in submission.comments.list():
        if comment.ups > 5:
            print(comment.body, comment.ups)
However, the output looks like this:
(u'Just hodl and let the plebs lose money on scamcoin ICO\'s that don\'t even have a working product. I don\'t understand some of these "traders" and "investors".', 9)
(u"Good idea imho but it's gonna be abused af. Think about it. It will be the sexual go to app real soon. If they will 'ban' nudity on it, then you will simply get the instagram chicks on there with all the horny guys liking their photos and giving them free money. 'if this gets 1000 likes I will post a pic of me in bikini' ", 7)
(u"But but but, I just sold a kidney and bought in at the top, now I can't afford to get the stitches removed!\n\n/s just in case.", 7)
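Two questions:
1) Is it possible to convert the output to JSON using Python?
2) If not, how do I get rid of all the extra characters other than the body and the upvote count?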
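My end goal is to get this output clean and organized so that I can analyze keywords against upvote counts (which keywords get the most upvotes, and so on).
Thanks!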
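Answer 0 (score: 0)
Answer to question 2: it looks like you are running this under Python 2 but using Python 3 print syntax. To get rid of the tuple notation in the print output, you need
from __future__ import print_function
at the top of your program.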
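If adding that import is not convenient, another option is to build the output line yourself, so that print receives a single string. The following is only a sketch: `body` and `ups` are hard-coded stand-ins for `comment.body` and `comment.ups`, not real PRAW objects.

```python
# Stand-in values for comment.body and comment.ups (illustrative only).
body = u"Just hodl"
ups = 9

# Format the fields into one string. With a single parenthesized
# argument, print behaves the same under Python 2 and Python 3.
line = u"{}\t{}".format(ups, body)
print(line)
```

A tab-separated line like this is also easy to load into a spreadsheet or split again later.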
Answer 1 (score: 0)
1) Is it possible to convert the output to JSON using Python?
Almost. It is nearly just this:
output_string = json.dumps(comments)
except that a few keys cause the error TypeError: Object of type Foo is not JSON serializable
We can work around that. The non-serializable PRAW objects work fine once they are converted to strings.
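import json

def is_serializable(k, v):
    # True if json can encode this key/value pair as-is.
    try:
        json.dumps({k: v})
    except TypeError:
        return False
    return True

for comment in comments:
    for k, v in comment.items():
        # Replace values json cannot encode with their string form.
        if not is_serializable(k, v):
            comment[k] = str(v)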
Now saving works.
json.dumps(comments)
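A shorter route to a similar result is the `default` argument of `json.dumps`, which the standard library calls for any value it cannot encode. The sketch below assumes `comments` is a list of plain dicts; the `FakeAuthor` class and the sample data are hypothetical stand-ins for non-serializable PRAW objects.

```python
import json


class FakeAuthor(object):
    """Hypothetical stand-in for a non-serializable PRAW object."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        return self.name


# Assumed shape of `comments`: a list of plain dicts, one per comment.
comments = [
    {"body": "Just hodl", "ups": 9, "author": FakeAuthor("alice")},
    {"body": "Good idea imho", "ups": 7, "author": FakeAuthor("bob")},
]

# default=str is called only for values json cannot encode natively.
output_string = json.dumps(comments, default=str, indent=2)
print(output_string)

# Round-trip to confirm the result is valid JSON.
decoded = json.loads(output_string)
```

Unlike the loop above, this leaves the original dicts untouched and only stringifies values while encoding.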
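2) If not, how do I get rid of all the extra characters other than the body and the upvote count?
I think you are asking how to remove the unwanted keys. You can use: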
save_keys = ['body', 'ups']
for k in list(comment):
    if k not in save_keys:
        del comment[k]
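We use list(dict) to iterate over a copy of the dict's keys. This keeps you from modifying the object while you are iterating over it. list(dict) is equivalent to list(dict.keys()).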
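To illustrate why the copy matters, here is a small sketch with a made-up comment dict (the `id` and `author` keys are just sample data). It shows the error you get when deleting during direct iteration, and a dict comprehension as an alternative that builds a new dict instead of mutating the old one.

```python
comment = {"body": "Just hodl", "ups": 9, "id": "abc123", "author": "alice"}
save_keys = {"body", "ups"}

# Deleting while iterating over the dict itself raises RuntimeError,
# because the dict changes size during iteration.
broken = dict(comment)
try:
    for k in broken:
        if k not in save_keys:
            del broken[k]
    raised = False
except RuntimeError:
    raised = True

# Building a new dict avoids mutating anything during iteration.
trimmed = {k: v for k, v in comment.items() if k in save_keys}
print(trimmed)
```

The comprehension is handy when you want to keep the full comment data around and only trim what gets written out.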