Python PRAW: get comments and write them to a file

Date: 2017-07-05 06:41:32

Tags: python web-scraping praw reddit

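I am retrieving the contents of a subreddit. The subreddit is AR. I need to get the post ID, title, post content, author, post date, score, comments, and comment IDs, and then write them to a txt file. The problems I am facing now are: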

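(1) Can I combine the comments and comment IDs into the same file as the posts? Each record would then be post ID, title, post content, author, post date, score, comments, and comment ID.

(2) The selftext I get contains line breaks, so my output.txt looks like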

blablabla

blablabla

blablabla

For example, [this reddit] [1] has multiple line breaks. I want all of the content on a single line, because the data will be moved to csv / excel for future analysis.

My code:

import praw, datetime, os
reddit = praw.Reddit('bot1')
subreddit = reddit.subreddit('AR')
for submission in subreddit.top(limit=1):
    date = datetime.datetime.utcfromtimestamp(submission.created_utc)

    for comment in submission.comments:
        print("Comment author: ", comment.author)
        print("Comments: ", comment.body)
        indexFile_comment = open('path' + 'index_comments.txt', 'a+')
        indexFile_comment.write('"' + str(comment.author) + '"' + ', ' + '"' + str(comment.body) + '"' + '\n')
    print("Post ID: ", submission.id)
    print("Title: ", submission.title)
    print("Post Content: ", submission.selftext)
    print("User Name: ", submission.author)
    print("Post Date: ", date)
    print("Point: ", submission.score)
    indexFile = open('path' + 'index.txt', 'a+')
    indexFile.write('"' + str(submission.id) + '"' + ', ' + '"' + str(submission.title) + '"' + ', ' + '"' + str(submission.selftext) + '"' + ', ' + '"' + str(submission.author) + '"' + ', ' + '"' + str(date) + '"' + ', ' + '"' + str(submission.score) + '"' + '\n')
    print ("Successfuly writing in file")
    indexFile.close()

1 Answer:

Answer 0: (score: 0)

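To get a submission's text onto a single line, you can use st.replace("\n", " ") in your code, where the variable st holds submission.selftext. Note that replace returns a new string, so assign the result back to a variable.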
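A minimal sketch of that flattening step, assuming the post body is already in a plain string. The flatten helper and the sample text are made up for illustration. It uses split and join instead of a single replace call, so that "\r\n" line endings and blank lines also collapse to one space:

```python
def flatten(text):
    """Collapse every run of whitespace, including blank lines, into one space."""
    return " ".join(text.split())


# Made-up sample standing in for submission.selftext
selftext = "blablabla\n\nblablabla\r\n\r\nblablabla"
one_line = flatten(selftext)
print(one_line)
```

If the exact spacing inside the text matters, the plain replace("\n", " ") call keeps it, at the cost of leaving double spaces where blank lines used to be.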

To get the comment ID, you can use comment.id, and you can get the body with comment.body inside the for loop.

Edit

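In the first line I only added submission.id and submission.title, but you can add the rest of the fields in the same way. The loop appends each comment to the end of the same string. After the for loop, I replace any newline characters with a space character. You can then write record to the text file, and when you move on to the next submission, append the next record on a new line in the text file.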

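record = str(submission.id) + " " + str(submission.title) + " "
# Drop "load more comments" placeholders, which have no author or body
submission.comments.replace_more(limit=0)
for comment in submission.comments:
    # comment.author is a Redditor object (None if deleted), so convert it with str()
    record = record + str(comment.id) + " " + str(comment.author) + " " + comment.body + " "
# str.replace returns a new string, so assign the result back
record = record.replace("\n", " ")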
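Since the data is headed for csv / excel, one possible alternative to a space-separated record is the standard csv module, which handles quoting of commas and quotation marks inside the text. The sketch below is only an illustration under assumptions: the build_rows and flatten helpers, the dictionary layout, and the sample values are all hypothetical stand-ins for fields read from the PRAW objects, and writing one row per comment (with the post fields repeated) is a design choice, not something the question requires.

```python
import csv
import io


def flatten(text):
    """Collapse line breaks and other whitespace runs into single spaces."""
    return " ".join(str(text).split())


def build_rows(post, comments):
    """Return one row per comment, each repeating the post fields."""
    post_fields = [
        post["id"], flatten(post["title"]), flatten(post["selftext"]),
        post["author"], post["date"], post["score"],
    ]
    return [
        post_fields + [c["id"], c["author"], flatten(c["body"])]
        for c in comments
    ]


# Made-up sample data; in real use these values would come from
# submission.id, submission.title, comment.id, comment.body and so on.
post = {"id": "abc123", "title": "A title", "selftext": "line one\n\nline two",
        "author": "someone", "date": "2017-07-05 06:41:32", "score": 10}
comments = [
    {"id": "c1", "author": "user_a", "body": "first\ncomment"},
    {"id": "c2", "author": "None", "body": 'has, a comma and "quotes"'},
]

# An in-memory buffer keeps the sketch self-contained; for a real file use
# open("index.csv", "a", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
rows = build_rows(post, comments)
writer.writerows(rows)
print(buffer.getvalue())
```

Each output line then holds post ID, title, post content, author, post date, score, comment ID, comment author, and comment body, which spreadsheet tools can open directly.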