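I want to run a script in the background that fetches subreddit data roughly once an hour. Since I don't want duplicate entries in my database, I would like to filter my search results by created_utc.
This is what I have so far: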
r = praw.Reddit(user_agent='soc')
submissions = r.get_subreddit('soccer').get_hot()
This is what I want:
r = praw.Reddit(user_agent='soc')
submissions = r.get_subreddit('soccer').get_hot(created_utc > '2016-02-18 14:33:14.000')
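What are some ways to achieve this?
Answer 0 (score: 2)
Neither the SubReddit class nor the Reddit API provides the date-based filtering you are looking for, so here is one option:
Filter the results in Python before putting them into the database, for example on the output of get_new:
from datetime import datetime, timedelta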
import praw
# assuming you run this script every hour
an_hour_ago = datetime.utcnow() - timedelta(hours=1)
r = praw.Reddit(user_agent='soc')
submissions = r.get_subreddit('soccer').get_new()
submissions_list = [
    # iterate through the submissions generator object
    x for x in submissions
    # add item if item.created_utc is newer than an hour ago
    if datetime.utcfromtimestamp(x.created_utc) >= an_hour_ago
]
get_new and get_hot return generator objects, which is why the list comprehension above can iterate over them directly.
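The same filter can be tried without touching the network. The following is only a sketch: FakeSubmission and newer_than are hypothetical names standing in for PRAW submission objects and are not part of PRAW. The only assumption is that each item exposes created_utc as a Unix timestamp in seconds, as the answer's use of utcfromtimestamp suggests.

```python
from collections import namedtuple
from datetime import datetime, timedelta, timezone

# Hypothetical stand-in for a PRAW submission; only the fields used here.
FakeSubmission = namedtuple("FakeSubmission", ["fullname", "created_utc"])


def newer_than(submissions, cutoff):
    """Return the items whose created_utc (epoch seconds) is at or after cutoff."""
    cutoff_ts = cutoff.timestamp()
    return [s for s in submissions if s.created_utc >= cutoff_ts]


# Fixed "now" so the example is reproducible.
now = datetime(2016, 2, 18, 15, 0, 0, tzinfo=timezone.utc)
an_hour_ago = now - timedelta(hours=1)

sample = [
    FakeSubmission("t3_new", (now - timedelta(minutes=10)).timestamp()),
    FakeSubmission("t3_old", (now - timedelta(hours=3)).timestamp()),
]

# Passing an iterator mimics the generator returned by get_new().
recent = newer_than(iter(sample), an_hour_ago)
print([s.fullname for s in recent])
```

Comparing raw timestamps avoids building a datetime per item; converting each created_utc with utcfromtimestamp, as in the answer, gives the same result.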
limit = 100 # Reddit maximum limit
total_list = []
submissions = r.get_subreddit('soccer').get_new(limit=limit)
submissions_list = [
    x for x in submissions
    if datetime.utcfromtimestamp(x.created_utc) >= an_hour_ago
]
total_list += submissions_list
# a full page of recent items means the next page may hold more
if len(submissions_list) == limit:
    submissions = r.get_subreddit('soccer').get_new(
        # get limit of items past the last item in the total list
        limit=limit, params={"after": total_list[-1].fullname}
    )
    submissions_list_2 = [
        # iterate through the submissions generator object
        x for x in submissions
        # add item if item.created_utc is newer than an hour ago
        if datetime.utcfromtimestamp(x.created_utc) >= an_hour_ago
    ]
    total_list += submissions_list_2
print(total_list)
By default, Reddit returns only 25 listings, so if you need more you have to paginate, as in the code above.
If the number of submissions is greater than 200, you will have to put this into a recursive function: subreddit_latest.py
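A plain loop can do the same job as recursion. The sketch below is not the answer's original code: it assumes a hypothetical fetch_page(after) callable that returns one page of items, newest first (in the code above that role is played by get_new with params={"after": ...}). collect_recent, fake_fetch_page and the dictionary-based fake data are illustrative names only, not PRAW API.

```python
def collect_recent(fetch_page, cutoff_ts, limit=100):
    """Request pages until one is short or contains an item older than cutoff_ts."""
    total, after = [], None
    while True:
        page = list(fetch_page(after))
        fresh = [s for s in page if s["created_utc"] >= cutoff_ts]
        total += fresh
        # Stop when the page was not full or some of its items were too old.
        if len(page) < limit or len(fresh) < len(page):
            return total
        # Otherwise continue after the last item seen, as with the "after" param.
        after = page[-1]["fullname"]


# Fake listing for illustration: 250 items, newest first.
FAKE = [{"fullname": "t3_%d" % i, "created_utc": 1000 - i} for i in range(250)]


def fake_fetch_page(after, limit=100):
    """Hypothetical page fetcher that slices the fake listing."""
    start = 0
    if after is not None:
        start = next(i for i, s in enumerate(FAKE) if s["fullname"] == after) + 1
    return FAKE[start:start + limit]


result = collect_recent(fake_fetch_page, cutoff_ts=800)
print(len(result), result[-1]["fullname"])
```

The early exit relies on the listing being sorted newest first, which holds for get_new but not for get_hot.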
Answer 1 (score: 1)
You should compare datetime objects rather than strings, so convert the values to datetime like this:
from datetime import datetime
date = datetime.strptime('2016-02-18 14:33:14.000', '%Y-%m-%d %H:%M:%S.%f')
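You should do the same with created_utc and then compare the two. I don't know whether you can do the comparison inside the get_hot function, because I have never used it.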
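To make that comparison concrete: the other answer passes created_utc to datetime.utcfromtimestamp, that is, it treats the value as a Unix timestamp rather than a string, so the sketch below converts it that way instead of with strptime. The sample created_utc value is made up for illustration, and the cutoff string is assumed to be in UTC.

```python
from datetime import datetime, timezone

# Parse the cutoff string from the question and mark it as UTC (assumption).
cutoff = datetime.strptime('2016-02-18 14:33:14.000', '%Y-%m-%d %H:%M:%S.%f')
cutoff = cutoff.replace(tzinfo=timezone.utc)

# Made-up epoch value, one hour after the cutoff.
created_utc = 1455809594.0
created = datetime.fromtimestamp(created_utc, tz=timezone.utc)

# Both sides are timezone-aware datetimes, so they compare directly.
is_newer = created > cutoff
print(created.isoformat(), is_newer)
```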