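After scraping data from the Instagram API, I have a problem with my data output - ( https://api.instagram.com/v1/media/search?lat=48.858844&lng=2.294351&access_token=ACCESS-TOKEN ). My data is written to a txt file in descending order by time (i.e. the latest post is always the first entry, and the last entry is the oldest post).
Is there any way to adjust my code so that it writes to my txt file in ascending order, so that the last entry is always the latest Instagram post? The rationale is that I want to use the API's min_timestamp parameter to avoid crawling duplicate data.
I'm new to this site, but here is my source code: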
import json
import time
from urllib2 import urlopen  # Python 2; url and query are defined elsewhere

json_data = urlopen(url).read()
response = json.loads(json_data)
file_output = "instagram_output_" + query + ".txt"
f = open(file_output, 'a+')

def convert_time_unix_to_human(input_time):
    return time.strftime("%Y-%m-%d %H:%M", time.localtime(int(input_time)))

def write_to_txt(data):
    # Write one pipe-delimited line per post that has caption text
    if 'text' in data['caption']:
        f.write(data['caption']['id'] + '|')
        f.write(data['caption']['created_time'] + '|')
        f.write(convert_time_unix_to_human(data['caption']['created_time']) + '|')
        f.write(data['caption']['from']['username'] + '|')
        f.write(data['caption']['from']['id'] + '|')
        f.write(str(data['location']['latitude']) + '|')
        f.write(str(data['location']['longitude']) + '|')
        f.write(str(data['likes']['count']) + '|')
        f.write(str(data['comments']['count']) + '|')
        text = data['caption']['text'].replace('\n', ' ').replace('\r', ' ').encode('utf-8')
        f.write(text + '\n')

def get_next_page(content):
    # Return the URL of the next page, or '' when there is none
    if 'pagination' in content:
        if 'next_url' in content['pagination']:
            return content['pagination']['next_url']
        else:
            return ''
    else:
        return ''

count = 1
count_caption = 0  # was never initialized in the original snippet
while url != '' and count < 5000:
    next_data = urlopen(url).read()
    response = json.loads(next_data)
    for i in range(0, len(response['data'])):
        if response['data'][i]['caption'] is not None:
            if response['data'][i]['location'] is not None:
                if 'latitude' in response['data'][i]['location'] and 'longitude' in response['data'][i]['location']:
                    write_to_txt(response['data'][i])
                    count_caption += 1
    url = get_next_page(response)
    count = count + 1
f.close()
print "Successfully crawled {0}".format(count_caption)
Any help is much appreciated!
Answer 0 (score: 0)
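You are processing the response and writing to the file from the start index (0) to the end index (the length of the response JSON array). Instead, try processing it in reverse, like
for i in range(len(response['data']) - 1, -1, -1):
Then the data should be written in ascending order, at least within each page of results.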
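Reversing the loop only flips the order inside a single page; the pages themselves still arrive newest first, so the file as a whole would not be fully ascending. One possible way around this, sketched below, is to collect the posts from every page first and sort them by created_time before writing anything. This is only a sketch under assumptions: the function name posts_in_ascending_order and the sample data are hypothetical, the responses are assumed to have the same shape as in the question, and no network calls are made.

```python
def posts_in_ascending_order(pages):
    """Flatten parsed API pages and sort posts oldest-first by caption time."""
    posts = []
    for page in pages:
        for post in page.get('data', []):
            # Same filter as in the question: skip posts without a caption.
            if post.get('caption') is not None:
                posts.append(post)
    # created_time is a string of Unix seconds, so compare it as an int.
    return sorted(posts, key=lambda p: int(p['caption']['created_time']))


# Hypothetical sample data shaped like the responses in the question.
pages = [
    {'data': [{'caption': {'id': 'c', 'created_time': '300'}},
              {'caption': {'id': 'b', 'created_time': '200'}}]},
    {'data': [{'caption': None},
              {'caption': {'id': 'a', 'created_time': '100'}}]},
]
ordered = posts_in_ascending_order(pages)
print([p['caption']['id'] for p in ordered])
```

In the question's code this would mean appending each response to a list inside the while loop and calling write_to_txt on the sorted posts after the loop ends, so the last line of the file holds the newest created_time to pass as min_timestamp on the next run.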