Sorting data scraped from the Instagram Media API (by timestamp)

Date: 2015-11-04 09:38:29

Tags: sorting instagram

After scraping data from the Instagram API ( https://api.instagram.com/v1/media/search?lat=48.858844&lng=2.294351&access_token=ACCESS-TOKEN ), I have a problem with my data output. My data is written to a txt file in descending order by time (i.e. the newest post is always the first entry, and the last entry is the oldest post).

Is there any way to adjust my code so that it writes to my txt file in ascending order, i.e. the last entry is always the newest Instagram post? The rationale is that I want to use the API's min_timestamp parameter to avoid crawling duplicate data on later runs.
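For reference, my plan for the de-duplicated follow-up crawl is roughly the sketch below (it assumes the file is already in ascending order and uses the '|'-delimited layout my code writes, with created_time as the second field):

    # Sketch: resume a later crawl from the newest timestamp already on disk.
    # Assumes the file is sorted ascending, so the last line is the newest post,
    # and that created_time is the second '|'-delimited field (see write_to_txt).
    with open(file_output) as prev:
        min_timestamp = prev.readlines()[-1].split('|')[1]

    url = ("https://api.instagram.com/v1/media/search"
           "?lat=48.858844&lng=2.294351"
           "&min_timestamp=" + min_timestamp +
           "&access_token=ACCESS-TOKEN")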

I'm new to this site, but here is my source code:

    import json
    import time
    from urllib2 import urlopen  # Python 2; on Python 3 this is urllib.request

    # `url` (the initial API query) and `query` (a label for the output file)
    # are defined earlier in the script.
    json_data = urlopen(url).read()
    response = json.loads(json_data)

    file_output = "instagram_output_" + query + ".txt"
    f = open(file_output, 'a+')


    def convert_time_unix_to_human(input_time):
        return time.strftime("%Y-%m-%d %H:%M", time.localtime(int(input_time)))


    def write_to_txt(data):
        if 'text' in data['caption']:
            f.write(data['caption']['id'] + '|')
            f.write(data['caption']['created_time'] + '|')
            f.write(convert_time_unix_to_human(data['caption']['created_time']) + '|')
            f.write(data['caption']['from']['username'] + '|')
            f.write(data['caption']['from']['id'] + '|')
            f.write(str(data['location']['latitude']) + '|')
            f.write(str(data['location']['longitude']) + '|')
            f.write(str(data['likes']['count']) + '|')
            f.write(str(data['comments']['count']) + '|')

            text = data['caption']['text'].replace('\n', ' ').replace('\r', ' ').encode('utf-8')
            f.write(text + '\n')


    def get_next_page(content):
        if 'pagination' in content and 'next_url' in content['pagination']:
            return content['pagination']['next_url']
        return ''


    count = 1
    count_caption = 0  # number of items written so far
    while url != '' and count < 5000:
        next_data = urlopen(url).read()
        response = json.loads(next_data)
        for i in range(0, len(response['data'])):
            if response['data'][i]['caption'] is not None:
                if response['data'][i]['location'] is not None:
                    if 'latitude' in response['data'][i]['location'] and 'longitude' in response['data'][i]['location']:
                        write_to_txt(response['data'][i])
                        count_caption += 1

        url = get_next_page(response)
        count = count + 1

    f.close()
    print "Successfully crawled {0}".format(count_caption)

Any help is much appreciated!

1 Answer:

Answer 0 (score: 0):

You are processing the response and writing to the file from the start index (0) to the end index (the length of the response JSON array). Instead, try processing it in reverse, e.g.

    for i in range(len(response['data']) - 1, -1, -1):

Then you may end up writing the data in ascending order.
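Note that reversing the loop only flips the order within a single page of results; the pages themselves still arrive newest-first, so across pages the file would still not be strictly ascending. A more robust sketch (not from the original answer, reusing the question's helpers) is to buffer the items and sort them by created_time before writing:

    # Sketch: collect qualifying items across all pages, then sort ascending
    # by the Unix created_time so the last line written is the newest post.
    items = []
    count = 1
    while url != '' and count < 5000:
        response = json.loads(urlopen(url).read())
        for media in response['data']:
            if media['caption'] is not None and media['location'] is not None:
                if 'latitude' in media['location'] and 'longitude' in media['location']:
                    items.append(media)
        url = get_next_page(response)
        count += 1

    items.sort(key=lambda m: int(m['caption']['created_time']))
    for media in items:
        write_to_txt(media)
    f.close()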