Question

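After scraping data from the Instagram API, I have a problem with my output (https://api.instagram.com/v1/media/search?lat=48.858844&lng=2.294351&access_token=ACCESS-TOKEN). My data is written to a txt file in descending order by time (that is, the newest post is always the first entry and the oldest post is the last).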

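Is there a way to adjust my code so that it writes to my txt file in ascending order, so that the last entry is always the most recent Instagram post? The rationale is that I want to use the API's min_timestamp parameter to avoid crawling duplicate data.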

I'm new to this site, but here is my source code:

    import json
    import time
    from urllib.request import urlopen

    # `url` (the first API request URL) and `query` (used in the file name)
    # are defined earlier in the script.
    file_output = "instagram_output_" + query + ".txt"
    f = open(file_output, 'a+', encoding='utf-8')


    def convert_time_unix_to_human(input_time):
        # Format a Unix timestamp (string or int) as local time.
        return time.strftime("%Y-%m-%d %H:%M", time.localtime(int(input_time)))


    def write_to_txt(data):
        # Write one post as a single pipe-delimited line.
        if 'text' in data['caption']:
            f.write(data['caption']['id'] + '|')
            f.write(data['caption']['created_time'] + '|')
            f.write(convert_time_unix_to_human(data['caption']['created_time']) + '|')
            f.write(data['caption']['from']['username'] + '|')
            f.write(data['caption']['from']['id'] + '|')
            f.write(str(data['location']['latitude']) + '|')
            f.write(str(data['location']['longitude']) + '|')
            f.write(str(data['likes']['count']) + '|')
            f.write(str(data['comments']['count']) + '|')

            # Flatten line breaks so each post stays on one line.
            text = data['caption']['text'].replace('\n', ' ').replace('\r', ' ')
            f.write(text + '\n')


    def get_next_page(content):
        # Return the URL of the next page, or '' when there is none.
        if 'pagination' in content:
            if 'next_url' in content['pagination']:
                return content['pagination']['next_url']
            else:
                return ''
        else:
            return ''


    count = 1
    count_caption = 0  # number of posts handed to write_to_txt
    while url != '' and count < 5000:
        next_data = urlopen(url).read()
        response = json.loads(next_data)
        for i in range(0, len(response['data'])):
            if response['data'][i]['caption'] is not None:

                if response['data'][i]['location'] is not None:
                    if 'latitude' in response['data'][i]['location'] and 'longitude' in response['data'][i]['location']:
                        write_to_txt(response['data'][i])
                        count_caption += 1

        url = get_next_page(response)
        count = count + 1

    f.close()
    print("Successfully crawled {0}".format(count_caption))

Any help is greatly appreciated!

Answer 1

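You are processing the response and writing to the file from the start index (0) to the end index (the length of the response JSON array). Instead, try processing it in reverse, like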

    for i in range(len(response['data']) - 1, -1, -1):

Then the data should be written in ascending order within each page.
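Reversing the loop only reorders the posts inside one response. If later pages contain older posts, as your descending output suggests, the file as a whole would still not be oldest-first. A minimal sketch of one way around that, assuming you first collect the posts that pass your caption and location checks into a list instead of writing them immediately: sort that list by `caption.created_time` before writing. The helper names `sort_posts_ascending` and `newest_timestamp` are made up for illustration, and the sample dictionaries only mimic the fields your code already reads.

```python
def sort_posts_ascending(posts):
    """Return the posts ordered oldest-first by caption creation time."""
    # created_time is a string of Unix seconds, so compare it as an int.
    return sorted(posts, key=lambda post: int(post['caption']['created_time']))


def newest_timestamp(posts):
    """Return the largest caption created_time as an int, or None if empty."""
    if not posts:
        return None
    return max(int(post['caption']['created_time']) for post in posts)


# Hypothetical sample in the same newest-first order the API returned.
sample = [
    {'caption': {'id': 'c3', 'created_time': '1400000300'}},
    {'caption': {'id': 'c2', 'created_time': '1400000200'}},
    {'caption': {'id': 'c1', 'created_time': '1400000100'}},
]

ordered = sort_posts_ascending(sample)
print([post['caption']['id'] for post in ordered])  # ['c1', 'c2', 'c3']
print(newest_timestamp(sample))  # 1400000300
```

After sorting, you could call your `write_to_txt` on each element of the sorted list, and the value from `newest_timestamp` is a candidate for `min_timestamp` on the next crawl. Whether `min_timestamp` is inclusive is something I have not checked, so you may still need to skip a post you already have.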

Sorting data scraped from the Instagram Media API (by timestamp)
