Downloading a large archive from AWS Glacier using Boto

Date: 2015-01-16 13:58:14

Tags: python amazon-web-services boto amazon-glacier

I am trying to download a large archive (~1 TB) from Glacier using the Python package Boto. The current method I am using looks like this:

import os
import boto.glacier
import boto
import time

ACCESS_KEY_ID = 'XXXXX'
SECRET_ACCESS_KEY = 'XXXXX'
VAULT_NAME = 'XXXXX'
ARCHIVE_ID = 'XXXXX'
OUTPUT = 'XXXXX'

layer2 = boto.connect_glacier(aws_access_key_id = ACCESS_KEY_ID,
                              aws_secret_access_key = SECRET_ACCESS_KEY)

gv = layer2.get_vault(VAULT_NAME)

job = gv.retrieve_archive(ARCHIVE_ID)
job_id = job.id

while not job.completed:
    time.sleep(10)
    job = gv.get_job(job_id)

if job.completed:
    print "Downloading archive"
    job.download_to_file(OUTPUT)

The problem is that the job ID expires after 24 hours, which is not enough time to retrieve the whole archive. I need to break the download up into at least 4 pieces. How can I do that and still write the output to a single file?

2 answers:

Answer 0 (score: 3):

It looks like you can simply specify a chunk_size parameter when calling job.download_to_file, like this:

if job.completed:
    print "Downloading archive"
    job.download_to_file(OUTPUT, chunk_size=1024*1024)

However, if you can't download all of the chunks within the 24 hours, I don't think you can pick up just the chunks you missed using layer2 alone.

First approach

Using layer1, you just have to use the get_job_output method and specify the byte range you want to download.

It would look like this:

file_size = check_file_size(OUTPUT)  # bytes of OUTPUT already on disk (0 if starting fresh)
chunk = 1024 * 1024

if job.completed:
    print "Downloading archive"
    # Open in append mode so a rerun resumes after the bytes already written.
    with open(OUTPUT, 'ab') as output_file:
        i = 0
        while True:
            # get_job_output is a layer1 method; the byte range is an inclusive (start, end) tuple.
            start = file_size + chunk * i
            end = start + chunk - 1
            response = layer2.layer1.get_job_output(VAULT_NAME, job_id, (start, end))
            data = response.read()
            output_file.write(data)
            if len(data) < chunk:
                break
            i += 1

With this script, you should be able to rerun it after a failure and keep downloading your archive from where you left off.
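
Note that check_file_size isn't shown above; below is a minimal sketch of such a helper, assuming it only needs to report how many bytes of OUTPUT already exist on disk so the next run can resume from that offset:

import os

def check_file_size(path):
    # Hypothetical helper assumed by the snippets in this answer: return the
    # number of bytes of the partially downloaded archive already at `path`,
    # or 0 when nothing has been downloaded yet.
    return os.path.getsize(path) if os.path.exists(path) else 0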

Second approach

Digging into the boto code, I found a "private" method on the Job class that you could also use: _download_byte_range. With this method you can still stay on layer2.

import socket

file_size = check_file_size(OUTPUT)  # bytes of OUTPUT already on disk (0 if starting fresh)
chunk = 1024 * 1024

if job.completed:
    print "Downloading archive"
    # Open in append mode so a rerun resumes after the bytes already written.
    with open(OUTPUT, 'ab') as output_file:
        i = 0
        while True:
            # _download_byte_range takes an inclusive (start, end) tuple and the
            # exceptions to retry on; it returns the chunk data and its tree hash.
            start = file_size + chunk * i
            end = start + chunk - 1
            data, tree_hash = job._download_byte_range((start, end), (socket.error,))
            output_file.write(data)
            if len(data) < chunk:
                break
            i += 1

Answer 1 (score: 0):

You have to add a region_name to the boto.connect_glacier call.
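
For example (a minimal sketch reusing the credentials from the question; the region string here is only a placeholder for your vault's region):

layer2 = boto.connect_glacier(aws_access_key_id=ACCESS_KEY_ID,
                              aws_secret_access_key=SECRET_ACCESS_KEY,
                              region_name='us-east-1')  # placeholder: use the region of your vault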