Reading part of a numpy array stored on S3

Date: 2018-01-19 19:21:35

Tags: python arrays numpy amazon-s3

I have a numpy array stored on AWS S3. I can retrieve it in full and rebuild the numpy array. However, I cannot do the same for only part of the array:

import boto3
import numpy as np
import sys

# Let's use Amazon S3
aws_session = boto3.Session(profile_name='myprofileAWS')
client = aws_session.client('s3')
resource = aws_session.resource('s3')
bucket_name = 'test'
bucket = resource.Bucket(bucket_name)

# Construct numpy array and upload on S3
tab = np.arange(100, dtype=np.int16)
tab.tofile('/temp/tab_test.bin')
bucket.upload_file('/temp/tab_test.bin', 'tab_test.bin')

# Check object size (returns 200 Bytes i.e. 100 items of 2 Bytes)
resource.Object(bucket_name=bucket_name, key='tab_test.bin').content_length

# Retrieve object
offset = 0
end = 200
obj_test = client.get_object(Bucket=bucket_name, 
                             Key='tab_test.bin',
                             Range='bytes={}-{}'.format(offset, end))
obj_test_string = obj_test['Body'].read()

# Reconstruct Array
# Returns the correctly reconstructed array
np.fromstring(obj_test_string, dtype=np.int16)  
#> array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
#     17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
#     34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
#     51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
#     68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
#     85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99], dtype=int16)

# Retrieve half of the object
# Here I get a ValueError
offset = 0
end = 100
obj_half_test = client.get_object(Bucket=bucket_name,
                                  Key='tab_test.bin',
                                  Range='bytes={}-{}'.format(offset, end))
# Read the body of the half-object response (not obj_test, which was already consumed)
obj_half_test_string = obj_half_test['Body'].read()

# Reconstruct Array
np.fromstring(obj_half_test_string, dtype=np.int16)  

On the last call, I get the following error:

ValueError: string size must be a multiple of element size

However, when I convert the numpy array to a string directly and slice it, it works:

# return a numpy array of 50 elements
np.fromstring(tab.tostring()[:100], dtype=np.int16)

> array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
   17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
   34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], dtype=int16)

*[EDIT]*

Another test, where I change the expected dtype of the half object:

np.fromstring(obj_half_test_string, dtype=np.int8)
> array([ 0,  0,  1,  0,  2,  0,  3,  0,  4,  0,  5,  0,  6,  0,  7,  0,  8,
        0,  9,  0, 10,  0, 11,  0, 12,  0, 13,  0, 14,  0, 15,  0, 16,  0,
       17,  0, 18,  0, 19,  0, 20,  0, 21,  0, 22,  0, 23,  0, 24,  0, 25,
        0, 26,  0, 27,  0, 28,  0, 29,  0, 30,  0, 31,  0, 32,  0, 33,  0,
       34,  0, 35,  0, 36,  0, 37,  0, 38,  0, 39,  0, 40,  0, 41,  0, 42,
        0, 43,  0, 44,  0, 45,  0, 46,  0, 47,  0, 48,  0, 49,  0, 50], dtype=int8)
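
The int8 output above can be reproduced locally without S3 by counting bytes. The following is only a sketch: it assumes the ranged read returned bytes 0 through 100 with both endpoints included, and it pins the array to little-endian int16 (`'<i2'`) so the byte order matches the output shown.

```python
import numpy as np

# Same data as in the question, pinned to little-endian int16 (assumption).
tab = np.arange(100, dtype='<i2')
raw = tab.tobytes()

# A Python slice excludes the end index: 100 bytes, i.e. 50 int16 items.
python_slice = raw[0:100]

# Stand-in for a read that includes both endpoints 0 and 100 (assumption).
inclusive_slice = raw[0:100 + 1]

print(len(python_slice), len(inclusive_slice))

# Viewed as int8, the longer payload has one extra trailing byte.
as_int8 = np.frombuffer(inclusive_slice, dtype=np.int8)
print(as_int8.size, as_int8[-1])
```

An odd byte count cannot be split into 2-byte int16 items, which is consistent with the ValueError reported for the half object.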

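*[EDIT2] Solution*

As stated in https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35, byte ranges are inclusive, which means that if I want the first 500 bytes I need to write bytes=0-499 rather than bytes=0-500. This is confirmed by checking len(obj_half_test_string), which returns 101. So when I change end from 100 to 99, it works as expected.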
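
To avoid the off-by-one when requesting whole array items, the Range value can be computed from item counts rather than typed by hand. This is a minimal sketch, not part of boto3: `item_range_header` and `fake_ranged_read` are hypothetical names, and `fake_ranged_read` only imitates an inclusive byte range on a local bytes object so the example runs without S3.

```python
import numpy as np


def item_range_header(first_item, n_items, dtype):
    """Return a Range value covering n_items whole elements (hypothetical helper)."""
    if first_item < 0 or n_items <= 0:
        raise ValueError("first_item must be >= 0 and n_items must be > 0")
    itemsize = np.dtype(dtype).itemsize
    start = first_item * itemsize
    # The last byte position is inclusive, hence the "- 1".
    end = start + n_items * itemsize - 1
    return 'bytes={}-{}'.format(start, end)


def fake_ranged_read(raw, range_value):
    """Local stand-in for a ranged read; both endpoints are included (assumption)."""
    start, end = range_value[len('bytes='):].split('-')
    return raw[int(start):int(end) + 1]


tab = np.arange(100, dtype=np.int16)
raw = tab.tobytes()

# First half and second half of the array, 50 items each.
first_half = item_range_header(0, 50, np.int16)
second_half = item_range_header(50, 50, np.int16)
print(first_half, second_half)

half = np.frombuffer(fake_ranged_read(raw, first_half), dtype=np.int16)
print(half.size)
```

With real S3 access, the returned string would be passed as the `Range` argument of `client.get_object`, as in the code from the question.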

0 answers:

No answers