Accessing S3 from Dask.bag

Date: 2016-10-11 04:43:06

Tags: amazon-s3 amazon-ec2 dask

As the title says, I am trying to read a single file from S3 on an EC2 instance using dask.bag:

from distributed import Executor, progress
from dask import delayed
import dask
import dask.bag as db

data = db.read_text('s3://pycuda-euler-data/Ba10k.sim1.fq')

I get a long error:

---------------------------------------------------------------------------
ClientError                               Traceback (most recent call last)
/home/ubuntu/anaconda3/lib/python3.5/site-packages/s3fs/core.py in info(self, path, refresh)
    322                 bucket, key = split_path(path)
--> 323                 out = self.s3.head_object(Bucket=bucket, Key=key)
    324                 out = {'ETag': out['ETag'], 'Key': '/'.join([bucket, key]),

/home/ubuntu/anaconda3/lib/python3.5/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    277             # The "self" in this scope is referring to the BaseClient.
--> 278             return self._make_api_call(operation_name, kwargs)
    279 

/home/ubuntu/anaconda3/lib/python3.5/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
    571         if http.status_code >= 300:
--> 572             raise ClientError(parsed_response, operation_name)
    573         else:

ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

During handling of the above exception, another exception occurred:

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-43-0ad435c69ecc> in <module>()
      4 #data = db.read_text('/Users/zen/Code/git/sra_data.fastq')
      5 #data = db.read_text('/Users/zen/Code/git/pycuda-euler/data/Ba10k.sim1.fq')
----> 6 data = db.read_text('s3://pycuda-euler-data/Ba10k.sim1.fq', blocksize=900000)

/home/ubuntu/anaconda3/lib/python3.5/site-packages/dask/bag/text.py in read_text(urlpath, blocksize, compression, encoding, errors, linedelimiter, collection, storage_options)
     89             _, blocks = read_bytes(urlpath, delimiter=linedelimiter.encode(),
     90                     blocksize=blocksize, sample=False, compression=compression,
---> 91                     **(storage_options or {}))
     92             if isinstance(blocks[0], (tuple, list)):
     93                 blocks = list(concat(blocks))

/home/ubuntu/anaconda3/lib/python3.5/site-packages/dask/bytes/core.py in read_bytes(urlpath, delimiter, not_zero, blocksize, sample, compression, **kwargs)
    210     return read_bytes(storage_options.pop('path'), delimiter=delimiter,
    211             not_zero=not_zero, blocksize=blocksize, sample=sample,
--> 212             compression=compression, **storage_options)
    213 
    214 

/home/ubuntu/anaconda3/lib/python3.5/site-packages/dask/bytes/s3.py in read_bytes(path, s3, delimiter, not_zero, blocksize, sample, compression, **kwargs)
     91             offsets = [0]
     92         else:
---> 93             size = getsize(s3_path, compression, s3)
     94             offsets = list(range(0, size, blocksize))
     95             if not_zero:

/home/ubuntu/anaconda3/lib/python3.5/site-packages/dask/bytes/s3.py in getsize(path, compression, s3)
    185 def getsize(path, compression, s3):
    186     if compression is None:
--> 187         return s3.info(path)['Size']
    188     else:
    189         with s3.open(path, 'rb') as f:

/home/ubuntu/anaconda3/lib/python3.5/site-packages/s3fs/core.py in info(self, path, refresh)
    327                 return out
    328             except (ClientError, ParamValidationError):
--> 329                 raise FileNotFoundError(path)
    330 
    331     def _walk(self, path, refresh=False):

FileNotFoundError: pycuda-euler-data/Ba10k.sim1.fq

As far as I can tell, this is exactly what the docs say to do. Unfortunately, many of the examples I have seen online use the old from_s3() method, which no longer exists.

However, I can access the file using s3fs:

# s3 here is the dask.bytes.s3 module; s3files is an s3fs.S3FileSystem instance created beforehand
sample, partitions = s3.read_bytes('pycuda-euler-data/Ba10k.sim1.fq', s3=s3files, delimiter=b'\n')

sample

b'@gi|30260195|ref|NC_003997.3|_5093_5330_1:0:0_1:0:0_0/1\nGATAACTCGATTTAAACCAGATCCAGAAAATTTTCA\n+\n222222222222222222222222222222222222\n@gi|30260195|ref|NC_003997.3|_7142_7326_1:1:0_0:0:0_1/1\nCTATTCCGCCGCATCAACTTGGTGAAGTAATGGATG\n+\n222222222222222222222222222222222222\n@gi|30260195|ref|NC_003997.3|_5524_5757_3:0:0_2:0:0_2/1\nGTAATTTAACTGGTGAGGACGTGCGTGATGGTTTAT\n+\n222222222222222222222222222222222222\n@gi|30260195|ref|NC_003997.3|_2706_2926_1:0:0_3:0:0_3/1\nAGTAAAACAGATATTTTTGTAAATAGAAAAGAATTT\n+\n222222222222222222222222222222222222\n@gi|30260195|ref|NC_003997.3|_500_735_3:1:0_0:0:0_4/1\nATACTCTGTGGTAAATGATTAGAATCATCTTGTGCT\n+\n222222222222222222222222222222222222\n@gi|30260195|ref|NC_003997.3|_2449_2653_3:0:0_1:0:0_5/1\nCTTGAATTGCTACAGATAGTCATAGGTTAGCCCTTC\n+\n222222222222222222222222222222222222\n@gi|30260195|ref|NC_003997.3|_3252_3460_0:0:0_0:0:0_6/1\nCGATGTAATTGATACAGGTGGCGCTGTAAAATGGTT\n+\n222222222222222222222222222222222222\n@gi|30260195|ref|NC_003997.3|_1860_2095_0:0:0_1:0:0_7/1\nATAAAAGATTCAATCGAAATATCAGCATCGTTTCCT\n+\n222222222222222222222222222222222222\n@gi|30260195|ref|NC_003997.3|_870_1092_1:0:0_0:0:0_8/1\nTTGGAAAAACCCATTTAATGCATGCAATTGGCCTTT\n ... etc.

What am I doing wrong?

EDIT:

Following the suggestions, I went back and checked permissions. On the bucket I added a Grantee Everyone List permission, and on the file a Grantee Everyone Open/Download permission. I still get the same error.
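(For reference, a minimal check with s3fs directly, on the assumption that the object is now world-readable; anon=True makes s3fs issue unsigned requests and is my addition here, not part of the attempts above:)

import s3fs

# Unsigned (anonymous) request; only succeeds if the object is genuinely public
fs = s3fs.S3FileSystem(anon=True)
print(fs.info('pycuda-euler-data/Ba10k.sim1.fq'))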

1 Answer:

Answer 0 (score: 2):

Dask uses the s3fs library to manage data on S3, and the s3fs project in turn uses Amazon's boto3. You can provide credentials in two ways:

Use a .boto file

You can place a .boto file in your home directory.
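A minimal sketch of such a file, with placeholder values (the [Credentials] section name follows boto's config-file format):

[Credentials]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY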

Use the storage_options= keyword

You can add a storage_options= keyword to your db.read_text call to include credential information manually. This option is a dictionary whose contents will be passed to the s3fs.S3FileSystem constructor.
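For example, a minimal sketch with placeholder credentials (key and secret are parameter names accepted by the s3fs.S3FileSystem constructor):

import dask.bag as db

# The storage_options dict is passed straight through to s3fs.S3FileSystem (placeholders shown)
data = db.read_text('s3://pycuda-euler-data/Ba10k.sim1.fq',
                    storage_options={'key': 'YOUR_ACCESS_KEY_ID',
                                     'secret': 'YOUR_SECRET_ACCESS_KEY'})

For a genuinely public object, storage_options={'anon': True} requests unsigned (anonymous) access instead.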