As the title says, I am trying to use dask.bag to read a single file from S3 on an EC2 instance:
from distributed import Executor, progress
from dask import delayed
import dask
import dask.bag as db
data = db.read_text('s3://pycuda-euler-data/Ba10k.sim1.fq')
I get a long error:
---------------------------------------------------------------------------
ClientError Traceback (most recent call last)
/home/ubuntu/anaconda3/lib/python3.5/site-packages/s3fs/core.py in info(self, path, refresh)
322 bucket, key = split_path(path)
--> 323 out = self.s3.head_object(Bucket=bucket, Key=key)
324 out = {'ETag': out['ETag'], 'Key': '/'.join([bucket, key]),
/home/ubuntu/anaconda3/lib/python3.5/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
277 # The "self" in this scope is referring to the BaseClient.
--> 278 return self._make_api_call(operation_name, kwargs)
279
/home/ubuntu/anaconda3/lib/python3.5/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
571 if http.status_code >= 300:
--> 572 raise ClientError(parsed_response, operation_name)
573 else:
ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
During handling of the above exception, another exception occurred:
FileNotFoundError Traceback (most recent call last)
<ipython-input-43-0ad435c69ecc> in <module>()
4 #data = db.read_text('/Users/zen/Code/git/sra_data.fastq')
5 #data = db.read_text('/Users/zen/Code/git/pycuda-euler/data/Ba10k.sim1.fq')
----> 6 data = db.read_text('s3://pycuda-euler-data/Ba10k.sim1.fq', blocksize=900000)
/home/ubuntu/anaconda3/lib/python3.5/site-packages/dask/bag/text.py in read_text(urlpath, blocksize, compression, encoding, errors, linedelimiter, collection, storage_options)
89 _, blocks = read_bytes(urlpath, delimiter=linedelimiter.encode(),
90 blocksize=blocksize, sample=False, compression=compression,
---> 91 **(storage_options or {}))
92 if isinstance(blocks[0], (tuple, list)):
93 blocks = list(concat(blocks))
/home/ubuntu/anaconda3/lib/python3.5/site-packages/dask/bytes/core.py in read_bytes(urlpath, delimiter, not_zero, blocksize, sample, compression, **kwargs)
210 return read_bytes(storage_options.pop('path'), delimiter=delimiter,
211 not_zero=not_zero, blocksize=blocksize, sample=sample,
--> 212 compression=compression, **storage_options)
213
214
/home/ubuntu/anaconda3/lib/python3.5/site-packages/dask/bytes/s3.py in read_bytes(path, s3, delimiter, not_zero, blocksize, sample, compression, **kwargs)
91 offsets = [0]
92 else:
---> 93 size = getsize(s3_path, compression, s3)
94 offsets = list(range(0, size, blocksize))
95 if not_zero:
/home/ubuntu/anaconda3/lib/python3.5/site-packages/dask/bytes/s3.py in getsize(path, compression, s3)
185 def getsize(path, compression, s3):
186 if compression is None:
--> 187 return s3.info(path)['Size']
188 else:
189 with s3.open(path, 'rb') as f:
/home/ubuntu/anaconda3/lib/python3.5/site-packages/s3fs/core.py in info(self, path, refresh)
327 return out
328 except (ClientError, ParamValidationError):
--> 329 raise FileNotFoundError(path)
330
331 def _walk(self, path, refresh=False):
FileNotFoundError: pycuda-euler-data/Ba10k.sim1.fq
As far as I can tell, this is exactly what the documentation says to do. Unfortunately, many of the examples I've found online use the old from_s3() method, which no longer exists.
However, I can access the file using s3fs:
sample, partitions = s3.read_bytes('pycuda-euler-data/Ba10k.sim1.fq', s3=s3files, delimiter=b'\n')
sample
b'@gi|30260195|ref|NC_003997.3|_5093_5330_1:0:0_1:0:0_0/1\nGATAACTCGATTTAAACCAGATCCAGAAAATTTTCA\n+\n222222222222222222222222222222222222\n@gi|30260195|ref|NC_003997.3|_7142_7326_1:1:0_0:0:0_1/1\nCTATTCCGCCGCATCAACTTGGTGAAGTAATGGATG\n+\n222222222222222222222222222222222222\n@gi|30260195|ref|NC_003997.3|_5524_5757_3:0:0_2:0:0_2/1\nGTAATTTAACTGGTGAGGACGTGCGTGATGGTTTAT\n+\n222222222222222222222222222222222222\n@gi|30260195|ref|NC_003997.3|_2706_2926_1:0:0_3:0:0_3/1\nAGTAAAACAGATATTTTTGTAAATAGAAAAGAATTT\n+\n222222222222222222222222222222222222\n@gi|30260195|ref|NC_003997.3|_500_735_3:1:0_0:0:0_4/1\nATACTCTGTGGTAAATGATTAGAATCATCTTGTGCT\n+\n222222222222222222222222222222222222\n@gi|30260195|ref|NC_003997.3|_2449_2653_3:0:0_1:0:0_5/1\nCTTGAATTGCTACAGATAGTCATAGGTTAGCCCTTC\n+\n222222222222222222222222222222222222\n@gi|30260195|ref|NC_003997.3|_3252_3460_0:0:0_0:0:0_6/1\nCGATGTAATTGATACAGGTGGCGCTGTAAAATGGTT\n+\n222222222222222222222222222222222222\n@gi|30260195|ref|NC_003997.3|_1860_2095_0:0:0_1:0:0_7/1\nATAAAAGATTCAATCGAAATATCAGCATCGTTTCCT\n+\n222222222222222222222222222222222222\n@gi|30260195|ref|NC_003997.3|_870_1092_1:0:0_0:0:0_8/1\nTTGGAAAAACCCATTTAATGCATGCAATTGGCCTTT\n ... etc.
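(For reference, the s3 in that call is the dask.bytes.s3 module that appears in the traceback, and s3files is an s3fs filesystem object; roughly, a setup along these lines:)

import s3fs
from dask.bytes import s3      # the module from dask/bytes/s3.py in the traceback above

s3files = s3fs.S3FileSystem()  # picks up AWS credentials from the environment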
What am I doing wrong?
Following a suggestion, I went back and checked the permissions. On the bucket I added a Grantee Everyone List permission, and on the file a Grantee Everyone Open/Download permission. I still get the same error.
Answer 0 (score: 2)
Dask uses the s3fs library to manage data on S3, and the s3fs project uses Amazon's boto3. You can provide credentials in two ways (sketched after this list):
- A .boto file: you can put a .boto file in your home directory.
- The storage_options= keyword: you can add a storage_options= keyword to your db.read_text call to include the credential information by hand. This option is a dictionary whose values are added to the s3fs.S3FileSystem constructor.
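A minimal sketch of the storage_options= route (the key values below are placeholders; anon=True assumes the bucket really is publicly readable):

import dask.bag as db

# Pass credentials straight through to the s3fs.S3FileSystem constructor:
data = db.read_text('s3://pycuda-euler-data/Ba10k.sim1.fq',
                    storage_options={'key': 'YOUR_AWS_ACCESS_KEY_ID',
                                     'secret': 'YOUR_AWS_SECRET_ACCESS_KEY'})

# For a bucket that has been opened to everyone, anonymous access also works:
data = db.read_text('s3://pycuda-euler-data/Ba10k.sim1.fq',
                    storage_options={'anon': True})

For the first route, a .boto file in your home directory uses the standard boto config layout:

[Credentials]
aws_access_key_id = YOUR_AWS_ACCESS_KEY_ID
aws_secret_access_key = YOUR_AWS_SECRET_ACCESS_KEY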