Unable to read/write S3 via pandas / smart_open on AWS EMR PySpark

Time: 2019-01-31 09:21:40

Tags: pandas amazon-s3 pyspark amazon-emr

I want to read/write some pandas DataFrames to S3 through the PySpark interpreter.

What I have tried:

  1. Reading

    pd.read_csv('s3://xxx/userdevice/part-00000-3332c494-9b5b-4781-8482-1c96f3efda21-c000.csv')
    

    Error:

    Fail to execute line 2: pd.read_csv('s3://xxx/userdevice/part-00000-3332c494-9b5b-4781-8482-1c96f3efda21-c000.csv')
    Traceback (most recent call last):
      File "/tmp/zeppelin_pyspark-3312558659230129318.py", line 380, in <module>
        exec(code, _zcUserQueryNameSpace)
      File "<stdin>", line 2, in <module>
      File "/usr/local/lib64/python2.7/site-packages/pandas/io/parsers.py", line 678, in parser_f
        return _read(filepath_or_buffer, kwds)
      File "/usr/local/lib64/python2.7/site-packages/pandas/io/parsers.py", line 424, in _read
        filepath_or_buffer, encoding, compression)
      File "/usr/local/lib64/python2.7/site-packages/pandas/io/common.py", line 209, in get_filepath_or_buffer
        mode=mode)
      File "/usr/local/lib64/python2.7/site-packages/pandas/io/s3.py", line 38, in get_filepath_or_buffer
        filepath_or_buffer = fs.open(_strip_schema(filepath_or_buffer), mode)
      File "/usr/local/lib/python2.7/site-packages/s3fs/core.py", line 335, in open
        s3_additional_kwargs=kw)
      File "/usr/local/lib/python2.7/site-packages/s3fs/core.py", line 1143, in __init__
        info = self.info()
      File "/usr/local/lib/python2.7/site-packages/s3fs/core.py", line 1161, in info
        refresh=refresh, **kwargs)
      File "/usr/local/lib/python2.7/site-packages/s3fs/core.py", line 478, in info
        raise FileNotFoundError(path)
    FileNotFoundError: xxx/userdevice/part-00000-3332c494-9b5b-4781-8482-1c96f3efda21-c000.csv
    
  2. I was worried that some Spark nodes were not configured for S3, so I created the boto3 session manually (see the sketches after this list):

    def s3_open(path, *args, **kwargs):
        import boto3
        import smart_open  # needed for smart_open.smart_open below
        session = boto3.Session(
            aws_access_key_id='xxxx',
            aws_secret_access_key='xxxxxx',
        )
        f = smart_open.smart_open(path, *args, s3_session=session, **kwargs)
        return f
    
    with s3_open('s3://xxx/balance/%s.xlsx' % name, 'w') as f:
        r.to_excel(f)
    

    I got this error:

    Fail to execute line 10: with s3_open('s3://xxx/balance/%s.xlsx' % name, 'w') as f:
    Traceback (most recent call last):
      File "/tmp/zeppelin_pyspark-3312558659230129318.py", line 380, in <module>
        exec(code, _zcUserQueryNameSpace)
      File "<stdin>", line 10, in <module>
      File "<stdin>", line 7, in s3_open
      File "/usr/local/lib/python2.7/site-packages/smart_open/smart_open_lib.py", line 231, in smart_open
        binary, filename = _open_binary_stream(uri, binary_mode, **kw)
      File "/usr/local/lib/python2.7/site-packages/smart_open/smart_open_lib.py", line 338, in _open_binary_stream
        return _s3_open_uri(parsed_uri, mode, **kw), filename
      File "/usr/local/lib/python2.7/site-packages/smart_open/smart_open_lib.py", line 400, in _s3_open_uri
        return smart_open_s3.open(parsed_uri.bucket_id, parsed_uri.key_id, mode, **kwargs)
      File "/usr/local/lib/python2.7/site-packages/smart_open/s3.py", line 74, in open
        fileobj = BufferedOutputBase(bucket_id, key_id, min_part_size=s3_min_part_size, **kwargs)
      File "/usr/local/lib/python2.7/site-packages/smart_open/s3.py", line 369, in __init__
        raise ValueError('the bucket %r does not exist, or is forbidden for access' % bucket)
    ValueError: the bucket u'xxx' does not exist, or is forbidden for access
    
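
Both tracebacks point the same way: s3fs raises `FileNotFoundError` and smart_open reports the bucket as missing or forbidden, which is what unauthenticated or under-privileged S3 access looks like from Python, whereas Spark's own `s3://` paths go through EMRFS and the cluster's IAM role. A minimal probe one could run in the same interpreter, assuming boto3 is installed on the driver and reusing the question's placeholder bucket and key:

    import boto3

    s3 = boto3.client('s3')  # uses the EMR instance profile when no keys are configured
    # Raises ClientError (404/403) if the key is missing or access is denied:
    s3.head_object(
        Bucket='xxx',
        Key='userdevice/part-00000-3332c494-9b5b-4781-8482-1c96f3efda21-c000.csv',
    )
    # Listing the bucket separates "no such key" from "no permission at all":
    print(s3.list_objects_v2(Bucket='xxx', MaxKeys=5).get('Contents'))

If this probe fails, the problem is in the instance-profile policy or the credentials, not in pandas or smart_open.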
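
For the Excel write in attempt 2 specifically, one way to sidestep smart_open entirely is to let pandas write to local disk on the driver and then upload the file with boto3. A sketch reusing the question's own `r`, `name`, and placeholder bucket (and assuming an Excel writer engine such as openpyxl is installed, as the original attempt already requires):

    import boto3

    local_path = '/tmp/%s.xlsx' % name   # temporary file on the driver node
    r.to_excel(local_path)               # pandas never touches S3 here
    # upload_file(Filename, Bucket, Key) handles multipart upload internally
    boto3.client('s3').upload_file(local_path, 'xxx', 'balance/%s.xlsx' % name)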

The pandas code above works on that same machine when run as plain Python; it just does not work from the PySpark interpreter.


But PySpark itself is fine; the following, for example, works:

    df1.coalesce(1).write.csv('s3://xxx/storeincomelogs', header=True, mode="overwrite")
    df2.coalesce(1).write.csv('s3://xxx/storeproductincomelogs', header=True, mode="overwrite")
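
Since the `s3://` path demonstrably works from Spark, another option is to route the pandas data through a Spark DataFrame and let EMRFS handle the S3 I/O. A sketch, assuming an active `spark` session and a pandas DataFrame `pdf` (hypothetical names):

    # pandas -> Spark, then write through EMRFS, which already authenticates
    sdf = spark.createDataFrame(pdf)
    sdf.coalesce(1).write.csv('s3://xxx/balance', header=True, mode='overwrite')

    # And the reverse: read with Spark, convert back to pandas on the driver
    pdf2 = spark.read.csv('s3://xxx/userdevice', header=True).toPandas()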

0 Answers:

No answers yet.