Passing a Paramiko SFTPFile as input to dask.dataframe.read_parquet

Time: 2019-06-24 11:13:26

Tags: python sftp paramiko dask parquet

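I tried passing a paramiko.sftp_file.SFTPFile object instead of a file URL to pandas.read_parquet, and it worked fine. But when I tried the same thing with Dask, it raised an error. Below are the code I ran and the error I got. How can I make this work?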

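import dask.dataframe as dd
import paramiko

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
# (the ssh.connect(...) call is not shown in the original post)
sftp_client = ssh.open_sftp()
# parquet_file holds the remote path; it is defined elsewhere in the script
source_file = sftp_client.open(str(parquet_file), 'rb')
# Passing the SFTPFile object instead of a URL is what triggers the TypeError below
full_df = dd.read_parquet(source_file, engine='pyarrow')
print(len(full_df))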
Traceback (most recent call last):
  File "C:\Users\rrrrr\Documents\jackets_dask.py", line 22, in <module>
    full_df = dd.read_parquet(source_file,engine='pyarrow')
  File "C:\Users\rrrrr\AppData\Local\Programs\Python\Python37\lib\site-packages\dask\dataframe\io\parquet.py", line 1173, in read_parquet
    storage_options=storage_options
  File "C:\Users\rrrrr\AppData\Local\Programs\Python\Python37\lib\site-packages\dask\bytes\core.py", line 368, in get_fs_token_paths
    raise TypeError('url type not understood: %s' % urlpath)
TypeError: url type not understood: <paramiko.sftp_file.SFTPFile object at 0x0000007712D9A208>

2 answers:

Answer 0 (score: 1)

Dask does not directly support file-like objects.

You would have to implement Dask's "file system" interface.

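I am not sure what the minimal set of methods is that you need to implement for read_parquet to work, but you definitely have to implement open. Something like this: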

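import dask.bytes.core
import dask.dataframe as dd

class SftpFileSystem(object):
    # Minimal sketch: only open() is implemented; read_parquet may need more methods
    def open(self, path, mode='rb', **kwargs):
        # sftp_client is the paramiko SFTP client created in the question
        return sftp_client.open(path, mode)

# Register the class under the 'sftp' protocol (private registry of this Dask version)
dask.bytes.core._filesystems['sftp'] = SftpFileSystem

df = dd.read_parquet('sftp://remote/path/file', engine='pyarrow')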

For the full set of methods, see the implementation of LocalFileSystem.

The fsspec library actually includes a file system implementation for SFTP:
https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.implementations.sftp.SFTPFileSystem

See also: Is it possible to read a .csv from a remote server, using Paramiko and Dask's read_csv() method in conjunction?

Answer 1 (score: 1)

Things have changed, and you can now do this directly with Dask. The answer below is pasted from: Is it possible to read a .csv from a remote server, using Paramiko and Dask's read_csv() method in conjunction?

In the master version of Dask, file-system operations now use fsspec, which, along with the previous implementations (s3, gcs, hdfs), supports some additional file systems. For the mapping of protocol identifiers, see fsspec.registry.known_implementations.
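
As a quick check of the registry mentioned above, the following minimal sketch looks up whether the "sftp" protocol identifier is listed. The helper name sftp_protocol_registered is hypothetical, and the sketch assumes the installed fsspec exposes known_implementations as a mapping keyed by protocol name; it returns None when fsspec is not installed.

```python
def sftp_protocol_registered():
    """Return True/False depending on whether 'sftp' is a known fsspec protocol.

    Returns None if fsspec is not installed (hypothetical helper, for illustration).
    """
    try:
        # known_implementations maps protocol identifiers to implementation info
        from fsspec.registry import known_implementations
    except ImportError:
        return None
    return "sftp" in known_implementations


if __name__ == "__main__":
    print("sftp registered:", sftp_protocol_registered())
```

If this reports False, the installed fsspec is probably older than the version the answer refers to.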

In short, if you install fsspec and Dask from master, a URL like "sftp://user:pw@host:port/path" should now work.
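
A minimal sketch of that URL form applied to the question's read_parquet call, assuming Dask and fsspec versions with SFTP support (plus paramiko and pyarrow) are installed. The helper names build_sftp_url and read_remote_parquet are hypothetical, and credentials containing special characters are not escaped here.

```python
def build_sftp_url(user, password, host, port, path):
    """Assemble a URL of the form sftp://user:pw@host:port/path.

    `path` is expected to be absolute (starting with '/').
    """
    return "sftp://{}:{}@{}:{}{}".format(user, password, host, port, path)


def read_remote_parquet(url):
    """Lazily open a remote parquet dataset through Dask.

    Assumes the installed Dask resolves sftp:// URLs via fsspec.
    """
    # Imported here so the URL helper can be used without Dask installed
    import dask.dataframe as dd

    return dd.read_parquet(url, engine="pyarrow")


if __name__ == "__main__":
    # Placeholder credentials and path, for illustration only
    url = build_sftp_url("user", "pw", "host", 22, "/path/file.parquet")
    print(url)
    # full_df = read_remote_parquet(url)  # needs a reachable SFTP server
    # print(len(full_df))
```

With this approach Dask receives a URL string rather than an SFTPFile object, which avoids the "url type not understood" TypeError from the question.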