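I tried passing a paramiko.sftp_file.SFTPFile object instead of a file URL to pandas.read_parquet, and it worked fine. But when I tried the same with Dask, it threw an error. Below are the code I tried to run and the error I get. How can I make this work?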
import dask.dataframe as dd
import paramiko
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
sftp_client = ssh.open_sftp()
source_file = sftp_client.open(str(parquet_file), 'rb')
full_df = dd.read_parquet(source_file, engine='pyarrow')
print(len(full_df))
Traceback (most recent call last):
  File "C:\Users\rrrrr\Documents\jackets_dask.py", line 22, in <module>
    full_df = dd.read_parquet(source_file,engine='pyarrow')
  File "C:\Users\rrrrr\AppData\Local\Programs\Python\Python37\lib\site-packages\dask\dataframe\io\parquet.py", line 1173, in read_parquet
    storage_options=storage_options
  File "C:\Users\rrrrr\AppData\Local\Programs\Python\Python37\lib\site-packages\dask\bytes\core.py", line 368, in get_fs_token_paths
    raise TypeError('url type not understood: %s' % urlpath)
TypeError: url type not understood: <paramiko.sftp_file.SFTPFile object at 0x0000007712D9A208>
Answer 0 (score: 1)
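Dask does not support file-like objects directly.
You would have to implement its "file system" interface.
I'm not sure what the minimal set of methods is that you need to implement for read_parquet to work, but you definitely have to implement open. Something like this: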
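import dask.bytes.core
import dask.dataframe as dd

class SftpFileSystem(object):
    def open(self, path, mode='rb', **kwargs):
        # Delegate to the connected paramiko SFTP client from the question
        return sftp_client.open(path, mode)

dask.bytes.core._filesystems['sftp'] = SftpFileSystem
df = dd.read_parquet('sftp://remote/path/file', engine='pyarrow')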
For the full set of methods, see the implementation of LocalFileSystem.
There is actually a file system implementation for SFTP in the fsspec library:
https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.implementations.sftp.SFTPFileSystem
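To illustrate why a URL string works where a file-like object does not, here is a minimal, standard-library-only sketch of protocol dispatch. It is a toy model, not Dask's or fsspec's actual code: the names `_filesystems`, `MemoryFileSystem`, and `open_url`, and the `memory` protocol, are hypothetical, and the registry holds an instance rather than a class to keep the example short.

```python
import io
from urllib.parse import urlsplit

# Hypothetical registry mapping a protocol name to a filesystem object.
_filesystems = {}


class MemoryFileSystem:
    """Toy filesystem that serves bytes from a dict instead of a remote host."""

    def __init__(self, files):
        self.files = files

    def open(self, path, mode="rb", **kwargs):
        # Return a file-like object for the requested path.
        return io.BytesIO(self.files[path])


def open_url(urlpath):
    """Pick a filesystem from the URL's protocol, then open the path on it."""
    if not isinstance(urlpath, str):
        # A file-like object carries no protocol to dispatch on.
        raise TypeError("url type not understood: %s" % urlpath)
    parts = urlsplit(urlpath)
    fs = _filesystems[parts.scheme]
    return fs.open(parts.path)


_filesystems["memory"] = MemoryFileSystem({"/data/file.bin": b"hello"})

if __name__ == "__main__":
    print(open_url("memory:///data/file.bin").read())
```

The point of the design is that the reader function only ever sees a string, so whoever ends up doing the read can open the file themselves through the registered filesystem.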
Answer 1 (score: 1)
Things have changed, and you can now do this directly with Dask. Pasting an answer from "Is it possible to read a .csv from a remote server, using Paramiko and Dask's read_csv() method in conjunction?":
In the master version of Dask, file-system operations now use fsspec, which, along with the previous implementations (s3, gcs, hdfs), supports some additional file-systems; see the mapping of protocol identifiers in fsspec.registry.known_implementations.
In short, if you install fsspec and Dask from master, a URL like "sftp://user:pw@host:port/path" should now work for you.
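As a rough sketch of that URL form, the snippet below shows how such a URL breaks down into connection details, followed by the Dask call itself. The helper names `describe_sftp_url` and `read_remote_parquet` and the host `example.com` are hypothetical. The parsing uses Python's `urllib` rather than fsspec's own code, so it illustrates the URL shape and not fsspec's internals. Running `read_remote_parquet` assumes Dask, fsspec, paramiko, and pyarrow are installed and that the server is reachable.

```python
from urllib.parse import urlsplit


def describe_sftp_url(url):
    """Split an sftp:// URL into host, port, credentials, and remote path."""
    parts = urlsplit(url)
    if parts.scheme != "sftp":
        raise ValueError("expected an sftp:// URL, got %r" % url)
    return {
        "host": parts.hostname,
        "port": parts.port or 22,  # 22 is the default SSH/SFTP port
        "username": parts.username,
        "password": parts.password,
        "path": parts.path,
    }


def read_remote_parquet(url):
    """Read a remote Parquet file into a Dask DataFrame via an sftp:// URL."""
    # Imported lazily so describe_sftp_url works without Dask installed.
    import dask.dataframe as dd

    return dd.read_parquet(url, engine="pyarrow")


if __name__ == "__main__":
    # Hypothetical host and path, for illustration only.
    print(describe_sftp_url("sftp://user:pw@example.com:2222/data/file.parquet"))
```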