Question

I could not find a built-in sparklyr function for listing the contents of a directory via Spark, so I am trying to use invoke:

sc <- spark_connect(master = "yarn", config=config)
path <- 'gs:// ***path to bucket on google cloud*** '
spath <- sparklyr::invoke_new(sc, 'org.apache.hadoop.fs.Path', path) 
fs <- sparklyr::invoke(spath, 'getFileSystem')
list <- sparklyr:: invoke(fs, 'listLocatedStatus')

Error: java.lang.Exception: No matched method found for class org.apache.hadoop.fs.Path.getFileSystem
    at sparklyr.Invoke.invoke(invoke.scala:134)
    at sparklyr.StreamHandler.handleMethodCall(stream.scala:123)
    at sparklyr.StreamHandler.read(stream.scala:66) ...

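Note: Are there guidelines for reproducible examples with distributed code? I don't know how to make an example, given that I am running against a particular Spark environment.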

Answer 1

The getFileSystem method takes an org.apache.hadoop.conf.Configuration object as its first argument:

public FileSystem getFileSystem(Configuration conf) throws IOException

Return the FileSystem that owns this Path.

Parameters:

conf - the configuration to use when resolving the FileSystem

So the code to retrieve a FileSystem instance should look more or less like this:

# Retrieve Spark's Hadoop configuration
hconf <- sc %>% spark_context() %>% invoke("hadoopConfiguration")
fs <- sparklyr::invoke(spath, 'getFileSystem', hconf)

Additionally, listLocatedStatus takes either a Path

public org.apache.hadoop.fs.RemoteIterator<LocatedFileStatus> listLocatedStatus(Path f) throws FileNotFoundException, IOException

or a Path and a PathFilter (note that this implementation is protected):

protected org.apache.hadoop.fs.RemoteIterator<LocatedFileStatus> listLocatedStatus(Path f, PathFilter filter) throws FileNotFoundException, IOException

So if you want to structure your code as shown above, you have to provide at least a path:

sparklyr::invoke(fs, "listLocatedStatus", spath)

In practice, it might be easier to get the FileSystem directly:

fs <- invoke_static(sc, "org.apache.hadoop.fs.FileSystem", "get", hconf)

and use globStatus:

lls <- invoke(fs, "globStatus", spath)

where spath is a path with a wildcard, for example:

sparklyr::invoke_new(sc, 'org.apache.hadoop.fs.Path', "/some/path/*")

The result will be an R list, which can easily be iterated.
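One way to iterate over that result is sketched below. It is a minimal sketch, not the answer's original code: it assumes lls is the list of org.apache.hadoop.fs.FileStatus references returned by the globStatus call, and the helper name status_paths is hypothetical. It relies only on the Hadoop methods FileStatus.getPath() and Path.toString().

```r
# Hypothetical helper (not part of sparklyr): turn a list of
# org.apache.hadoop.fs.FileStatus references into a character vector of paths.
status_paths <- function(statuses) {
  vapply(
    statuses,
    function(status) {
      # FileStatus.getPath() returns a Path; Path.toString() gives its string form
      path <- sparklyr::invoke(status, "getPath")
      sparklyr::invoke(path, "toString")
    },
    character(1)
  )
}

# Usage, assuming `lls` comes from invoke(fs, "globStatus", spath):
# status_paths(lls)
```

Using vapply rather than lapply keeps the result a plain character vector, and an empty listing simply yields character(0).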

Credits:

The answer to "How can one list all csv files in an HDFS location within the Spark Scala shell?" by @jaime

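Notes:

In general, if you interact with a non-trivial Java API, it makes much more sense to write your code in Java or Scala and provide a minimal R interface.
For interactions with a specific file or object store, it might be easier to use a dedicated package. For Google Cloud Storage, you can take a look at googleCloudStorageR.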
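As a rough illustration of the dedicated-package route, the sketch below lists object names with googleCloudStorageR's gcs_list_objects. It assumes the package is installed and authentication has already been set up. The helper name list_gcs_objects and the bucket and prefix values in the usage comment are placeholders, not details from the question.

```r
# Hypothetical helper: list object names under an optional prefix in a bucket.
# Assumes googleCloudStorageR is installed and already authenticated.
list_gcs_objects <- function(bucket, prefix = NULL) {
  # gcs_list_objects returns a data frame describing the objects in the bucket
  objects <- googleCloudStorageR::gcs_list_objects(bucket, prefix = prefix)
  objects$name
}

# Usage (placeholder bucket and prefix):
# list_gcs_objects("my-bucket", prefix = "some/dir/")
```

This bypasses Spark entirely, so it only fits when the listing itself does not need to run on the cluster.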

Sparklyr: List contents of a directory in R using invoke methods
