I installed Spark 2.3.0 locally and am using PySpark. I can process local files without any problem.
But when I have to read from HDFS, I cannot.
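Roughly what I am trying (the namenode host, port, and path below are placeholders, not my real setup):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-read-test").getOrCreate()

    # Reading a local file works fine:
    local_df = spark.read.csv("file:///C:/data/sample.csv", header=True)

    # Reading the same kind of file from HDFS fails:
    hdfs_df = spark.read.csv("hdfs://localhost:9000/data/sample.csv", header=True)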
I am confused about how Spark accesses Hadoop files. When installing Spark I was asked to copy winutils, and I don't understand what winutils does.
Do we need to bring up the Hadoop services first? When I use an externally installed Hadoop and try to use it from Spark, I get a java.lang.UnsatisfiedLinkError. Any pointer to the right documentation would be a great help.
Thanks, Kiran
Answer 0 (score: 0)
If you are running your application with spark-submit in cluster mode, it accepts a --files flag for shipping files from the driver node to the workers. I believe the reason you are able to run in local mode is that your driver and workers are on the same machine, whereas in cluster mode the driver and workers may be on different machines; in that case Spark needs to know which files to send to the worker nodes. As described in Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia, the following flags are available (an example invocation follows the list):
--master
Indicates the cluster manager to connect to. The options for this flag are described in Table 7-1.
--deploy-mode
Whether to launch the driver program locally (“client”) or on one of the worker machines inside the cluster (“cluster”). In client mode spark-submit will run your driver on the same machine where spark-submit is itself being invoked. In cluster mode, the driver will be shipped to execute on a worker node in the cluster. The default is client mode.
--class
The “main” class of your application if you’re running a Java or Scala program.
--name
A human-readable name for your application. This will be displayed in Spark’s web UI.
--jars
A list of JAR files to upload and place on the classpath of your application. If your application depends on a small number of third-party JARs, you can add them here.
--files
A list of files to be placed in the working directory of your application. This can be used for data files that you want to distribute to each node.
--py-files
A list of files to be added to the PYTHONPATH of your application. This can contain .py, .egg, or .zip files.
--executor-memory
The amount of memory to use for executors, in bytes. Suffixes can be used to specify larger quantities such as “512m” (512 megabytes) or “15g” (15 gigabytes).
--driver-memory
The amount of memory to use for the driver process, in bytes. Suffixes can be used to specify larger quantities such as “512m” (512 megabytes) or “15g” (15 gigabytes).
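For example, a hypothetical spark-submit invocation combining several of these flags (the master, file names, memory sizes, and app.py are placeholders to adjust to your cluster):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --name my-app \
      --files lookup.csv \
      --py-files helpers.zip \
      --executor-memory 2g \
      --driver-memory 1g \
      app.py

With --files, lookup.csv is placed in the working directory of each executor, so the application can open it by its bare file name on every node.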
UPDATE: I am assuming Kiran has Hadoop set up (as he mentioned it is external) and is unable to read from HDFS programmatically. If that is not the case, please ignore this answer.
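If the problem is the Windows-side setup rather than shipping files, a common pattern is to point HADOOP_HOME at the directory that contains bin\winutils.exe before creating the session, then read with a full hdfs:// URI. A minimal sketch, assuming winutils is unpacked under C:\hadoop\bin and the namenode listens on the default port (both are assumptions, adjust to your core-site.xml):

    import os

    # On Windows, Spark expects winutils.exe in %HADOOP_HOME%\bin; a missing
    # or version-mismatched hadoop.dll/winutils.exe is a frequent cause of
    # java.lang.UnsatisfiedLinkError.
    os.environ["HADOOP_HOME"] = "C:\\hadoop"  # assumed location

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-read").getOrCreate()

    # Use the namenode host/port from fs.defaultFS in your core-site.xml;
    # hdfs://localhost:9000 here is only a placeholder.
    df = spark.read.text("hdfs://localhost:9000/user/kiran/sample.txt")
    df.show()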