I installed Spark 2.3.0 locally and am using PySpark. I can process local files without any problem.
But when I have to read from HDFS, I cannot.
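Roughly what I am trying (the namenode host, port, and path below are placeholders, not my real setup):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-read-test").getOrCreate()

    # Reading a local file works fine:
    local_df = spark.read.csv("file:///C:/data/sample.csv", header=True)

    # Reading the same kind of file from HDFS fails:
    hdfs_df = spark.read.csv("hdfs://localhost:9000/data/sample.csv", header=True)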
I am confused about how Spark accesses Hadoop files. When installing Spark I was asked to copy winutils, and I don't understand what winutils does.
Do we need to bring up the Hadoop services first? When I use an externally installed Hadoop and try to use it from Spark, I get a java.lang.UnsatisfiedLinkError. Any pointer to the right documentation would be a great help.
Thanks, Kiran
Answer 0 (score: 0)
If you are running your application with spark-submit in cluster mode, it accepts a --files flag for shipping files from the driver node to the workers. I believe the reason you are able to run in local mode is that your driver and workers are on the same machine, whereas in cluster mode the driver and workers may be on different machines; in that case Spark needs to know which files to send to the worker nodes. As described in Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia, the following flags are available (an example invocation follows the list):
--master
Indicates the cluster manager to connect to. The options for this flag are described in Table 7-1.
--deploy-mode
Whether to launch the driver program locally (“client”) or on one of the worker machines inside the cluster (“cluster”). In client mode spark-submit will run your driver on the same machine where spark-submit is itself being invoked. In cluster mode, the driver will be shipped to execute on a worker node in the cluster. The default is client mode.
--class
The “main” class of your application if you’re running a Java or Scala program.
--name
A human-readable name for your application. This will be displayed in Spark’s web UI.
--jars
A list of JAR files to upload and place on the classpath of your application. If your application depends on a small number of third-party JARs, you can add them here.
--files
A list of files to be placed in the working directory of your application. This can be used for data files that you want to distribute to each node.
--py-files
A list of files to be added to the PYTHONPATH of your application. This can contain .py, .egg, or .zip files.
--executor-memory
The amount of memory to use for executors, in bytes. Suffixes can be used to specify larger quantities such as “512m” (512 megabytes) or “15g” (15 gigabytes).
--driver-memory
The amount of memory to use for the driver process, in bytes. Suffixes can be used to specify larger quantities such as “512m” (512 megabytes) or “15g” (15 gigabytes).
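For example, a hypothetical spark-submit invocation combining several of these flags (the master, file names, memory sizes, and app.py are placeholders to adjust to your cluster):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --name my-app \
      --files lookup.csv \
      --py-files helpers.zip \
      --executor-memory 2g \
      --driver-memory 1g \
      app.py

With --files, lookup.csv is placed in the working directory of each executor, so the application can open it by its bare file name on every node.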
UPDATE: I am assuming Kiran has Hadoop set up (as he mentioned it is external) and is unable to read from HDFS programmatically. If that is not the case, please ignore this answer.
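If the problem is the Windows-side setup rather than shipping files, a common pattern is to point HADOOP_HOME at the directory that contains bin\winutils.exe before creating the session, then read with a full hdfs:// URI. A minimal sketch, assuming winutils is unpacked under C:\hadoop\bin and the namenode listens on the default port (both are assumptions, adjust to your core-site.xml):

    import os

    # On Windows, Spark expects winutils.exe in %HADOOP_HOME%\bin; a missing
    # or version-mismatched hadoop.dll/winutils.exe is a frequent cause of
    # java.lang.UnsatisfiedLinkError.
    os.environ["HADOOP_HOME"] = "C:\\hadoop"  # assumed location

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-read").getOrCreate()

    # Use the namenode host/port from fs.defaultFS in your core-site.xml;
    # hdfs://localhost:9000 here is only a placeholder.
    df = spark.read.text("hdfs://localhost:9000/user/kiran/sample.txt")
    df.show()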