Question

I want to run my code on a cluster. My code:

import java.util.Properties

import edu.stanford.nlp.ling.CoreAnnotations._
import edu.stanford.nlp.pipeline._
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.JavaConversions._
import scala.collection.mutable.ArrayBuffer

object Pre2 {

  def plainTextToLemmas(text: String, pipeline: StanfordCoreNLP): Seq[String] = {
    val doc = new Annotation(text)
    pipeline.annotate(doc)
    val lemmas = new ArrayBuffer[String]()
    val sentences = doc.get(classOf[SentencesAnnotation])
    for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
      val lemma = token.get(classOf[LemmaAnnotation])
      if (lemma.length > 0 ) {
        lemmas += lemma.toLowerCase
      }
    }
    lemmas
  }
  def main(args: Array[String]): Unit = {

    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("pre2")

    val sc = new SparkContext(conf)
    val plainText = sc.textFile("data/in.txt")
    val lemmatized = plainText.mapPartitions(p => {
      // Build one pipeline per partition, since creating it is expensive
      val props = new Properties()
      props.put("annotators", "tokenize, ssplit, pos, lemma")
      val pipeline = new StanfordCoreNLP(props)
      p.map(q => plainTextToLemmas(q, pipeline))
    })
    // Join lemmas with spaces; mkString also handles empty sequences safely
    val lemmatized1 = lemmatized.map(l => l.mkString(" "))
    val lemmatized2 = lemmatized1.filter(_.nonEmpty)
    lemmatized2.coalesce(1).saveAsTextFile("data/out.txt")
  }
}

And the cluster specs:

2 nodes

Each node has 60 GB of RAM

Each node has 48 cores

Shared disk

I installed Spark on this cluster. One node is both master and worker, and the other node is a worker.

When I run my code with this command in the terminal:

./bin/spark-submit --master spark://192.168.1.20:7077 --class Main --deploy-mode cluster code/Pre2.jar

It shows:

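15/08/19 15:27:21 WARN RestSubmissionClient: Unable to connect to server spark://192.168.1.20:7077.
Warning: Master endpoint spark://192.168.1.20:7077 was not a REST server. Falling back to legacy submission gateway instead.
15/08/19 15:27:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Driver successfully submitted as driver-20150819152724-0002
... waiting before polling master for driver state
... polling master for driver state
State of driver-20150819152724-0002 is RUNNING
Driver running on 1192.168.1.19:33485 (worker-20150819115013-192.168.1.19-33485)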

How can I run the above code on a Spark standalone cluster?

Answer 1

Make sure to check the web UI on port 8080. In your example, that would be 192.168.1.20:8080.

If you are running in Spark standalone cluster mode, try it without --deploy-mode cluster, and hard-code your node memory by adding --executor-memory 60g.

Answer 2

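"Warning: Master endpoint spark://192.168.1.20:7077 was not a REST server." From the error, it looks like the master's REST URL is different. The REST URL can be found in the UI at master_url:8080.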

Answer 3

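- --deploy-mode cluster is specific to YARN; do not use it here.
- Check in the Spark UI whether both nodes appear in the cluster. With the given configuration, I would start a worker on both nodes.
- You have hard-coded the master as local with .setMaster("local"). Either remove it or replace it with the master URL.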

I assume you want to run in Spark standalone cluster mode rather than YARN cluster mode, since the rest of the configuration is for
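
A minimal sketch of the last point in the list above, assuming the master URL is supplied to spark-submit with --master rather than set in code. The object name Pre2Conf and its build method are hypothetical, introduced only for illustration.

```scala
import org.apache.spark.SparkConf

// Hypothetical helper, not part of the original code.
object Pre2Conf {
  // No setMaster call here: the master is assumed to come from
  // spark-submit's --master option instead of being hard-coded.
  def build(): SparkConf =
    new SparkConf().setAppName("pre2")
}
```

In main, new SparkContext(Pre2Conf.build()) would then use whatever master is given on the command line, for example --master spark://192.168.1.20:7077. A master set directly in code takes precedence over the command-line value, which is why the hard-coded "local" keeps the job on a single machine.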

How to run this code in Spark cluster mode
