Question

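I have tried every fix I could find while debugging, but I cannot get this function to run and scale on the cluster, where I need to process 100 million records. The script runs fine on a local node but fails to run on a Cloudera Amazon cluster. Sample data that works on the local node is linked below. As far as I can tell, the 2 files I use in the UDF are not being distributed to the executors/containers or nodes, and the job just keeps running and processes very slowly. I have not been able to fix this code so that it runs on the cluster.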

    ##Link to the 2 files which i use in the script###
    ##https://nlp.stanford.edu/software/stanford-ner-2015-12-09.zip
    ####Link to the data set########
    ##https://docs.google.com/spreadsheets/d/17b9NUonmFjp_W0dOe7nzuHr7yMM0ITTDPCBmZ6xM0iQ/edit?usp=drivesdk&lipi=urn%3Ali%3Apage%3Ad_flagship3_messaging%3BQHHZFKYfTPyRb%2FmUg6ahsQ%3D%3D

    #spark-submit --packages com.databricks:spark-csv_2.10:1.5.0 --master yarn-cluster --files /home/ec2-user/StanfordParser/stanford-ner-2016-10-31/stanford-ner.jar,/home/ec2-user/StanfordParser/stanford-ner-2016-10-31/classifiers/english.all.3class.distsim.crf.ser.gz stanford_ner.py


    import os

    # NLTK wrapper around the Stanford NER jar (needed for StanfordNERTagger below)
    from nltk.tag import StanfordNERTagger
    from pyspark import SparkContext, SparkFiles
    from pyspark.sql import SQLContext
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    def stanford(text):
        os.environ['JAVA_HOME']='/usr/java/jdk1.8.0_131/'
        # Resolve the local copies of the files shipped with --files
        stanford_classifier = SparkFiles.get("english.all.3class.distsim.crf.ser.gz")
        stanford_ner_path = SparkFiles.get("stanford-ner.jar")
        st = StanfordNERTagger(stanford_classifier, stanford_ner_path, encoding='utf-8')
        output = st.tag(text.split())
        organizations = []
        organization = ""
        for t in output:
            #The word
            word = t[0]
            #What is the current tag
            tag = t[1]
            #print(word, tag)
            #If the token is tagged ORGANIZATION, append it to the words collected so far
            if (tag == "ORGANIZATION"):
                organization += " " + word
        #Collect the result only after every token has been visited
        organizations.append(organization)
        final = "-".join(organizations)
        return final


    stanford_lassification = udf(stanford, StringType())

    ###################Pyspark Section###############
    #Set context
    sc = SparkContext.getOrCreate()
    sc.setLogLevel("DEBUG")
    sqlContext = SQLContext(sc)

    #Get data
    df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load(r"/Downloads/authors_data.csv")

    #Create new dataframe with new column organization
    df = df.withColumn("organizations", stanford_lassification(df['affiliation_string']))

    #Save result
    df.select('pmid','affiliation_string','organizations').write.format('com.databricks.spark.csv').save(r"/Downloads/organizations.csv")
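
One thing that may explain the slowness: as far as I understand NLTK's wrapper, every `StanfordNERTagger.tag()` call launches a separate Java process, so a row-by-row UDF pays that startup cost 100 million times. Below is a minimal sketch of a possible workaround, not a verified fix: build the tagger once per partition and call `tag_sents()` on batches of rows. The names `batched`, `extract_organizations`, `tag_partition`, `make_tagger` and the `batch_size` default are hypothetical, and I am assuming each input row is a `(pmid, affiliation_string)` pair.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def extract_organizations(tagged_tokens):
    """Join every token tagged ORGANIZATION, like the loop in the script."""
    return " ".join(word for word, tag in tagged_tokens if tag == "ORGANIZATION")


def tag_partition(rows, make_tagger, batch_size=500):
    """Tag one partition of (pmid, affiliation_string) pairs.

    `make_tagger` is a hypothetical zero-argument factory. It is called once
    per partition, and tag_sents() is called once per batch rather than once
    per row.
    """
    tagger = make_tagger()
    for chunk in batched(rows, batch_size):
        # Treat missing affiliation strings as empty sentences
        sentences = [(text or "").split() for _, text in chunk]
        tagged_sentences = tagger.tag_sents(sentences)
        for (pmid, text), tagged in zip(chunk, tagged_sentences):
            yield pmid, text, extract_organizations(tagged)


# Assumed wiring on the cluster (not executed here):
#   def make_tagger():
#       from nltk.tag import StanfordNERTagger
#       from pyspark import SparkFiles
#       return StanfordNERTagger(
#           SparkFiles.get("english.all.3class.distsim.crf.ser.gz"),
#           SparkFiles.get("stanford-ner.jar"), encoding="utf-8")
#
#   rdd = df.select("pmid", "affiliation_string").rdd.map(tuple)
#   result = rdd.mapPartitions(lambda rows: tag_partition(rows, make_tagger))
```

Passing the factory in, rather than building the tagger at module level, keeps the `SparkFiles.get` lookups on the executor side and makes the batching logic easy to try locally with a stand-in tagger. Whether this also resolves the file distribution problem on the cluster is something I have not been able to confirm.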

Stanford CoreNLP use case with a PySpark script runs fine on a local node but very slowly in YARN cluster mode

0 answers: