Question

I want to load a word2vec model and evaluate it on a word analogy task (e.g. a is to b as c is to what?). To do this, I first load my w2v model:

model = Word2VecModel.load(spark.sparkContext, str(sys.argv[1]))

Then I call a mapper to evaluate the model:

rdd_lines = spark.read.text("questions-words.txt").rdd.map(getAnswers)

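The getAnswers function reads one line at a time from questions-words.txt, where each line contains a question together with the answer used to evaluate my model (e.g. Athens Greece Baghdad Iraq, where a = Athens, b = Greece, c = Baghdad and the expected answer = Iraq). After reading the line, I create current_question and actual_answer (e.g. current_question=Athens Greece Baghdad and actual_answer=Iraq). I then call the getAnalogy function, which computes the analogy (that is, given the question, it computes the answer). Finally, once the analogy has been computed, I return the answer and write it to a text file.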
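The parsing step described above can be sketched in plain Python. This is only an illustration: the name parse_question_line is hypothetical, lowercasing assumes the model's vocabulary is lowercase, and lines starting with ":" are treated as the section headers that questions-words.txt contains.

```python
def parse_question_line(line):
    """Split one line into (question_words, expected_answer).

    Returns None for blank lines, section headers (": capital-common-countries")
    and anything that does not consist of exactly four words.
    """
    line = line.strip()
    if not line or line.startswith(":"):
        return None
    # Assumption: the model was trained on lowercased text.
    parts = line.lower().split()
    if len(parts) != 4:
        return None
    return parts[:3], parts[3]


question, expected = parse_question_line("Athens Greece Baghdad Iraq\n")
print(question, expected)
```

Keeping the question as a list of three words, rather than joining it back into one string, means that indexing it later yields words and not single characters.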

The problem is that I get the following exception:

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers.

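I think it is thrown because I use the model inside the map function. This question is similar to my problem, but I don't know how to apply its answer to my code. How can I solve this? Here is the full code: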

def getAnalogy(s, model):
    try:
        qry = model.transform(s[0]) - model.transform(s[1]) - model.transform(s[2])    
        res = model.findSynonyms((-1)*qry,5) # return 5 "synonyms"
        res = [x[0] for x in res]
        for k in range(0,3):
            if s[k] in res:
                res.remove(s[k])
        return res[0]
    except ValueError:
        return "NOT FOUND"

def getAnswers (text):
    tmp = text[0].split(' ', 3)
    answer_list = []
    current_question = " ".join(str(x) for x in tmp[:3])
    actual_answer = tmp[-1]

    model_answer = getAnalogy(current_question, model)
    if model_answer is "NOT FOUND":
        answer_list.append("NOT FOUND\n")
    elif model_answer is actual_answer:
        answer_list.append("TRUE\n")
    else:
        answer_list.append("FALSE:\n")
    return answer_list.append


if __name__ == "__main__":

    if len(sys.argv) != 3:
        print("Usage: my_test <file>", file=sys.stderr)
        exit(-1)


    spark = SparkSession\
    .builder\
    .appName("my_test")\
    .getOrCreate()


    model = Word2VecModel.load(spark.sparkContext, str(sys.argv[1]))

    rdd_lines = spark.read.text("questions-words.txt").rdd.map(getAnswers)

    dataframe = rdd_lines.toDF()

    dataframe.write.text(str(sys.argv[2]))

    spark.stop()

Answer 1

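As you already suspect, you cannot use the model inside the map function. On the other hand, the questions-words.txt file is not that big (~20K lines), so you are better off doing the evaluation with vanilla Python list comprehensions (this is essentially the first suggested answer in the question you linked); it won't be fast, but it is only a one-off task. Here is a way to do it, using my getAnalogy function as you have already extended it for error handling (note that I have removed the 'comment' lines from questions-words.txt, and that you should convert everything to lowercase, which you do not seem to do in your code):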

from pyspark.mllib.feature import Word2Vec, Word2VecModel
# sc is the active SparkContext (e.g. spark.sparkContext)
model = Word2VecModel.load(sc, "word2vec/demo_200") # model built with k=200
with open('/home/ctsats/word2vec/questions-words.txt') as f:
    lines = f.readlines()
lines2 = [x.lower() for x in lines] # all to lowercase
lines3 = [x.strip('\n') for x in lines2] # remove end-of-line characters
lines4 = [x.split(' ',3) for x in lines3]
lines4[0] # check:
# ['athens', 'greece', 'baghdad', 'iraq']

def getAnswers (text, model):
    actual_answer = text[-1]
    question = [text[0], text[1], text[2]]
    model_answer = getAnalogy(question, model)
    if model_answer == "NOT FOUND":
        correct_answer = "NOT FOUND"
    elif model_answer == actual_answer:
        correct_answer = "TRUE"
    else:
        correct_answer = "FALSE"
    return text, model_answer, correct_answer

So your evaluation list can now be built as

answer_list = [getAnswers(x, model) for x in lines4]

A sample of the first 20 entries (for the model built with k=200) can be inspected with answer_list[:20].
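The driver-side evaluation loop can also be tried end to end without a Spark session by swapping in a toy stand-in for the model. The sketch below is only an illustration under stated assumptions: ToyModel, its 3-dimensional vectors and the snake_case helper names are hypothetical, and only the method names transform and findSynonyms mirror the Word2VecModel calls used in this thread.

```python
import math
from collections import Counter


class ToyModel:
    """Hypothetical in-memory stand-in for Word2VecModel (not the Spark class)."""

    def __init__(self, vectors):
        self.vectors = vectors  # dict: word -> list of floats

    def transform(self, word):
        if word not in self.vectors:
            raise ValueError("word not in vocabulary: " + word)
        return self.vectors[word]

    def findSynonyms(self, vector, num):
        def cosine(u, v):
            dot = sum(a * b for a, b in zip(u, v))
            norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
            return dot / norm if norm else 0.0

        scored = [(w, cosine(vector, v)) for w, v in self.vectors.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:num]


def get_analogy(question, model):
    """Answer 'a is to b as c is to ?' via the vector b - a + c."""
    try:
        va, vb, vc = (model.transform(w) for w in question)
    except ValueError:  # a question word is missing from the vocabulary
        return "NOT FOUND"
    target = [b - a + c for a, b, c in zip(va, vb, vc)]
    candidates = [w for w, _ in model.findSynonyms(target, 5) if w not in question]
    return candidates[0] if candidates else "NOT FOUND"


def get_answers(text, model):
    model_answer = get_analogy(text[:3], model)
    if model_answer == "NOT FOUND":
        verdict = "NOT FOUND"
    elif model_answer == text[-1]:
        verdict = "TRUE"
    else:
        verdict = "FALSE"
    return text, model_answer, verdict


# Toy vectors: the third component marks "is a country".
model = ToyModel({
    "athens": [1.0, 0.0, 0.0], "greece": [1.0, 0.0, 1.0],
    "baghdad": [0.0, 1.0, 0.0], "iraq": [0.0, 1.0, 1.0],
})
lines4 = [
    ["athens", "greece", "baghdad", "iraq"],
    ["baghdad", "iraq", "athens", "greece"],
    ["athens", "greece", "baghdad", "athens"],  # deliberately wrong expected answer
    ["athens", "greece", "paris", "france"],    # "paris" is out of vocabulary
]
answer_list = [get_answers(x, model) for x in lines4]
tally = Counter(verdict for _, _, verdict in answer_list)
print(tally)
```

Everything here runs on the driver as ordinary Python, which is why no SparkContext reference is ever shipped to the workers; the Counter at the end gives a quick accuracy summary of the evaluation list.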

How to load a word2vec model and call its functions in a mapper
