Question

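I'm running into problems when running multiple threads against a single SparkContext sc. I create a thread pool, and each thread loads data from the same table, but different rows. Each thread handles its loaded data as a DataFrame and applies several transformations to it.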

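My current problem occurs when I apply the transformations. For example, inside a thread I have a DataFrame and apply a transformation such as StringIndexer(inputCol="category", outputCol="indexed").fit(df).transform(df). While these threads are running, however, I eventually get IllegalArgumentException: u'requirement failed: Output column indexed already exists.'. The strange thing is that with a pool size of 1 everything runs as it should, but as soon as I increase the number of threads it fails. It looks as if the same DataFrame is being used in several threads. I don't know why this happens: I don't share any variables or anything else between the threads (only the SparkContext).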

To create the threads I use multiprocessing.pool.ThreadPool.
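
The question does not show how the pool is driven, so the following is only a minimal sketch of one way the work could be dispatched. The names process_models, chunk and run_in_pool are hypothetical, and process_models is a stub standing in for the real Spark function so that the example runs without a cluster.

```python
from multiprocessing.pool import ThreadPool


def process_models(models):
    # Hypothetical stand-in for processModels(sc, table, models): the real
    # function would run Spark queries; this stub only echoes what it received.
    return [(model, "processed") for model in models]


def chunk(items, n_chunks):
    # Split items into n_chunks interleaved lists so each thread gets distinct models.
    return [items[i::n_chunks] for i in range(n_chunks)]


def run_in_pool(models, pool_size):
    pool = ThreadPool(pool_size)
    try:
        # Each worker thread receives its own list of model ids.
        results = pool.map(process_models, chunk(models, pool_size))
    finally:
        pool.close()
        pool.join()
    # Flatten the per-thread results into a single list.
    return [item for batch in results for item in batch]


if __name__ == "__main__":
    print(run_in_pool([1, 2, 3, 4, 5, 6], pool_size=3))
```

With pool_size=1 this degenerates to a sequential loop, which matches the case reported to work.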

from pyspark.ml import PipelineModel
from pyspark.ml.feature import StringIndexer

def processModels(sc, table, models):
    # sqlContext is assumed to be a module-level SQLContext built from sc.
    for model in models:
        # Load the data corresponding to that model.
        query = "select * from %s where model=%d" % (table, model)
        df = sqlContext.sql(query)
        df = df.fillna(0).fillna("None").cache()

        # Apply some transformations to the dataset.
        stages = []
        for column_name, column_type in df.dtypes:
            if column_type == "string":
                # Index each string column into a new "<name>_cat" column.
                indexerModel = StringIndexer(inputCol=column_name, outputCol="%s_cat" % column_name).fit(df)
                # Keep only indexers that found more than one distinct label.
                if len(indexerModel.labels) > 1:
                    stages.append(indexerModel.copy())
        pipeline = PipelineModel(stages=stages)
        df = pipeline.transform(df)
        print("Model %d has %d samples." % (model, df.count()))

PySpark: working with multiple threads

0 answers: