Question

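I wrote a Django application that tokenizes a list of documents, and I tried to run it in parallel with multiprocessing. However, I found that the cores are not all running at 100%. Instead they take turns: one core runs at 100% while the other threads sit almost idle. I am running Ubuntu 14.04 on a 4-core, 8-thread machine with Python 2.7. I have simplified my code below to make it easier to follow.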

tokenization.py
def compute_customizedStopwords():
    customizedStopwords = set()
    # build the stopword set: one lowercased entry per line of the dictionary file
    with open(BASE_DIR + "/app1/NLP/Dictionary/humanDecisionDictionary.txt", 'r') as stopword_dictionary:
        for line in stopword_dictionary:
            customizedStopwords.add(line.strip('\n').lower())
    return customizedStopwords

def tokenize_task(narrative, customizedStopwords):
    tokens = narrative.corpus.split(",")
    tokens = [token for token in tokens if token not in customizedStopwords]  # remove stopwords
    newTokenObjects = [Token(token=token) for token in tokens]
    Token.objects.bulk_create(newTokenObjects)  # save all tokens to the database
    return tokens


views.py
from multiprocessing import Pool
from django.http import HttpResponse

def tokenize(request):
    narratives = models.Narrative.objects.all()  # get all documents
    customizedStopwords = compute_customizedStopwords()  # get the stopword set
    pool = Pool()
    results = [pool.apply(tokenize_task, args=(narrative, customizedStopwords)) for narrative in narratives]
    tokens = [token for result in results for token in result]  # flatten the per-document token lists
    return HttpResponse(tokens)

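Is this because the database write is the bottleneck? Tokenization itself is very fast, but perhaps only one thread can write to the database at a time, which blocks all the others. If that is the case, is there any way to optimize this code?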
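One possible cause worth checking, independent of the database: `Pool.apply` blocks until its result is ready, so a list comprehension of `pool.apply(...)` calls submits one task, waits for it, and only then submits the next. That would match the pattern of one core at 100% while the rest idle. The sketch below is a minimal, database-free illustration of handing all the work to the pool at once with `Pool.map`. The names `tokenize_text`, `flatten`, and `tokenize_parallel` are hypothetical, and plain strings stand in for `narrative.corpus`.

```python
from functools import partial
from multiprocessing import Pool


def tokenize_text(corpus, stopwords):
    """CPU-only step: split one comma-separated corpus and drop stopwords."""
    return [token for token in corpus.split(",") if token not in stopwords]


def flatten(per_document_tokens):
    """Turn a list of token lists into one flat list of tokens."""
    return [token for tokens in per_document_tokens for token in tokens]


def tokenize_parallel(corpora, stopwords, processes=None, chunksize=100):
    """Tokenize every corpus in worker processes; results keep input order."""
    worker = partial(tokenize_text, stopwords=stopwords)
    pool = Pool(processes)
    try:
        # map() queues all tasks at once and only then waits, unlike apply(),
        # which waits for each task before the next one is submitted.
        return pool.map(worker, corpora, chunksize=chunksize)
    finally:
        pool.close()
        pool.join()
```

With this shape the workers do only CPU work and the parent receives the token lists, so the parent could issue the `bulk_create` calls itself. Whether that is faster overall depends on the database, which this sketch does not measure.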

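I am also concerned about the stopwords set. I suspect Python copies this object for every job and ships it to each task. That would increase the memory cost, especially since I have 1 million documents in the database.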

My Django multiprocessing application cannot use 100% of all cores

0 answers: