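Iterating over a large dataset with the Django ORM is slow

Time: 2015-03-06 20:55:30

Tags: python django

I am using the Django ORM to fetch data from a database that holds several million items. However, the computation takes a long time (40+ minutes), and I am not sure how to pinpoint where the problem is.

The models I am using: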

class user_chartConfigurationData(models.Model):
    username_chartNum = models.ForeignKey(user_chartConfiguration, related_name='user_chartConfigurationData_username_chartNum')
    openedConfig = models.ForeignKey(user_chartConfigurationChartID, related_name='user_chartConfigurationData_user_chartConfigurationChartID')
    username_selects = models.CharField(max_length=200)
    blockName = models.CharField(max_length=200)
    stage = models.CharField(max_length=200)
    variable = models.CharField(max_length=200)
    condition = models.CharField(max_length=200)
    value = models.CharField(max_length=200)
    type = models.CharField(max_length=200)
    order = models.IntegerField()

    def __unicode__(self):
        return str(self.username_chartNum)

class data_parsed(models.Model):
    setid = models.ForeignKey(sett, related_name='data_parsed_setid', primary_key=True)
    setid_hash = models.CharField(max_length=100, db_index = True)
    block = models.CharField(max_length=2000, db_index = True)
    username = models.CharField(max_length=2000, db_index = True)
    time = models.IntegerField(db_index = True)
    time_string = models.CharField(max_length=200, db_index = True)

    def __unicode__(self):
        return str(self.setid)

class unique_variables(models.Model):
    setid = models.ForeignKey(sett, related_name='unique_variables_setid')
    setid_hash = models.CharField(max_length=100, db_index = True)
    block = models.CharField(max_length=200, db_index = True)
    stage = models.CharField(max_length=200, db_index = True)
    variable = models.CharField(max_length=200, db_index = True)
    value = models.CharField(max_length=2000, db_index = True)

    class Meta:
        unique_together = (("setid", "block", "variable", "stage", "value"),)

The code I am running loops over data_parsed and, for each row, matches the related data between user_chartConfigurationData and unique_variables.

#After we get the tab, we will get the configuration data from the config button. We will need the tab ID, which is chartNum, and the actual chart
#That is opened, which is the chartID.
chartIDKey = user_chartConfigurationChartID.objects.get(chartID = chartID)
for i in user_chartConfigurationData.objects.filter(username_chartNum = chartNum, openedConfig = chartIDKey).order_by('order').iterator():
    iterator = data_parsed.objects.all().iterator()

    #We will loop through parsed objects, and at the same time using the setid (unique for all blocks), which contains multiple
    #variables. Using the condition, we can set the variable gte (greater than equal), or lte (less than equal), so that the condition match
    #the setid for the data_parsed object, and variable condition
    for contents in iterator:
        #These are two flags, found is when we already have an entry inside a dictionary that already
        #matches the same setid. Meaning they are the same blocks. For example FlowBranch and FlowPure can belong
        #to the same block. Hence when we find an entry that matches the same id, we will put it in the same dictionary.
        #Added is used when the current item does not map to a previous setid entry in the dictionary. Then we will need
        #to add this new entry to the array of dictionary (set_of_pk_values). Otherwise, we will be adding a lot
        #of entries that doesn't have any values for variables (because the value was added to another entry inside a dictionary)
        found = False
        added = False
        storeItem = {}

        #Initial information for the row
        storeItem['block'] = contents.block
        storeItem['username'] = contents.username
        storeItem['setid'] = contents.setid
        storeItem['setid_hash'] = contents.setid_hash

        if (i.variable != ""):
            for findPrevious in set_of_pk_values:
                if(str(contents.setid) == str(findPrevious['setid'])):
                    try:
                        items = unique_variables.objects.get(setid = contents.setid, variable = i.variable)
                        findPrevious[variableName] = items.value
                        found = True
                        break
                    except:
                        pass
            if(found == False):
                try:
                    items = unique_variables.objects.get(setid = contents.setid, variable = i.variable)
                    storeItem[variableName] = items.value
                    added = True
                except:
                    pass
        if(found == False and added == True):
            storeItem['time_string'] = contents.time_string
            set_of_pk_values.append(storeItem)

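I tried using select_related() or prefetch_related(), since the code has to go to the unique_variables objects and fetch some data, but it still takes a very long time.

Is there a better way to approach this?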

1 answer:

Answer 0: (score: 2)

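Sure, take a look at django_debug_toolbar. It tells you how many queries are executed and how long each one takes. I really can't live without it when I have to optimize something =).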

PS: Execution will be even slower while the toolbar is profiling.
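
To make the "number of queries" idea concrete, here is a minimal sketch that does not use django_debug_toolbar or Django at all. It uses the standard-library sqlite3 module and its trace callback to count statements. The table layout, the sample rows, and the helper names (count_queries, per_row_lookups, bulk_lookup) are assumptions made up for illustration. It compares one lookup per row, which is roughly the shape of the unique_variables.objects.get(...) calls inside the loop in the question, against a single query whose result is kept in a dictionary.

```python
import sqlite3


def count_queries(conn, work):
    """Run work(conn) and return (result, number of SQL statements executed)."""
    statements = []
    conn.set_trace_callback(statements.append)  # called once per executed statement
    try:
        result = work(conn)
    finally:
        conn.set_trace_callback(None)  # stop tracing
    return result, len(statements)


def per_row_lookups(conn):
    """One SELECT per setid: the query count grows with the number of rows."""
    values = {}
    for setid in range(100):
        row = conn.execute(
            "SELECT value FROM unique_variables WHERE setid = ? AND variable = ?",
            (setid, "flow"),
        ).fetchone()
        if row is not None:
            values[setid] = row[0]
    return values


def bulk_lookup(conn):
    """A single SELECT; the rows are then indexed in memory by setid."""
    rows = conn.execute(
        "SELECT setid, value FROM unique_variables WHERE variable = ?",
        ("flow",),
    ).fetchall()
    return dict(rows)


# Hypothetical stand-in for the unique_variables table from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE unique_variables (setid INTEGER, variable TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO unique_variables VALUES (?, ?, ?)",
    [(i, "flow", str(i * 10)) for i in range(100)],
)
conn.commit()

slow_values, slow_count = count_queries(conn, per_row_lookups)
fast_values, fast_count = count_queries(conn, bulk_lookup)
print("per-row lookups:", slow_count, "queries")
print("bulk lookup:", fast_count, "queries")
```

Both functions return the same mapping; only the number of round trips to the database differs. A query counter such as the toolbar would be expected to show this kind of difference for the loop in the question, although the real numbers depend on the data.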

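Edit: You may also want to enable db_index on the fields you filter by, or index_together when you filter on several fields at once. Of course, measure the timing before and after each change to confirm which option is actually better.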
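
As a rough illustration of why an index covering the filtered fields can matter, the sketch below again uses the standard-library sqlite3 module rather than Django. The table, the index name idx_setid_variable, and the query_plan helper are assumptions for this example only. In a Django project the index would come from the model definition (db_index or index_together, as mentioned above) and a migration, not from hand-written SQL. The sketch asks SQLite for its query plan before and after a multi-column index exists.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan for a query as one line of text."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable description is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


# Hypothetical stand-in for the unique_variables table from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE unique_variables (setid INTEGER, variable TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO unique_variables VALUES (?, ?, ?)",
    [(i, "flow", str(i * 10)) for i in range(1000)],
)
conn.commit()

lookup = "SELECT value FROM unique_variables WHERE setid = ? AND variable = ?"

# Without an index, the filter has to scan the whole table.
plan_before = query_plan(conn, lookup, (42, "flow"))

# A multi-column index on the two fields used in the filter.
conn.execute(
    "CREATE INDEX idx_setid_variable ON unique_variables (setid, variable)"
)
plan_after = query_plan(conn, lookup, (42, "flow"))

print("before:", plan_before)
print("after: ", plan_after)
```

The plan text is specific to SQLite and its wording varies between versions; other databases have their own EXPLAIN output. Checking the plan only shows whether an index is used, so timing the real workload before and after, as the answer suggests, is still what decides whether the change helps.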