Question

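I have a process that imports data from a CSV file and then processes that data to run various calculations and set values.

The imported objects usually number in the tens of thousands, so celery handles the process: it first imports the data, then runs queries to process it. With large data sets, the processing stage keeps hitting the old "MySQL has gone away" error. The code looks like this: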

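from django.db import models
# Assumption: bulk_update is the helper from the django-bulk-update package
from bulk_update.helper import bulk_update


class ImportFile(models.Model):

    category = models.ForeignKey('category')

    file = models.FileField()

    def import_file(self):
        # Import the CSV rows first; only process them if the import was clean
        result = import_data(self.file)

        if not result.has_errors():
            self.process_data()

    def process_data(self):
        # One ordered queryset covering every result in this category
        results = Result.objects.select_related('user').filter(
            category=self.category
        ).order_by('finish_time')

        for result in results:
            calculate_points(result)

        # Write all the calculated points back in a single bulk update
        bulk_update(results, update_fields=['points'])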

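As I understand it, when you run a query you have the length of CONN_MAX_AGE to finish working with that queryset before the connection times out. So I implemented a process that fetches the queryset in batches, to increase the chance of finishing each query within the allotted time. CONN_MAX_AGE was originally set to 100, so I increased it to 300, but that alone didn't help.

So, to split the process into smaller querysets, I changed the above to something like this: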
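For reference, the setting in question lives in the Django DATABASES dictionary. This is only a sketch: the database name, user, password, and host below are placeholders and not taken from the real project; only the CONN_MAX_AGE value reflects what is described above.

```python
# settings.py (sketch; every value except CONN_MAX_AGE is a placeholder)
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "example_db",        # hypothetical database name
        "USER": "example_user",      # hypothetical user
        "PASSWORD": "change-me",     # hypothetical password
        "HOST": "db",                # hypothetical docker service name
        "PORT": "3306",
        "CONN_MAX_AGE": 300,         # seconds; raised from the original 100
    }
}
```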

def chunks(l, n):
    for i in range(0, len(l), n):
        yield l[i:i+n]

...

def process_data(self):
    # Get a list of all the IDs we need to process
    results = Result.objects.select_related('user').filter(
        category=self.category
    ).order_by('finish_time').values_list('id', flat=True)

    processed_results = []
    batch_size = 1000
    # Split the list of IDs into batches of 1000 IDs
    batches = list(chunks(results, batch_size))

    for batch in batches:
        # Get the full objects for the batch of IDs
        results_batch = Result.objects.select_related(
            'user'
        ).filter(
            id__in=batch,
            category=self.category
        ).order_by('finish_time')

        processed_results += results_batch

        # Iterate over the queryset to run calculations
        for result in results_batch:
            calculate_points(result)

    # Make sure we got everything & save the lot all at once
    assert len(processed_results) == len(results)
    bulk_update(processed_results, update_fields=['points'])
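To make the batching step concrete, here is how the chunks helper behaves on a plain list. The list of integers is just a stand-in for the IDs returned by values_list, and the batch size of 4 is chosen only to keep the output short.

```python
def chunks(l, n):
    """Yield successive slices of l, each holding at most n items."""
    for i in range(0, len(l), n):
        yield l[i:i + n]


ids = list(range(1, 11))         # stand-in for the list of result IDs
batches = list(chunks(ids, 4))   # the last batch may be shorter than n

print(batches)  # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
```

Note that the helper relies on len() and slicing, so it needs a sequence (or an evaluated queryset) rather than an arbitrary iterator.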

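This approach of processing the data in batches still causes the database connection to drop when running in docker (default MySQL 5.6 settings). Is there a better way to batch-process an ordered data set that avoids the connection lifetime limit?

Batching querysets to avoid database connection timeouts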

0 Answers: