Question

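I have a CSV file with 15,000,000 records that I am trying to load into a Cassandra table. Here is a sample of the column headers and data:

[image: sample of the CSV column headers and data]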

To make this easier to follow, here is my model in Python:

class DIDSummary(Model):
    __keyspace__ = 'processor_api'

    did = columns.Text(required=True, primary_key=True, partition_key=True)
    month = columns.DateTime(required=True, primary_key=True, partition_key=True)
    direction = columns.Text(required=True, primary_key=True)
    duration = columns.Counter(required=True)
    cost = columns.Counter(required=True)

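Now I am trying to process the data in each row of the CSV file and insert it in batches of 250, 500, 1000, 10000, and so on, but the results are always the same (about 0.33 seconds per 1000 rows, which means it would take roughly 90 minutes to get through all of them). I also tried using a multiprocessing pool and calling apply_async() for each batch.execute() call, with no better results. Is there a way to use SSTableWriter from Python, or something else I can do to insert these rows into Cassandra faster? For reference, here is my process_sheet_row() method: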

def process_sheet_row(self, row, batch):
    report_datetime = '{0}{1:02d}'.format(self.report.report_year, self.report.report_month)
    duration = int(float(row[self.columns['DURATION']]) * 10)
    cost = int(float(row[self.columns['COST']]) * 100000)

    anisummary = DIDSummary.batch(batch).create(did='{}{}'.format(self.report.ani_country_code, row[self.columns['ANI']]),
                                                direction='from',
                                                month=datetime.datetime.strptime(report_datetime, '%Y%m'))
    anisummary.duration += duration
    anisummary.cost += cost
    anisummary.batch(batch).save()

    destsummary = DIDSummary.batch(batch).create(did='{}{}'.format(self.report.dest_country_code, row[self.columns['DEST']]),
                                                 direction='to',
                                                 month=datetime.datetime.strptime(report_datetime, '%Y%m'))
    destsummary.duration += duration
    destsummary.cost += cost
    destsummary.batch(batch).save()

Any help is greatly appreciated. Thanks!

EDIT: Here is my code that goes through the file and processes it:

with open(self.path) as csvfile:
    reader = csv.DictReader(csvfile)
    if arr[0] == 'inventory':
        self.parse_inventory(reader)
    b = BatchQuery(batch_type=BatchType.Unlogged)
    i = 1
    for row in reader:
        self.parse_sheet_row(row, b)
        if not i % 1000:
            connection.check_connection() # This just makes sure we're still connected to cassandra. Check code below
            self.pool.apply_async(b.execute())
            b = BatchQuery(batch_type=BatchType.Unlogged)
        i += 1
print "Done processing: {}".format(self.path)
print "Time to Execute: {}".format(datetime.datetime.now() - start)
print "Batches: {}".format(i / 1000)
print "Records processed: {}".format(i - 1)

In case it helps, here is the connection.check_connection() method (and the methods around it):

def setup_defaults():
    connection.setup(['127.0.0.1'], 'processor_api', lazy_connect=True)

def check_connection():
    from cdr.models import DIDSummary
    try:
        DIDSummary.objects.all().count()
    except CQLEngineException:
        setup_defaults()

Answer 1

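Batches are generally not the fastest way to execute inserts, especially unlogged batches that contain many different partitions. The original answer linked to some further reading on batches.

If you can stand to leave cqlengine for the inserts, you should try cassandra.concurrent.execute_concurrent, which the Python driver implements using async callback chaining.

I saw a major improvement in inserts per second after moving to this method, having previously misused batches of various sizes, but YMMV.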
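To make that suggestion concrete, here is a minimal sketch of what the counter updates could look like with a prepared statement and execute_concurrent_with_args, the convenience wrapper in cassandra.concurrent that takes one statement and many parameter tuples. Several things are assumptions of mine rather than anything from the question: the table name did_summary (check what cqlengine actually created for DIDSummary), the CSV header names ANI, DEST, DURATION and COST, and the helper names report_month, counter_params, chunks and load. The driver import sits inside load() so the pure-Python helpers can be tried without a cluster.

```python
import datetime
import itertools

# Counter columns can only be changed with UPDATE ... SET c = c + ?.
# ASSUMPTION: "did_summary" is a placeholder table name, not taken from the question.
UPDATE_CQL = (
    "UPDATE did_summary SET duration = duration + ?, cost = cost + ? "
    "WHERE did = ? AND month = ? AND direction = ?"
)


def report_month(year, month):
    """First instant of the report month, like strptime('%Y%m') in the question."""
    return datetime.datetime(year, month, 1)


def counter_params(rows, month, ani_prefix, dest_prefix):
    """Yield two parameter tuples per CSV row, in UPDATE_CQL placeholder order.

    ASSUMPTION: rows are dicts keyed by 'ANI', 'DEST', 'DURATION' and 'COST'.
    """
    for row in rows:
        duration = int(float(row['DURATION']) * 10)
        cost = int(float(row['COST']) * 100000)
        yield (duration, cost, ani_prefix + row['ANI'], month, 'from')
        yield (duration, cost, dest_prefix + row['DEST'], month, 'to')


def chunks(iterable, size):
    """Split any iterable into lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk


def load(session, rows, month, ani_prefix, dest_prefix,
         concurrency=100, chunk_size=10000):
    """Run the counter updates concurrently; return the number of failed writes."""
    # Imported lazily so the helpers above work without the driver installed.
    from cassandra.concurrent import execute_concurrent_with_args

    prepared = session.prepare(UPDATE_CQL)
    failures = 0
    # Chunking keeps the per-call result list small for a 15M-row file.
    for chunk in chunks(counter_params(rows, month, ani_prefix, dest_prefix),
                        chunk_size):
        results = execute_concurrent_with_args(
            session, prepared, chunk,
            concurrency=concurrency, raise_on_first_error=False)
        failures += sum(1 for success, _ in results if not success)
    return failures
```

With a session from cassandra.cluster.Cluster(['127.0.0.1']).connect('processor_api'), something like load(session, csv.DictReader(csvfile), report_month(2016, 3), '1', '44') would stream the file through it (the date and country codes here are made up). The concurrency and chunk_size values are starting points to tune, not recommendations.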
