Question

I need to insert a large amount of data into Cassandra using the DataStax Python driver. Because of the volume, I can't use execute() requests; execute_async() is much faster.

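But I ran into data loss when calling execute_async(). If I use execute(), everything works. However, if I use execute_async() (for the SAME insert queries), only about 5-7% of my requests are executed correctly (and no errors are raised). If I add time.sleep(0.01) after every 1000 insert requests (still using execute_async()), everything is fine.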

No data loss (case 1):

for query in queries:
    session.execute( query )

No data loss (case 2):

import time

counter = 0
for query in queries:
    session.execute_async( query )
    counter += 1
    if counter % 1000 == 0:
        time.sleep( 0.01 )

Data loss:

for query in queries:
    session.execute_async( query )

Is there a reason for this?

The cluster has 2 nodes.

[cqlsh 5.0.1 | Cassandra 3.11.2 | CQL spec 3.4.4 | Native protocol v4]

DataStax Python driver version 3.14.0

Python 3.6

Answer 1

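Since execute_async is a non-blocking call, your code does not wait for each request to complete before continuing. The reason you probably see no data loss once you add the 10 ms sleep after every 1000 requests is that it leaves enough time for the requests to be processed before you read the data back.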

You need something in your code that waits for the requests to complete before you read the data back, i.e.:

futures = []
for query in queries:
    # execute_async returns a ResponseFuture immediately
    futures.append(session.execute_async(query))

for f in futures:
    f.result() # blocks until the query completes; raises if it failed

You may also want to evaluate execute_concurrent, which submits many queries and lets the driver manage the concurrency level for you.
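For reference, here is a rough sketch of what that could look like. The helper name insert_with_execute_concurrent and the concurrency value of 50 are my own placeholders, not something from the question; the queries are assumed to be plain CQL strings with no bind parameters.

```python
def insert_with_execute_concurrent(session, queries, concurrency=50):
    """Run all queries via the driver's execute_concurrent helper.

    Returns the queries that failed so the caller can retry or log them.
    The function name and default concurrency are illustrative assumptions.
    """
    # Imported here so this sketch can be loaded without the driver installed.
    from cassandra.concurrent import execute_concurrent

    queries = list(queries)
    # execute_concurrent expects (statement, parameters) pairs.
    statements = [(query, None) for query in queries]

    # Blocks until every statement has completed. With
    # raise_on_first_error=False, failures are reported in the results
    # instead of being raised.
    results = execute_concurrent(
        session, statements,
        concurrency=concurrency,
        raise_on_first_error=False,
    )

    # Results come back in the same order as the statements,
    # each as a (success, result_or_exception) pair.
    return [query for query, (success, _) in zip(queries, results) if not success]
```

Because the call only returns once all statements have finished, anything you read back afterwards should see the inserted rows, subject to your consistency level.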

Answer 2

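This is Cassandra's backpressure when a large number of requests are pending. You should set up a connection pool and limit the number of in-flight requests. You should also add a retry mechanism, triggered from the failure case in the future's callbacks.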
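One possible way to cap the number of in-flight requests is a semaphore released from the future's callbacks. This is only a sketch: throttled_insert and max_in_flight are names I made up, and the limit of 100 is an arbitrary starting point you would need to tune for your cluster. It relies only on session.execute_async and ResponseFuture.add_callbacks from the driver.

```python
import threading


def throttled_insert(session, queries, max_in_flight=100):
    """Submit queries with execute_async, keeping at most max_in_flight pending.

    Returns a list of (query, exception) pairs for requests that failed.
    The function name and default limit are illustrative assumptions.
    """
    slots = threading.BoundedSemaphore(max_in_flight)
    errors = []
    errors_lock = threading.Lock()

    def on_success(_rows):
        slots.release()  # free a slot for the next request

    def on_error(exc, query):
        with errors_lock:
            errors.append((query, exc))
        slots.release()

    for query in queries:
        slots.acquire()  # blocks while max_in_flight requests are pending
        future = session.execute_async(query)
        # The errback receives the exception first, then errback_args.
        future.add_callbacks(on_success, on_error, errback_args=(query,))

    # Wait for the tail: once every slot can be reacquired, nothing is pending.
    for _ in range(max_in_flight):
        slots.acquire()
    return errors
```

The returned failures can then be fed back into the same function as a simple retry pass, for example `throttled_insert(session, [q for q, _ in errors])`, ideally with a bounded number of attempts.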

Cassandra execute_async requests lose data