Connection timeout errors in Cassandra even after setting a sufficiently large timeout (50 seconds)

Date: 2019-01-16 10:38:57

Tags: python python-3.x cassandra

I wrote a Python script that reads rows from a CSV file and inserts them into Cassandra. It runs fine, but after a certain number of inserts it hits a timeout error.

    # let's do some batch inserts
    def insert_data(self):
        start_time = datetime.utcnow()
        destination = "/Users/aviralsrivastava/dev/learning_dask/10M_rows.csv"
        chunksize = 1000
        chunks = pd.read_csv(destination, chunksize=chunksize)
        chunk_counter = 0
        for df in chunks:
            df = df.to_dict(orient='records')
            chunk_counter += 1
            batch = BatchStatement()
            for row in df:
                key = str(row["0"])
                row = json.dumps(row, default=str)
                insert_sql = self.session.prepare(
                    "INSERT INTO {} ({}, {}, {}) VALUES (?,?,?)".format(
                        self.table_name, "id", "version", "row"
                    )
                )
                batch.add(insert_sql, (key, "version_1", row))
            self.session.execute(batch)
            self.log.info("One chunk's Batch Insert Completed")
            print(
                str(chunk_counter * chunksize) + " : " +
                str(datetime.utcnow() - start_time)
            )
            del batch
        print("Complete task's duration is: {}".format(
            datetime.utcnow() - start_time))

The code that establishes the connection is:

    def createsession(self):
        self.cluster = Cluster(['localhost'], connect_timeout=50)
        self.session = self.cluster.connect(self.keyspace)

The error is:

2019-01-16 15:58:49,013 [ERROR] cassandra.cluster: Error preparing query:
Traceback (most recent call last):
  File "cassandra/cluster.py", line 2402, in cassandra.cluster.Session.prepare
  File "cassandra/cluster.py", line 4062, in cassandra.cluster.ResponseFuture.result
cassandra.OperationTimedOut: errors={'127.0.0.1': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=127.0.0.1
Traceback (most recent call last):
  File "getting_started.py", line 107, in <module>
    example1.insert_data()
  File "getting_started.py", line 86, in insert_data
    self.table_name, "id", "version", "row"
  File "cassandra/cluster.py", line 2405, in cassandra.cluster.Session.prepare
  File "cassandra/cluster.py", line 2402, in cassandra.cluster.Session.prepare
  File "cassandra/cluster.py", line 4062, in cassandra.cluster.ResponseFuture.result
cassandra.OperationTimedOut: errors={'127.0.0.1': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=127.0.0.1

1 Answer:

Answer 0: (score: 2)

Your use of batches is what is killing Cassandra. Batches in Cassandra exist for a specific purpose and are not meant for submitting multiple records together (unless they all belong to the same partition); you can read about batch misuse in the documentation. A more efficient approach is to use prepared statements combined with asynchronous query execution via execute_async: the Getting Started section of the driver documentation has examples. That way each query goes to the node that holds the data for the given partition, and the coordinator node is not loaded the way it is with a batch.

The other mistake is that you are preparing the query inside the loop: do it once, before the first for loop, and then reuse the prepared statement inside the loop. You may also need to increase the number of in-flight requests per connection in order to saturate the network.
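Putting both fixes together (prepare once, then fan the inserts out with execute_async), a minimal sketch could look like the following. The function name `insert_rows`, the `max_in_flight` throttle, and passing `session`/`table_name` as plain parameters are illustrative choices for this sketch, not part of the original code:

```python
import json


def insert_rows(session, table_name, rows, max_in_flight=100):
    """Insert rows with a single prepared statement and async execution.

    session.prepare() is a network round-trip, so it must be called
    once, outside the loop; calling it per row is what caused the
    timeouts in the question.
    """
    prepared = session.prepare(
        "INSERT INTO {} (id, version, row) VALUES (?, ?, ?)".format(table_name)
    )

    futures = []
    for row in rows:
        key = str(row["0"])
        payload = json.dumps(row, default=str)
        futures.append(
            session.execute_async(prepared, (key, "version_1", payload))
        )
        # Simple throttle: drain outstanding futures before queuing more,
        # so we do not exceed the in-flight request limit per connection.
        if len(futures) >= max_in_flight:
            for f in futures:
                f.result()
            futures = []

    # Wait for any remaining requests; f.result() re-raises failures.
    for f in futures:
        f.result()
```

With a real `cassandra.cluster.Session` this runs one round-trip to prepare, then keeps up to `max_in_flight` inserts in flight at a time, each routed to the replica owning that partition key.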

P.S. I answered the same question yesterday, only for Java.