I wrote a Python script that reads rows from a CSV file and inserts them into Cassandra. It works fine, but after a certain number of inserts it starts hitting timeout errors.
# lets do some batch insert
def insert_data(self):
    start_time = datetime.utcnow()
    destination = "/Users/aviralsrivastava/dev/learning_dask/10M_rows.csv"
    chunksize = 1000
    chunks = pd.read_csv(destination, chunksize=chunksize)
    chunk_counter = 0
    for df in chunks:
        df = df.to_dict(orient='records')
        chunk_counter += 1
        batch = BatchStatement()
        for row in df:
            key = str(row["0"])
            row = json.dumps(row, default=str)
            insert_sql = self.session.prepare(
                (
                    "INSERT INTO {} ({}, {}, {}) VALUES (?,?,?)"
                ).format(
                    self.table_name, "id", "version", "row"
                )
            )
            batch.add(insert_sql, (key, "version_1", row))
        self.session.execute(batch)
        self.log.info("One chunk's Batch Insert Completed")
        print(
            str(chunk_counter*chunksize) + " : " +
            str(datetime.utcnow() - start_time)
        )
        del batch
    print("Complete task's duration is: {}".format(
        datetime.utcnow() - start_time))
The code that establishes the connection is:
def createsession(self):
    self.cluster = Cluster(['localhost'], connect_timeout=50)
    self.session = self.cluster.connect(self.keyspace)
The error is:
2019-01-16 15:58:49,013 [ERROR] cassandra.cluster: Error preparing query:
Traceback (most recent call last):
File "cassandra/cluster.py", line 2402, in cassandra.cluster.Session.prepare
File "cassandra/cluster.py", line 4062, in cassandra.cluster.ResponseFuture.result
cassandra.OperationTimedOut: errors={'127.0.0.1': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=127.0.0.1
Traceback (most recent call last):
File "getting_started.py", line 107, in <module>
example1.insert_data()
File "getting_started.py", line 86, in insert_data
self.table_name, "id", "version", "row"
File "cassandra/cluster.py", line 2405, in cassandra.cluster.Session.prepare
File "cassandra/cluster.py", line 2402, in cassandra.cluster.Session.prepare
File "cassandra/cluster.py", line 4062, in cassandra.cluster.ResponseFuture.result
cassandra.OperationTimedOut: errors={'127.0.0.1': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=127.0.0.1
Answer 0 (score: 2)
Using batches this way is what is killing Cassandra. Batches in Cassandra exist for a specific purpose, not for submitting many unrelated records together (unless they all belong to the same partition); you can read about batch misuse in the documentation. A more efficient approach is to use prepared statements together with asynchronous query execution via execute_async: the Getting Started section of the driver documentation has examples. With that approach each query goes directly to the node that owns the data for that particular partition, and the coordinator node is not overloaded the way it is with batches.
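As a rough sketch of that pattern (the table and column names come from the question; the surrounding variables such as `rows` are assumptions for illustration), each row becomes its own asynchronous request against a single prepared statement:

# Sketch only: assumes self.session and self.table_name from the question,
# and that `rows` is an iterable of (id, json_string) pairs built per chunk.
insert_ps = self.session.prepare(
    "INSERT INTO {} (id, version, row) VALUES (?, ?, ?)".format(self.table_name)
)
futures = [
    self.session.execute_async(insert_ps, (key, "version_1", payload))
    for key, payload in rows
]
for future in futures:
    future.result()  # block here so per-row errors surface before the next chunk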
The other mistake is that you prepare the query inside the loop: prepare it once, before the first for loop, and then reuse the prepared statement inside the loop. You may also need to increase the number of in-flight requests per connection in order to saturate the network.
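Putting both points together, a minimal sketch of a rewritten insert_data could look like this. It keeps the CSV chunking from the question, prepares the statement once, and uses the driver's execute_concurrent_with_args helper (which drives execute_async under the hood) to keep a bounded number of requests in flight. The concurrency value of 100 is an assumption to tune for your cluster, not a recommendation.

import json
import pandas as pd
from cassandra.concurrent import execute_concurrent_with_args

def insert_data(self):
    # Prepare once, outside of any loop.
    insert_ps = self.session.prepare(
        "INSERT INTO {} (id, version, row) VALUES (?, ?, ?)".format(self.table_name)
    )
    chunks = pd.read_csv(
        "/Users/aviralsrivastava/dev/learning_dask/10M_rows.csv", chunksize=1000
    )
    for df in chunks:
        params = [
            (str(row["0"]), "version_1", json.dumps(row, default=str))
            for row in df.to_dict(orient='records')
        ]
        # Runs the inserts asynchronously, at most `concurrency` in flight at once.
        execute_concurrent_with_args(self.session, insert_ps, params, concurrency=100)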
P.S. I answered the same question yesterday, just for Java.