I wrote a Python script that reads rows from a CSV file and inserts them into Cassandra. It works fine, but after a certain number of inserts it starts hitting timeout errors.
# lets do some batch insert
def insert_data(self):
    start_time = datetime.utcnow()
    destination = "/Users/aviralsrivastava/dev/learning_dask/10M_rows.csv"
    chunksize = 1000
    chunks = pd.read_csv(destination, chunksize=chunksize)
    chunk_counter = 0
    for df in chunks:
        df = df.to_dict(orient='records')
        chunk_counter += 1
        batch = BatchStatement()
        for row in df:
            key = str(row["0"])
            row = json.dumps(row, default=str)
            insert_sql = self.session.prepare(
                (
                    "INSERT INTO {} ({}, {}, {}) VALUES (?,?,?)"
                ).format(
                    self.table_name, "id", "version", "row"
                )
            )
            batch.add(insert_sql, (key, "version_1", row))
        self.session.execute(batch)
        self.log.info("One chunk's Batch Insert Completed")
        print(
            str(chunk_counter*chunksize) + " : " +
            str(datetime.utcnow() - start_time)
        )
        del batch
    print("Complete task's duration is: {}".format(
        datetime.utcnow() - start_time))
The code that establishes the connection is:
def createsession(self):
    self.cluster = Cluster(['localhost'], connect_timeout=50)
    self.session = self.cluster.connect(self.keyspace)
The error is:
2019-01-16 15:58:49,013 [ERROR] cassandra.cluster: Error preparing query:
Traceback (most recent call last):
File "cassandra/cluster.py", line 2402, in cassandra.cluster.Session.prepare
File "cassandra/cluster.py", line 4062, in cassandra.cluster.ResponseFuture.result
cassandra.OperationTimedOut: errors={'127.0.0.1': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=127.0.0.1
Traceback (most recent call last):
File "getting_started.py", line 107, in <module>
example1.insert_data()
File "getting_started.py", line 86, in insert_data
self.table_name, "id", "version", "row"
File "cassandra/cluster.py", line 2405, in cassandra.cluster.Session.prepare
File "cassandra/cluster.py", line 2402, in cassandra.cluster.Session.prepare
File "cassandra/cluster.py", line 4062, in cassandra.cluster.ResponseFuture.result
cassandra.OperationTimedOut: errors={'127.0.0.1': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=127.0.0.1
Answer 0 (score: 2)
Using batches this way is what is killing Cassandra. Batches in Cassandra exist for a specific purpose, not for submitting many unrelated records together (unless they all belong to the same partition); you can read about batch misuse in the documentation. A more efficient approach is to use prepared statements together with asynchronous query execution via execute_async: the Getting Started section of the driver documentation has examples. With that approach each query goes directly to the node that owns the data for that particular partition, and the coordinator node is not overloaded the way it is with batches.
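As a rough sketch of that pattern (the table and column names come from the question; the surrounding variables such as `rows` are assumptions for illustration), each row becomes its own asynchronous request against a single prepared statement:

# Sketch only: assumes self.session and self.table_name from the question,
# and that `rows` is an iterable of (id, json_string) pairs built per chunk.
insert_ps = self.session.prepare(
    "INSERT INTO {} (id, version, row) VALUES (?, ?, ?)".format(self.table_name)
)
futures = [
    self.session.execute_async(insert_ps, (key, "version_1", payload))
    for key, payload in rows
]
for future in futures:
    future.result()  # block here so per-row errors surface before the next chunk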
The other mistake is that you prepare the query inside the loop: prepare it once, before the first for loop, and then reuse the prepared statement inside the loop. You may also need to increase the number of in-flight requests per connection in order to saturate the network.
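Putting both points together, a minimal sketch of a rewritten insert_data could look like this. It keeps the CSV chunking from the question, prepares the statement once, and uses the driver's execute_concurrent_with_args helper (which drives execute_async under the hood) to keep a bounded number of requests in flight. The concurrency value of 100 is an assumption to tune for your cluster, not a recommendation.

import json
import pandas as pd
from cassandra.concurrent import execute_concurrent_with_args

def insert_data(self):
    # Prepare once, outside of any loop.
    insert_ps = self.session.prepare(
        "INSERT INTO {} (id, version, row) VALUES (?, ?, ?)".format(self.table_name)
    )
    chunks = pd.read_csv(
        "/Users/aviralsrivastava/dev/learning_dask/10M_rows.csv", chunksize=1000
    )
    for df in chunks:
        params = [
            (str(row["0"]), "version_1", json.dumps(row, default=str))
            for row in df.to_dict(orient='records')
        ]
        # Runs the inserts asynchronously, at most `concurrency` in flight at once.
        execute_concurrent_with_args(self.session, insert_ps, params, concurrency=100)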
P.S. I answered the same question yesterday, just for Java.