Question

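I am preparing my data (52 GB) so that I can run some range queries on my local machine.

My data is in BSON files. I am converting it to a Spark RDD / DataFrame and writing it to Cassandra for fast querying.

The data has no natural column to select ranges on, so I added a column (idx) to the DataFrame that is made unique by calling monotonically_increasing_id(), and then wrote it to Cassandra.

But Cassandra is rewriting the idx values to something very large.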

    from pyspark.sql.functions import monotonically_increasing_id

    # Add a unique (but not necessarily consecutive) id column
    train_df = train_df.withColumn("idx", monotonically_increasing_id())

    try:
        # Example schema: CREATE TABLE t (pk int, t int, v text, s text, PRIMARY KEY (pk, t));
        create_table = "CREATE TABLE train (idx BIGINT, cid BIGINT, img BLOB, PRIMARY KEY (idx, cid));"
        session.execute(create_table)
    except Exception:
        print("create table train failed")

    # Write the DataFrame to Cassandra through the Spark Cassandra connector
    train_df.write \
        .format("org.apache.spark.sql.cassandra") \
        .mode("append") \
        .option("table", "train") \
        .option("keyspace", "komal") \
        .save()

Any query for idx above 5000 returns an empty list:

    query = "select * from train where idx > 5000 and idx <= 6000 ALLOW FILTERING;"
    result = session.execute(query, timeout=50000000)

    result.current_rows
    []

Can someone help me figure out how to add a unique column in Cassandra so that I can run range queries?

Answer 1

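You are trying to select a range of partition keys (in your case, idx is the partition key). That is not how Cassandra is meant to be used, because the partition key "selects" the node where Cassandra actually stores the data. Your query has to scan every node in the cluster, which can be very slow.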
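To make the point about partition keys concrete, here is a toy sketch in plain Python. It does not use Cassandra's real partitioner; the MD5 hash, the function name toy_node_for, and the four-node cluster are all assumptions chosen only for illustration. The idea it shows is that consecutive idx values are hashed independently, so a range of them does not map to one contiguous place in the cluster.

```python
import hashlib

def toy_node_for(partition_key: int, num_nodes: int = 4) -> int:
    """Toy stand-in for a partitioner (NOT Cassandra's real one):
    hash the partition key and map the hash onto one of num_nodes nodes."""
    digest = hashlib.md5(str(partition_key).encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return token % num_nodes

# Where would ten consecutive idx values land in this toy cluster?
placement = {idx: toy_node_for(idx) for idx in range(5001, 5011)}
nodes_touched = set(placement.values())

for idx, node in placement.items():
    print(f"idx={idx} -> node {node}")
print(f"nodes touched by this small range: {sorted(nodes_touched)}")
```

Because placement depends only on the hash of each key, the order of idx values tells the cluster nothing about where the rows live, which is why a range predicate on the partition key needs ALLOW FILTERING and a full scan.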

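If you need range queries, you can run them efficiently inside a partition. In your commented-out example, t is a clustering column (cid plays that role in your train table), and it defines the order of all entries inside a partition (here keyed by idx). Data is stored sorted on disk (hence SSTables = sorted string tables), so range queries on a clustering column are efficient.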
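One way to apply this, sketched under assumptions: make a coarse bucket the partition key and keep idx as a clustering column, so a range over idx is served from within a partition. The table name train_bucketed, the bucket column, and BUCKET_SIZE are hypothetical and not from the question. The sketch also assumes idx is a contiguous 0-based row number (for example from Spark's RDD.zipWithIndex()); monotonically_increasing_id() is documented to return unique and increasing but not consecutive values, with the Spark partition ID in the upper bits, which would explain the very large idx values seen in the question.

```python
# Hypothetical schema: bucket is the partition key, idx the clustering column.
CREATE_TABLE = (
    "CREATE TABLE train_bucketed ("
    "bucket INT, idx BIGINT, cid BIGINT, img BLOB, "
    "PRIMARY KEY (bucket, idx));"
)

BUCKET_SIZE = 10_000  # assumption: rows per partition, tune for your data

def bucket_of(idx: int) -> int:
    """Map a contiguous row index to its partition bucket."""
    return idx // BUCKET_SIZE

def range_queries(lo: int, hi: int):
    """Return (cql, params) pairs covering lo < idx <= hi, one per bucket.

    Each query names a single partition (bucket = ...) and ranges over the
    clustering column idx, so no ALLOW FILTERING is needed.
    """
    cql = ("SELECT * FROM train_bucketed "
           "WHERE bucket = %s AND idx > %s AND idx <= %s")
    first, last = bucket_of(lo + 1), bucket_of(hi)
    return [(cql, (b, lo, hi)) for b in range(first, last + 1)]

# With the question's `session`, each pair could be run as:
#   for cql, params in range_queries(5000, 6000):
#       rows = session.execute(cql, params)
for cql, params in range_queries(9_500, 10_500):
    print(cql, params)
```

The bucket value would have to be written alongside each row (for example as idx // BUCKET_SIZE when building the DataFrame). A smaller BUCKET_SIZE spreads data over more partitions, at the cost of more queries per range.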

Cassandra is not storing my data as expected
