How to save a Spark DataFrame to a partitioned Cassandra table

Asked: 2017-06-13 20:58:54

Tags: apache-spark cassandra

I have a partitioned Cassandra table, and I currently write to it like this:

    myDF.distinct().write
        .cassandraFormat(keyspace = "test", table = "details", cluster = "cluster")
        .mode(SaveMode.Append)
        .save()

I am using Scala 2.11.8, Spark 2.0, and Cassandra. The table is partitioned by a 'date' column, with additional clustering columns. In that case, how do I save the DataFrame into this table? Is there a Scala example of the API I need to use?

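For illustration, a date-partitioned table such as the one described might be defined as follows. The table and column names here are assumptions for the sketch, not taken from the original post:

```sql
-- Hypothetical CQL schema: 'date' is the partition key and
-- 'id' a clustering column; adjust names to match the real table.
CREATE TABLE test.details (
    date    date,
    id      int,
    payload text,
    PRIMARY KEY ((date), id)
);
```

With a schema like this, the `write` shown above needs no extra partitioning arguments: the connector routes each row to the correct partition based on the value in its 'date' column.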

This write would happen for every micro-batch in a streaming application, in case that matters when choosing a performance-oriented API.

1 Answer:

Answer 0 (score: 3):

The Spark Cassandra Connector partitions and batches writes automatically; the end user does not need to do anything special. See

Basic overview of how writes happen

or, for more detail, this tuning overview.
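To make the batching behavior concrete: before flushing, the connector groups rows bound for the same partition key into one batch. A minimal, stdlib-only Scala sketch of that grouping idea, using a hypothetical row type (the `Detail` case class and its fields are assumptions for illustration, not the connector's internals):

```scala
// Hypothetical row shape: a 'date' partition key plus a clustering id.
case class Detail(date: String, id: Int, payload: String)

// Sketch of the grouping step: bucket rows by their partition key,
// mirroring how the connector gathers same-partition rows into batches.
def groupByPartitionKey(rows: Seq[Detail]): Map[String, Seq[Detail]] =
  rows.groupBy(_.date)

val rows = Seq(
  Detail("2017-06-13", 1, "a"),
  Detail("2017-06-13", 2, "b"),
  Detail("2017-06-12", 1, "c")
)

// Two buckets: one per distinct 'date' partition value.
val batches = groupByPartitionKey(rows)
```

This is only the grouping concept; in practice the connector also caps batches by size and row count, which the tuning overview linked above describes.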