I want to read two columns from a database, group them by the first column, and then insert the result into another table using Spark. My program is written in Java. I tried the following:
public static void aggregateSessionEvents(org.apache.spark.SparkContext sparkContext) {
    com.datastax.spark.connector.japi.rdd.CassandraJavaPairRDD<String, String> logs = javaFunctions(sparkContext)
            .cassandraTable("dove", "event_log", mapColumnTo(String.class), mapColumnTo(String.class))
            .select("session_id", "event");
    logs.groupByKey();
    com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions(logs).writerBuilder("dove", "event_aggregation", null).saveToCassandra();
    sparkContext.stop();
}

This gives me the error:

The method cassandraTable(String, String, RowReaderFactory<T>) in the type SparkContextJavaFunctions is not applicable for the arguments (String, String, RowReaderFactory<String>, mapColumnTo(String.class))

My dependencies are:
How can I fix this?
Answer 0 (score: 1)
Change this:
.cassandraTable("dove", "event_log", mapColumnTo(String.class), mapColumnTo(String.class))
to:
.cassandraTable("dove", "event_log", mapColumnTo(String.class))
You are passing an extra argument.
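Note that with a single mapColumnTo factory the read yields an RDD of one column, while groupByKey needs a key/value pair RDD. Below is a minimal sketch of one way to build that pair RDD from the two columns using the connector's plain CassandraRow API; the table and column names are taken from the question, and groupEventsBySession is just an illustrative name:

import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

// Read the two columns as CassandraRow objects and key the RDD by session_id.
public static JavaPairRDD<String, Iterable<String>> groupEventsBySession(SparkContext sparkContext) {
    JavaPairRDD<String, String> logs = javaFunctions(sparkContext)
            .cassandraTable("dove", "event_log")
            .select("session_id", "event")
            .mapToPair(row -> new Tuple2<>(row.getString("session_id"), row.getString("event")));
    // One entry per session_id, carrying all of its events.
    return logs.groupByKey();
}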
Answer 1 (score: 0)
To group the data by a field, do the following. Afterwards, the grouped data can be inserted into another table or used for further processing:
public static void aggregateSessionEvents(SparkContext sparkContext) {
    // Read the table into an RDD of Data beans (one bean per row).
    JavaRDD<Data> datas = javaFunctions(sparkContext).cassandraTable("test", "data",
            mapRowTo(Data.class));
    // Turn each bean into a (key, value) pair.
    JavaPairRDD<String, String> pairDatas = datas
            .mapToPair(data -> new Tuple2<>(data.getKey(), data.getValue()));
    // reduceByKey returns a new RDD; keep it, since the original pairDatas is not modified.
    JavaPairRDD<String, String> grouped = pairDatas.reduceByKey((value1, value2) -> value1 + "," + value2);
    sparkContext.stop();
}
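The method above stops before the write step. As a rough sketch of how the reduced pairs could be saved to another table, assuming a hypothetical EventAggregation bean whose fields line up with the target table's columns, a hypothetical saveAggregation helper in the same class, and placeholder keyspace/table names:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import java.io.Serializable;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

// Hypothetical bean; its properties are mapped to the target table's columns.
public class EventAggregation implements Serializable {
    private String key;
    private String values;
    public EventAggregation() {}
    public EventAggregation(String key, String values) {
        this.key = key;
        this.values = values;
    }
    public String getKey() { return key; }
    public void setKey(String key) { this.key = key; }
    public String getValues() { return values; }
    public void setValues(String values) { this.values = values; }
}

// Hypothetical helper: map each reduced (key, value) pair to the bean and write it out.
public static void saveAggregation(JavaPairRDD<String, String> grouped) {
    JavaRDD<EventAggregation> rows = grouped
            .map(tuple -> new EventAggregation(tuple._1(), tuple._2()));
    javaFunctions(rows)
            .writerBuilder("test", "data_aggregation", mapToRow(EventAggregation.class))
            .saveToCassandra();
}

With this sketch, saveAggregation(grouped) would be called before sparkContext.stop().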