Question

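I need to get the values of two columns from a DataFrame that has been converted to an RDD.

The first solution I came up with is:

First, convert the RDD to a list of rows with RDD.collect().
Then, for each element of the list, call Row[i].getInt(column_index).

This solution works for small and medium-sized data. With larger data, however, I run out of memory.

My temporary workaround is to create a new RDD that contains only the two columns instead of all of them, and then apply the solution above. This should cut most of the memory required.

The current implementation looks like this: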

Row[] rows = sparkDataFrame.collect();
for (int i = 0; i < rows.length; i++) { //about 50 million rows
  int yTrue = rows[i].getInt(0);
  int yPredict = rows[i].getInt(1);
}

Can you help me improve my solution, or suggest another one?

Thanks!

PS: I'm a new Spark user!

Answer 1

First, convert the large RDD to a DataFrame, then select the required columns directly.

// Create the DataFrame
DataFrame df = sqlContext.jsonFile("examples/src/main/resources/people.json");

// Select only the "name" and "age" columns
df.select(df.col("name"), df.col("age")).show();

For details, see this link.
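Selecting two columns shrinks each row, but collect() still copies every row to the driver at once. If the loop only needs to see each (yTrue, yPredict) pair once, it may be enough to walk an iterator and keep a few counters. Below is a minimal sketch in plain Java, so it runs without a cluster. PairStats and fromIterator are hypothetical names, not part of Spark, each row is modeled as an int[] of {yTrue, yPredict}, and counting matches is only an assumed goal, since the question does not say what the loop computes. With Spark, the iterator could come from the RDD's toLocalIterator(), which fetches one partition at a time, with Row.getInt(0) and Row.getInt(1) in place of the array reads.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper (not part of Spark): consumes (yTrue, yPredict) pairs one at a time.
final class PairStats {
    long total;    // number of pairs seen
    long matches;  // number of pairs where yTrue == yPredict

    // Walks the iterator once; only the two counters stay in memory.
    static PairStats fromIterator(Iterator<int[]> pairs) {
        PairStats stats = new PairStats();
        while (pairs.hasNext()) {
            int[] pair = pairs.next();   // pair[0] = yTrue, pair[1] = yPredict
            stats.total++;
            if (pair[0] == pair[1]) {
                stats.matches++;
            }
        }
        return stats;
    }

    // Fraction of pairs where the prediction equals the label (0.0 for empty input).
    double accuracy() {
        return total == 0 ? 0.0 : (double) matches / total;
    }

    public static void main(String[] args) {
        // Stand-in data; with Spark the iterator would be fed from the cluster instead.
        List<int[]> sample = Arrays.asList(
            new int[] {1, 1}, new int[] {0, 1}, new int[] {0, 0}, new int[] {1, 0});
        PairStats stats = PairStats.fromIterator(sample.iterator());
        System.out.println(stats.matches + " of " + stats.total + " pairs match");
    }
}
```

If the result is a simple aggregate like this one, computing it on the cluster and returning only the final numbers would likely avoid moving the 50 million rows to the driver at all.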

What is the most efficient way to get the elements of an RDD in Spark?
