Question

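I want to read and write Protocol Buffers (protobuf) messages on HDFS using Apache Spark. I have found the following suggested approaches:

1) Convert the protobuf messages to JSON with Google's Gson library, then read and write them with Spark SQL. This solution is described in this link, but I think the conversion to JSON is an extra step.

2) Convert to Parquet files. The parquet-mr and sparksql-protobuf GitHub projects support this approach, but I don't want to use Parquet files because I always process all columns (not a subset), so the Parquet format gives me no benefit (at least I think so).

3) ScalaPB. It may be exactly what I am looking for, but it is in Scala, which I know nothing about. I am looking for a Java-based solution. This youtube video introduces ScalaPB and explains how to use it (for Scala developers).

4) Sequence files. This is what I want, but I have found nothing about it. So my question is: how can I write protobuf messages to a sequence file on HDFS? Any other suggestions would be useful.

5) Twitter's Elephant-bird library.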

Answer 1

Although it is somewhat buried among the points, you seem to be asking how to write a sequence file in Spark. I found an example here.

// Import the Hadoop Writable types (NullWritable, Text, LongWritable, ...)
import org.apache.hadoop.io._

// To have sequence-file data to read, first see how to write it.
// Read the source data from a text file
val dataRDD = sc.textFile("/public/retail_db/orders")

// Use NullWritable as the key; the String value is stored as Text.
// Spark converts common types such as Int, Long, String and Array[Byte]
// to their Writable counterparts implicitly.
// Any other type has to be wrapped in a Writable explicitly,
// for example a custom class that implements Writable.
dataRDD.
  map(x => (NullWritable.get(), x)).
  saveAsSequenceFile("/user/`whoami`/orders_seq")
// Make sure to replace `whoami` with the appropriate OS user id

// Save a sequence file with a key of type Int and a value of type String.
// A different output path is used here because Spark will not overwrite
// an existing output directory.
dataRDD.
  map { x => val fields = x.split(","); (fields(0).toInt, fields(1)) }.
  saveAsSequenceFile("/user/`whoami`/orders_seq_kv")
// Make sure to replace `whoami` with the appropriate OS user id
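
To adapt this to protobuf, one option is to store each message's serialized bytes as a BytesWritable value and parse them back on read. The following is only a minimal sketch under some assumptions: it runs in spark-shell (so `sc` exists), protobuf-java 3.x is on the classpath, `StringValue` stands in for your own generated message class, and the path /tmp/protobuf_seq is a made-up example.

```scala
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import com.google.protobuf.StringValue // placeholder for your generated message class

// Hypothetical output location; replace with a real HDFS path
val path = "/tmp/protobuf_seq"

// Build a few protobuf messages on the executors
val messages = sc.parallelize(Seq("a", "b", "c")).
  map(s => StringValue.newBuilder().setValue(s).build())

// Write: serialize each message with toByteArray and wrap it in BytesWritable
messages.
  map(m => (NullWritable.get(), new BytesWritable(m.toByteArray))).
  saveAsSequenceFile(path)

// Read: Hadoop reuses Writable instances and BytesWritable's backing array
// may be padded, so copy the valid bytes with copyBytes() before parsing
val decoded = sc.
  sequenceFile(path, classOf[NullWritable], classOf[BytesWritable]).
  map { case (_, value) => StringValue.parseFrom(value.copyBytes()) }

decoded.map(_.getValue).collect().foreach(println)
```

With your own message type, only the builder call and the parseFrom call should need to change. From Java, JavaPairRDD.saveAsHadoopFile with SequenceFileOutputFormat should play the same role as saveAsSequenceFile, though I have not tested that variant here.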
