Using part of the first line of a text file as the key in an RDD

Time: 2016-10-15 09:00:56

Tags: scala apache-spark rdd

I have a dataset consisting of several folders named "01" through "15", and each folder contains files named "00-00.txt" through "23-59.txt" (each folder represents one day).

The files contain lines like the ones below (every entry starting with !AIVDM is its own line, except the first one, which begins with a number):

1443650400.010568 !AIVDM,1,1,,B,15NOHL0P00J@uq6>h8Jr6?vN2><,0*4B
!AIVDM,1,1,,A,4022051uvOFD>RG7kDCm1iW0088i,0*23
!AIVDM,1,1,,A,23aIhd@P1@PHRwPM<U@`OvN2><,0*4C
!AIVDM,1,1,,A,13n1mSgP00Pgq3TQpibh0?vL2><,0*74
!AIVDM,1,1,,B,177nPmw002:<Tn<gk1toGL60><,0*2B
!AIVDM,1,1,,B,139eu9gP00PugK:N2BOP0?vL2><,0*77
!AIVDM,1,1,,A,13bg8N0P000E2<BN15IKUOvN2><,0*34
!AIVDM,1,1,,B,14bL20003ReKodINRret28P0><,0*16
!AIVDM,1,1,,B,15SkVl001EPhf?VQ5SUTaCnH0><,0*00
!AIVDM,1,1,,A,14eG;ihP00G=4CvL=7qJmOvN0><,0*25
!AIVDM,1,1,,A,14eHMQ@000G<cKrL=6nJ9QfN2><,0*30

I want to end up with an RDD of key-value pairs where the long value 1443650400.010568 is the key and each line starting with !AIVDM... is a value. How can I achieve this?

1 Answer:

Answer 0 (score: 2):

Assuming each file is small enough to fit into a single RDD record (no more than 2GB), you can read each file into a single record using SparkContext.wholeTextFiles and then flatMap those records:

import org.apache.spark.rdd.RDD

// assuming data/ folder contains folders 00, 01, ..., 15
val result: RDD[(String, String)] = sc.wholeTextFiles("data/*").values.flatMap(file => {
  val lines = file.split("\n")
  val id = lines.head.split(" ").head // the timestamp at the start of the first line
  lines.tail.map((id, _))             // pair every remaining line with that timestamp
})
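For instance, a quick sanity check (illustrative only, assuming the data/ layout above) could print a few of the resulting pairs:

// print a handful of (key, line) pairs to verify the pairing
result.take(5).foreach { case (key, line) => println(s"$key -> $line") }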

Alternatively, if that assumption doesn't hold (an individual file might be large, i.e. hundreds of MB or more), you'll have to work a bit harder: load all the data into a single RDD, zip the data with an index, collect the "key" for each index, and then use those indices to find the correct key for each data line:

// read files and zip with index to later match each data line to its key
val raw: RDD[(String, Long)] = sc.textFile("data/*").zipWithIndex().cache()

// separate data from ID rows 
val dataRows: RDD[(String, Long)] = raw.filter(_._1.startsWith("!AIVDM"))
val idRows: RDD[(String, Long)] = raw.filter(!_._1.startsWith("!AIVDM"))

// collect a map of Index -> ID
val idForIndex = idRows.map { case (row, index) => (index, row.split(" ").head) }.collectAsMap()

// optimization: if idForIndex is very large - consider broadcasting it or not collecting it and using a join

// map each row to its key by looking up the MAXIMUM index which is < the row index
// in other words - find the LAST id record BEFORE the row
val result = dataRows.map { case (row, index) =>
  val key = idForIndex.filterKeys(_ < index).maxBy(_._1)._2
  (key, row)
}
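As a minimal sketch of the broadcast optimization mentioned in the comment above (assuming idForIndex still fits in driver memory, but is large enough that re-sending it with every task closure would be wasteful; idForIndexBc and resultBroadcast are illustrative names):

// ship the index -> ID map to executors once instead of with every task
val idForIndexBc = sc.broadcast(idForIndex)

val resultBroadcast = dataRows.map { case (row, index) =>
  // find the LAST id record BEFORE the row, as above
  val key = idForIndexBc.value.filterKeys(_ < index).maxBy(_._1)._2
  (key, row)
}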