I have a dataset containing several folders named "01" through "15". Each folder contains files named "00-00.txt" through "23-59.txt" (each folder represents one day).
The files contain lines like the ones below (every entry starting with !AIVDM is its own line, except the first entry, which starts with a number):
1443650400.010568 !AIVDM,1,1,,B,15NOHL0P00J@uq6>h8Jr6?vN2><,0*4B
!AIVDM,1,1,,A,4022051uvOFD>RG7kDCm1iW0088i,0*23
!AIVDM,1,1,,A,23aIhd@P1@PHRwPM<U@`OvN2><,0*4C
!AIVDM,1,1,,A,13n1mSgP00Pgq3TQpibh0?vL2><,0*74
!AIVDM,1,1,,B,177nPmw002:<Tn<gk1toGL60><,0*2B
!AIVDM,1,1,,B,139eu9gP00PugK:N2BOP0?vL2><,0*77
!AIVDM,1,1,,A,13bg8N0P000E2<BN15IKUOvN2><,0*34
!AIVDM,1,1,,B,14bL20003ReKodINRret28P0><,0*16
!AIVDM,1,1,,B,15SkVl001EPhf?VQ5SUTaCnH0><,0*00
!AIVDM,1,1,,A,14eG;ihP00G=4CvL=7qJmOvN0><,0*25
!AIVDM,1,1,,A,14eHMQ@000G<cKrL=6nJ9QfN2><,0*30
I would like to get an RDD of key-value pairs where the long value 1443650400.010568 is the key and the lines starting with !AIVDM... are the values. How can I achieve this?
Answer 0 (score: 2)
Assuming each file is small enough to fit into a single RDD record (i.e. no more than 2GB), you can read each file into a single record with SparkContext.wholeTextFiles and then flatMap those records:
import org.apache.spark.rdd.RDD

// assuming data/ folder contains folders 00, 01, ..., 15
val result: RDD[(String, String)] = sc.wholeTextFiles("data/*").values.flatMap(file => {
val lines = file.split("\n")
val id = lines.head.split(" ").head
lines.tail.map((id, _))
})
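If you want the key to be numeric rather than a String (the question asks for the value 1443650400.010568 as the key), a small follow-up conversion is enough. A minimal sketch building on the result above; numericKeys is just an illustrative name:
// convert the String key produced above into a Double timestamp
val numericKeys = result.map { case (ts, line) => (ts.toDouble, line) }
// quick sanity check on the driver
numericKeys.take(3).foreach(println)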
Alternatively, if that assumption doesn't hold (individual files might be large, i.e. hundreds of MB or more), you'll need to work a bit harder: load all the data into a single RDD, zip the data with an index, collect the "key" for each index, and then use those indices to find the correct key for each data line:
// read files and zip with index to later match each data line to its key
val raw: RDD[(String, Long)] = sc.textFile("data/*").zipWithIndex().cache()
// separate data from ID rows
val dataRows: RDD[(String, Long)] = raw.filter(_._1.startsWith("!AIVDM"))
val idRows: RDD[(String, Long)] = raw.filter(!_._1.startsWith("!AIVDM"))
// collect a map of index -> ID
val idForIndex = idRows.map { case (row, index) => (index, row.split(" ").head) }.collectAsMap()
// optimization: if idForIndex is very large - consider broadcasting it or not collecting it and using a join
// map each row to its key by looking up the MAXIMUM index which is < the row's index
// in other words - find the LAST id record BEFORE the row
val result = dataRows.map { case (row, index) =>
val key = idForIndex.filterKeys(_ < index).maxBy(_._1)._2
(key, row)
}
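On the optimization mentioned in the comment above: if idForIndex grows large, shipping it inside every task's closure gets expensive, and a broadcast variable is one way out. A minimal sketch, assuming the same dataRows and idForIndex as in the answer (idForIndexBc and resultBroadcast are illustrative names):
// ship the index -> ID map to the executors once as a broadcast variable
val idForIndexBc = sc.broadcast(idForIndex)
val resultBroadcast = dataRows.map { case (row, index) =>
  // same lookup as before, but reading the map through the broadcast handle
  val key = idForIndexBc.value.filterKeys(_ < index).maxBy(_._1)._2
  (key, row)
}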