I have a dataset containing several folders named "01" through "15". Each folder contains files named "00-00.txt" through "23-59.txt" (each folder represents one day).
The files contain lines like the ones below (every entry starting with !AIVDM is its own line, except the first entry, which starts with a number):
1443650400.010568 !AIVDM,1,1,,B,15NOHL0P00J@uq6>h8Jr6?vN2><,0*4B
!AIVDM,1,1,,A,4022051uvOFD>RG7kDCm1iW0088i,0*23
!AIVDM,1,1,,A,23aIhd@P1@PHRwPM<U@`OvN2><,0*4C
!AIVDM,1,1,,A,13n1mSgP00Pgq3TQpibh0?vL2><,0*74
!AIVDM,1,1,,B,177nPmw002:<Tn<gk1toGL60><,0*2B
!AIVDM,1,1,,B,139eu9gP00PugK:N2BOP0?vL2><,0*77
!AIVDM,1,1,,A,13bg8N0P000E2<BN15IKUOvN2><,0*34
!AIVDM,1,1,,B,14bL20003ReKodINRret28P0><,0*16
!AIVDM,1,1,,B,15SkVl001EPhf?VQ5SUTaCnH0><,0*00
!AIVDM,1,1,,A,14eG;ihP00G=4CvL=7qJmOvN0><,0*25
!AIVDM,1,1,,A,14eHMQ@000G<cKrL=6nJ9QfN2><,0*30
I would like to get an RDD of key-value pairs where the long value 1443650400.010568 is the key and the lines starting with !AIVDM... are the values. How can I achieve this?
Answer 0 (score: 2)
Assuming each file is small enough to fit into a single RDD record (i.e. no more than 2GB), you can read each file into a single record with SparkContext.wholeTextFiles and then flatMap those records:
import org.apache.spark.rdd.RDD

// assuming data/ folder contains folders 00, 01, ..., 15
val result: RDD[(String, String)] = sc.wholeTextFiles("data/*").values.flatMap(file => {
val lines = file.split("\n")
val id = lines.head.split(" ").head
lines.tail.map((id, _))
})
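If you want the key to be numeric rather than a String (the question asks for the value 1443650400.010568 as the key), a small follow-up conversion is enough. A minimal sketch building on the result above; numericKeys is just an illustrative name:
// convert the String key produced above into a Double timestamp
val numericKeys = result.map { case (ts, line) => (ts.toDouble, line) }
// quick sanity check on the driver
numericKeys.take(3).foreach(println)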
Alternatively, if that assumption doesn't hold (individual files might be large, i.e. hundreds of MB or more), you'll need to work a bit harder: load all the data into a single RDD, zip the data with an index, collect the "key" for each index, and then use those indices to find the correct key for each data line:
// read files and zip with index to later match each data line to its key
val raw: RDD[(String, Long)] = sc.textFile("data/*").zipWithIndex().cache()
// separate data from ID rows
val dataRows: RDD[(String, Long)] = raw.filter(_._1.startsWith("!AIVDM"))
val idRows: RDD[(String, Long)] = raw.filter(!_._1.startsWith("!AIVDM"))
// collect a map of index -> ID
val idForIndex = idRows.map { case (row, index) => (index, row.split(" ").head) }.collectAsMap()
// optimization: if idForIndex is very large - consider broadcasting it or not collecting it and using a join
// map each row to its key by looking up the MAXIMUM index which is < the row's index
// in other words - find the LAST id record BEFORE the row
val result = dataRows.map { case (row, index) =>
val key = idForIndex.filterKeys(_ < index).maxBy(_._1)._2
(key, row)
}
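On the optimization mentioned in the comment above: if idForIndex grows large, shipping it inside every task's closure gets expensive, and a broadcast variable is one way out. A minimal sketch, assuming the same dataRows and idForIndex as in the answer (idForIndexBc and resultBroadcast are illustrative names):
// ship the index -> ID map to the executors once as a broadcast variable
val idForIndexBc = sc.broadcast(idForIndex)
val resultBroadcast = dataRows.map { case (row, index) =>
  // same lookup as before, but reading the map through the broadcast handle
  val key = idForIndexBc.value.filterKeys(_ < index).maxBy(_._1)._2
  (key, row)
}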