I ran into a problem while trying to extract features from my raw data.
Here is my data:
25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,0
Here is my code:
val rawData = sc.textFile("data/myData.data")
val lines = rawData.map(_.split(","))
val categoriesMap = lines.map(fields => fields(1)).distinct.collect.zipWithIndex.toMap
Here is the error message:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1.0 (TID 3, localhost): java.lang.ArrayIndexOutOfBoundsException: 1
I want to extract the second column as a categorical feature, but it seems unable to read that column and throws an ArrayIndexOutOfBoundsException. I have tried many times and still can't solve the problem.
val categoriesMap1 = lines.map(fields => fields(1)).distinct.collect.zipWithIndex.toMap
val labelpointRDD = lines.map { fields =>
  val categoryFeaturesArray1 = Array.ofDim[Double](categoriesMap1.size)
  val categoryIdx1 = categoriesMap1(fields(1))
  categoryFeaturesArray1(categoryIdx1) = 1 }
Answer 0 (score: 1)
Your code works for the sample you provided, which means it works for "valid" rows, but your input probably also contains some invalid rows, i.e. rows without commas.
You can either clean the data, or improve the code to handle such rows more gracefully, for example by using some default value for bad rows:
val rawData = sc.parallelize(Seq(
  "25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,0",
  "BAD LINE"
))
val lines = rawData.map(_.split(","))
val categoriesMap = lines.map {
  case Array(_, s, _*) => s // arrays with 2 or more items: use the 2nd
  case _ => "UNKNOWN"       // default for bad rows
}.distinct().collect().zipWithIndex.toMap

println(categoriesMap) // e.g. Map(UNKNOWN -> 0, Private -> 1); index order may vary
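The other route mentioned above, cleaning the data first, can be sketched as a length filter before any column access. This is a minimal sketch, assuming valid rows always have exactly 15 comma-separated fields (taken from the sample row); it is shown on a plain Scala Seq so it runs without a SparkContext, and the same `filter` call works unchanged on the RDD:

```scala
// Sketch: drop malformed rows up front so fields(1) is always safe to read.
// Assumption: every valid row has exactly 15 comma-separated fields.
val raw = Seq(
  "25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,0",
  "BAD LINE"
)
val lines = raw.map(_.split(","))
val cleanLines = lines.filter(_.length == 15) // "BAD LINE" splits to 1 field and is dropped
val categoriesMap = cleanLines.map(fields => fields(1)).distinct.zipWithIndex.toMap
println(categoriesMap) // Map(Private -> 0)
```

This trades silent data loss for safety: bad rows simply disappear, so it is worth counting how many rows the filter removes before relying on it.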
Update: per the updated question, assuming these rows are indeed invalid, you can skip them entirely, both when extracting the category map and when mapping to labeled points:
import org.apache.spark.rdd.RDD

val secondColumn: RDD[String] = lines.collect {
  case Array(_, s, _*) => s // arrays with 2 or more items: use the 2nd
  // shorter arrays (bad records) are filtered out
}
val categoriesMap = secondColumn.distinct().collect().zipWithIndex.toMap
val labelpointRDD = secondColumn.map { field =>
  val categoryFeaturesArray1 = Array.ofDim[Double](categoriesMap.size)
  val categoryIdx1 = categoriesMap(field)
  categoryFeaturesArray1(categoryIdx1) = 1
  categoryFeaturesArray1
}
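The name labelpointRDD suggests the end goal is MLlib's LabeledPoint rather than a bare one-hot array. A hedged sketch of that last step, assuming the final column (index 14 in the sample row) is a 0/1 label; that column layout is an assumption read off the sample row, not something the question confirms:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Keep the whole valid row, so both the category (index 1) and the
// assumed label column (index 14) are available.
val validRows = lines.collect { case arr if arr.length == 15 => arr }

val labeledPoints = validRows.map { fields =>
  val features = Array.ofDim[Double](categoriesMap.size)
  features(categoriesMap(fields(1))) = 1.0 // one-hot encode the 2nd column
  LabeledPoint(fields(14).toDouble, Vectors.dense(features))
}
```

With all 15 fields in hand you can also mix the one-hot slots with the numeric columns before calling Vectors.dense, which is the usual next step for this dataset.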