Merging files

Date: 2017-03-09 09:25:07

Tags: scala apache-spark rdd

I am new to Scala. I have two RDDs and I need to separate my training and test data. One file contains all of the data, and the other contains only the test data. I need to remove the test rows from the complete data set.

The complete data file has the format (userID, MovID, Rating, Timestamp):

res8: Array[String] = Array(1, 31, 2.5, 1260759144)

The test data file has the format (userID, MovID):

res10: Array[String] = Array(1, 1172)

How can I generate ratings_train so that it contains no rows that match the test data set? I used the following function, but the returned list comes out empty:

  def create_training(data: RDD[String], ratings_test: RDD[String]): ListBuffer[Array[String]] = {
    val ratings_split = dropheader(data).map(line => line.split(","))
    val ratings_testing = dropheader(ratings_test).map(line => line.split(",")).collect()
    var ratings_train = new ListBuffer[Array[String]]()
    ratings_split.foreach(x => {
      ratings_testing.foreach(y => {
        if (x(0) != y(0) || x(1) != y(1)) {
          ratings_train += x
        }
      })
    })
    return ratings_train
  }

Edit: I changed the code but ran into memory issues.

1 answer:

Answer 0 (score: 0)

This might work.

def create_training(data: RDD[String], ratings_test: RDD[String]): RDD[Array[String]] = {
  val ratings_split = dropheader(data).map(line => line.split(","))
  // Collect the test keys locally so they can be used inside the filter closure
  val ratings_testing = dropheader(ratings_test).map(line => line.split(",")).collect()

  ratings_split.filter(x => {
    // Keep a row only if it matches none of the test rows
    ratings_testing.exists(y =>
      (x(0) == y(0) && x(1) == y(1))
    ) == false
  })
}
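For example, assuming `data` and `ratings_test` were loaded with `sc.textFile` (the file names below are placeholders) and `dropheader` strips the header line as in the question, the corrected function could be used like this:

    val data = sc.textFile("ratings.csv")              // full data set
    val ratings_test = sc.textFile("ratings_test.csv") // test (userID, MovID) pairs

    val ratings_train = create_training(data, ratings_test)
    ratings_train.take(5).foreach(row => println(row.mkString(",")))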
  1. The code snippet you posted is not logically correct. A row belongs in the final data only if it does not appear in the test data. In your code, however, a row is kept as soon as it fails to match some single test row; you should instead check that it matches none of the test rows before deciding it is a valid row.
  2. You are using RDDs, but not exploiting their full power. I am guessing you are reading the input from a csv file. You can then structure your data with DataFrames instead of splitting the strings on commas and handling them as rows by hand. Have a look at Spark's DataFrame API; these links may help: https://www.tutorialspoint.com/spark_sql/spark_sql_dataframes.htm and http://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes (a minimal sketch of that approach is included after the regex example below).
  3. Using a regex:

      def main(args: Array[String]): Unit = {
        // creating test data set
        val data = spark.sparkContext.parallelize(Seq(
          //      "userID, MovID, Rating, Timestamp",
          "1, 31, 2.5, 1260759144",
          "2, 31, 2.5, 1260759144"))
    
        val ratings_test = spark.sparkContext.parallelize(Seq(
          //      "userID, MovID",
          "1, 31",
          "2, 30",
          "30, 2"
        ))
    
        val result = getData(data, ratings_test).collect()
        // the result will only contain "2, 31, 2.5, 1260759144"
      }
    
      def getData(data: RDD[String], ratings_test: RDD[String]): RDD[String] = {
        val ratings = dropheader(data)
        val ratings_testing = dropheader(ratings_test)
    
        // Broadcast the test rating data to all Spark nodes, since we collect it beforehand.
        // We collect the test data here so that we do not have to call collect inside the filter logic.
        val ratings_testing_bc = spark.sparkContext.broadcast(ratings_testing.collect.toSet)
    
        ratings.filter(rating => {
          ratings_testing_bc.value.exists(testRating => regexMatch(rating, testRating)) == false
        })
      }
    
      def regexMatch(data: String, testData: String): Boolean = {
        // Regular expression to extract the first two columns
        val regex = """^([^,]*), ([^,\r\n]*),?""".r

        (regex.findFirstIn(data), regex.findFirstIn(testData)) match {
          case (Some(regex(dataCol1, dataCol2)), Some(regex(testDataCol1, testDataCol2))) =>
            // Compare the userID and MovID of the data row against the test row
            (dataCol1 == testDataCol1) && (dataCol2 == testDataCol2)
          case _ =>
            // A line that does not parse never matches a test row
            false
        }
      }
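
For completeness, here is a minimal sketch of the DataFrame approach mentioned in point 2. It assumes a SparkSession named `spark`, that both csv files still carry a header row, and that the file paths and column names below are placeholders you would replace with your own:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("train-test-split").getOrCreate()

    // Read both files and name the columns explicitly (placeholder names).
    val ratings = spark.read.option("header", "true").csv("ratings.csv")
      .toDF("userID", "MovID", "Rating", "Timestamp")
    val test = spark.read.option("header", "true").csv("ratings_test.csv")
      .toDF("userID", "MovID")

    // A left_anti join keeps only the ratings whose (userID, MovID) pair
    // does not appear in the test set, i.e. the training split.
    val ratings_train = ratings.join(test, Seq("userID", "MovID"), "left_anti")

    ratings_train.show()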