How to get data from multiple parquet files at once?

Date: 2018-12-24 15:19:43

Tags: scala apache-spark

Since I'm new to the Spark framework, I need your help.

I have a folder that contains a lot of parquet files. The names of these files share the same format, DD-MM-YYYY, for example '01-10-2018', '02-10-2018', '03-10-2018', and so on.

My application has two input parameters: dateFrom and dateTo.

When I try the following code, the application hangs. It seems Spark scans every file in the folder:

val mf = spark.read.parquet("/PATH_TO_THE_FOLDER/*")
         .filter($"DATE".between(dateFrom + " 00:00:00", dateTo + " 23:59:59"))
mf.show()

I need to pull the data for the requested period as fast as possible.

I think it would be better to split the period into days, read each day's files separately, and then union them, like this:

val mf1 = spark.read.parquet("/PATH_TO_THE_FOLDER/01-10-2018")
val mf2 = spark.read.parquet("/PATH_TO_THE_FOLDER/02-10-2018")

val result = mf1.union(mf2).distinct()

dateFrom and dateTo are dynamic, so I don't know how to organize the code correctly right now. Please help!


@y2k-shubham I tried to test the following code, but it raises an error:

import org.joda.time.{DateTime, Days}
import org.apache.spark.sql.{DataFrame, SparkSession}

val dateFrom = DateTime.parse("2018-10-01")
val dateTo = DateTime.parse("2018-10-05")

def getDaysInBetween(from: DateTime, to: DateTime): Int = Days.daysBetween(from, to).getDays

def getDatesInBetween(from: DateTime, to: DateTime): Seq[DateTime] = {
    val days = getDaysInBetween(from, to)
    (0 to days).map(day => from.plusDays(day).withTimeAtStartOfDay())
}

val datesInBetween: Seq[DateTime] = getDatesInBetween(dateFrom, dateTo)

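// note: the fold seed is spark.emptyDataFrame, which has no columns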
val unionDf: DataFrame = datesInBetween.foldLeft(spark.emptyDataFrame) { (intermediateDf: DataFrame, date: DateTime) =>
    intermediateDf.union(spark.read.parquet("PATH" + date.toString("yyyy-MM-dd") + "/*.parquet"))
}
unionDf.show()

ERROR:

org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns, but the first table has 0 columns and the second table has 20 columns;

It seems the DataFrame is empty at the start. How can I fix this issue?

2 Answers:

Answer 0 (score: 1):

While I haven't tested this code, it should work (maybe with slight modifications?):

import org.joda.time.{DateTime, Days}
import org.apache.spark.sql.{DataFrame, SparkSession}

// return no of days between two dates
def getDaysInBetween(from: DateTime, to: DateTime): Int = Days.daysBetween(from, to).getDays

// return sequence of dates between two dates
def getDatesInBetween(from: DateTime, to: DateTime): Seq[DateTime] = {
  val days = getDaysInBetween(from, to)
  (0 to days).map(day => from.plusDays(day).withTimeAtStartOfDay())
}

// read parquet data of given date-range from given path
// (you might want to pass the SparkSession in a different manner)
def readDataForDateRange(path: String, from: DateTime, to: DateTime)(implicit spark: SparkSession): DataFrame = {
  // get date-range sequence
  val datesInBetween: Seq[DateTime] = getDatesInBetween(from, to)

  // read data of from-date (needed because schema of all DataFrames should be same for union)
  val fromDateDf: DataFrame = spark.read.parquet(path + "/" + datesInBetween.head.toString("yyyy-MM-dd"))

  // read and union remaining dataframes (functionally)
  val unionDf: DataFrame = datesInBetween.tail.foldLeft(fromDateDf) { (intermediateDf: DataFrame, date: DateTime) =>
    intermediateDf.union(spark.read.parquet(path + "/" + date.toString("yyyy-MM-dd")))
  }

  // return union-df
  unionDf
}
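
For example, a call might look like the sketch below. It assumes the per-day folders under /PATH_TO_THE_FOLDER are named yyyy-MM-dd, since that is the pattern the function above uses to build paths; adjust the pattern if your folders use DD-MM-YYYY as in the question.

implicit val spark: SparkSession = SparkSession.builder()
  .appName("read-date-range")
  .getOrCreate()

// read all days from 2018-10-01 to 2018-10-05 (inclusive) as one DataFrame
val df: DataFrame = readDataForDateRange(
  "/PATH_TO_THE_FOLDER",
  DateTime.parse("2018-10-01"),
  DateTime.parse("2018-10-05")
)
df.show()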

Reference: How to calculate 'n' days interval date in functional style?

Answer 1 (score: 1):

import java.time.LocalDate
import java.time.format.DateTimeFormatter

import org.apache.spark.sql.{DataFrame, SparkSession}

val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")

def dateRangeInclusive(start: String, end: String): Iterator[LocalDate] = {
  val startDate = LocalDate.parse(start, formatter)
  val endDate = LocalDate.parse(end, formatter)
  Iterator.iterate(startDate)(_.plusDays(1))
    .takeWhile(d => d.isBefore(endDate) || d.isEqual(endDate))
}

val spark = SparkSession.builder().getOrCreate()
val data: DataFrame = dateRangeInclusive("2018-10-01", "2018-10-05")
  .map(d => spark.read.parquet(s"/path/to/directory/${formatter.format(d)}"))
  .reduce(_ union _)

I would also suggest using the native JSR 310 API (part of Java SE since Java 8) instead of joda-time, since it is more modern and does not require an external dependency. Note that in this case, building the sequence of paths first and then doing map + reduce is probably simpler than the more general foldLeft-based solution.
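
As an aside (not part of the original answer), DataFrameReader.parquet also accepts multiple paths in a single call, so a sketch along these lines could avoid the explicit union entirely. It assumes the same /path/to/directory layout as above and that every per-day directory exists, since Spark will fail on a missing path:

// build the list of per-day paths, then read them all in one call
val paths: Seq[String] = dateRangeInclusive("2018-10-01", "2018-10-05")
  .map(d => s"/path/to/directory/${formatter.format(d)}")
  .toSeq

val data: DataFrame = spark.read.parquet(paths: _*)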

Additionally, you can use reduceOption; then, if the input date range is empty, you get an Option[DataFrame]. Also, if some of the input directories/files may be missing, you need to check for their existence before calling spark.read.parquet. If your data is on HDFS, you should probably use the Hadoop FS API:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val spark = SparkSession.builder().getOrCreate()
val fs = FileSystem.get(new Configuration(spark.sparkContext.hadoopConfiguration))
val data: Option[DataFrame] = dateRangeInclusive("2018-10-01", "2018-10-05")
  .map(d => s"/path/to/directory/${formatter.format(d)}")
  .filter(p => fs.exists(new Path(p)))
  .map(spark.read.parquet(_))
  .reduceOption(_ union _)
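
A short usage sketch (my addition, not from the original answer) for consuming the resulting Option[DataFrame], falling back to a message when no directory matched the date range:

data match {
  case Some(df) => df.show()
  case None     => println("No parquet directories found for the given date range")
}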