Since I'm new to the Spark Framework, I need your help.

My folder contains many parquet files. The names of these files all follow the same format: DD-MM-YYYY. For example: '01-10-2018', '02-10-2018', '03-10-2018', etc.

My application has two input parameters, dateFrom and dateTo.
When I try the following code, the application hangs; it seems to scan all the files in the folder:

val mf = spark.read.parquet("/PATH_TO_THE_FOLDER/*")
  .filter($"DATE".between(dateFrom + " 00:00:00", dateTo + " 23:59:59"))
mf.show()

I need to get the data for the period as fast as possible.

I think it would be good to split the period into days, read the files separately, and then union them, like this:

val mf1 = spark.read.parquet("/PATH_TO_THE_FOLDER/01-10-2018");
val mf2 = spark.read.parquet("/PATH_TO_THE_FOLDER/02-10-2018");
val finalDf = mf1.union(mf2).distinct();
dateFrom and dateTo are dynamic, so I don't know how to organize the code properly right now. Please help!
@y2k-shubham I tried to test the following code, but it raises an error:
import org.joda.time.{DateTime, Days}
import org.apache.spark.sql.{DataFrame, SparkSession}
val dateFrom = DateTime.parse("2018-10-01")
val dateTo = DateTime.parse("2018-10-05")
def getDaysInBetween(from: DateTime, to: DateTime): Int = Days.daysBetween(from, to).getDays
def getDatesInBetween(from: DateTime, to: DateTime): Seq[DateTime] = {
  val days = getDaysInBetween(from, to)
  (0 to days).map(day => from.plusDays(day).withTimeAtStartOfDay())
}
val datesInBetween: Seq[DateTime] = getDatesInBetween(dateFrom, dateTo)
val unionDf: DataFrame = datesInBetween.foldLeft(spark.emptyDataFrame) { (intermediateDf: DataFrame, date: DateTime) =>
  intermediateDf.union(spark.read.parquet("PATH" + date.toString("yyyy-MM-dd") + "/*.parquet"))
}
unionDf.show()
ERROR: org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns, but the first table has 0 columns and the second table has 20 columns;

It seems the DataFrame at the start is empty. How can I fix this?
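Would something like the following (untested) avoid the empty starting DataFrame — building one DataFrame per day and combining them with reduce instead of folding from spark.emptyDataFrame?

// untested idea: one DataFrame per day, combined with reduce,
// so no empty (0-column) DataFrame takes part in the union
val dailyDfs = datesInBetween.map(date =>
  spark.read.parquet("PATH" + date.toString("yyyy-MM-dd") + "/*.parquet"))
val unionDf2 = dailyDfs.reduce(_ union _)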
Answer 0 (score: 1)
Although I haven't tested this code, it should work (perhaps with slight modifications?):
import org.joda.time.{DateTime, Days}
import org.apache.spark.sql.{DataFrame, SparkSession}

// return no of days between two dates
def getDaysInBetween(from: DateTime, to: DateTime): Int = Days.daysBetween(from, to).getDays

// return sequence of dates between two dates
def getDatesInBetween(from: DateTime, to: DateTime): Seq[DateTime] = {
  val days = getDaysInBetween(from, to)
  (0 to days).map(day => from.plusDays(day).withTimeAtStartOfDay())
}

// read parquet data of given date-range from given path
// (you might want to pass the SparkSession in a different manner)
def readDataForDateRange(path: String, from: DateTime, to: DateTime)(implicit spark: SparkSession): DataFrame = {
  // get date-range sequence
  val datesInBetween: Seq[DateTime] = getDatesInBetween(from, to)

  // read data of from-date (needed because schema of all DataFrames should be same for union)
  val fromDateDf: DataFrame = spark.read.parquet(path + "/" + datesInBetween.head.toString("yyyy-MM-dd"))

  // read and union remaining dataframes (functionally)
  val unionDf: DataFrame = datesInBetween.tail.foldLeft(fromDateDf) { (intermediateDf: DataFrame, date: DateTime) =>
    intermediateDf.union(spark.read.parquet(path + "/" + date.toString("yyyy-MM-dd")))
  }

  // return union-df
  unionDf
}
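A hypothetical usage (the path and the dates below are just placeholders for the real inputs) could look like this:

// hypothetical call; "/PATH_TO_THE_FOLDER" and the dates are placeholders
implicit val sparkSession: SparkSession = SparkSession.builder().getOrCreate()
val rangeDf: DataFrame = readDataForDateRange(
  "/PATH_TO_THE_FOLDER",
  DateTime.parse("2018-10-01"),
  DateTime.parse("2018-10-05"))
rangeDf.show()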
Reference: How to calculate 'n' days interval date in functional style?
Answer 1 (score: 1)
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.{DataFrame, SparkSession}
val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")
def dateRangeInclusive(start: String, end: String): Iterator[LocalDate] = {
  val startDate = LocalDate.parse(start, formatter)
  val endDate = LocalDate.parse(end, formatter)
  Iterator.iterate(startDate)(_.plusDays(1))
    .takeWhile(d => d.isBefore(endDate) || d.isEqual(endDate))
}
val spark = SparkSession.builder().getOrCreate()
val data: DataFrame = dateRangeInclusive("2018-10-01", "2018-10-05")
  .map(d => spark.read.parquet(s"/path/to/directory/${formatter.format(d)}"))
  .reduce(_ union _)
I also suggest using the native JSR 310 API (part of Java SE since Java 8) instead of joda-time, because it is more modern and does not require external dependencies. Note that in this case, first creating the sequence of paths and then doing map + reduce is probably simpler than the more general foldLeft-based solution.
Additionally, you can use reduceOption; you will then get an Option[DataFrame], which handles the case where the input date range is empty. Also, if some input directories/files might be missing, you need to check for their existence before calling spark.read.parquet. If your data is on HDFS, you should probably use the Hadoop FS API:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
val spark = SparkSession.builder().getOrCreate()
val fs = FileSystem.get(new Configuration(spark.sparkContext.hadoopConfiguration))
val data: Option[DataFrame] = dateRangeInclusive("2018-10-01", "2018-10-05")
  .map(d => s"/path/to/directory/${formatter.format(d)}")
  .filter(p => fs.exists(new Path(p)))
  .map(spark.read.parquet(_))
  .reduceOption(_ union _)
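Since data is an Option[DataFrame] here, it has to be unwrapped before use; a minimal sketch, assuming the snippet above, could be:

// data is None if every directory in the range is missing (or the range is empty)
data match {
  case Some(df) => df.show()
  case None     => println("no input directories found for the given date range")
}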