I have two tables:
[Date Master] is a table of dates. It also has a [Workday] column, which tells us whether a given date is an actual working day.
+-------------------------+---------+
| Master Date             | Workday |
+-------------------------+---------+
| 2020-03-16 00:00:00.000 |       1 |
| 2020-03-17 00:00:00.000 |       1 |
| 2020-03-18 00:00:00.000 |       1 |
| 2020-03-19 00:00:00.000 |       1 |
| 2020-03-20 00:00:00.000 |       1 |
| 2020-03-21 00:00:00.000 |       0 |
| 2020-03-22 00:00:00.000 |       0 |
| 2020-03-23 00:00:00.000 |       1 |
| 2020-03-24 00:00:00.000 |       1 |
| 2020-03-25 00:00:00.000 |       1 |
| 2020-03-26 00:00:00.000 |       1 |
| 2020-03-27 00:00:00.000 |       1 |
| 2020-03-28 00:00:00.000 |       0 |
| 2020-03-29 00:00:00.000 |       0 |
| 2020-03-30 00:00:00.000 |       1 |
| 2020-03-31 00:00:00.000 |       1 |
+-------------------------+---------+
The second table, [MAIN], is a kind of performance table in which we store the office attendance of our colleagues.
+--------+------------+------------+
| ID     | Start Date | End Date   |
+--------+------------+------------+
| 528950 | 2020-03-19 | 2020-03-23 |
+--------+------------+------------+
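I wrote a SELECT that should show the difference between [Start Date] and [End Date], using the values of the [Workday] column mentioned above.
Here comes the interesting part: this SELECT returns 4 days: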
+--------+---------------------------------------+
| ID     | Start Date - End Date (Business Days) |
+--------+---------------------------------------+
| 528950 |                                     4 |
+--------+---------------------------------------+
But if I do the calculation by hand, I get 3 days:
+-------------------------+---------+
| Master Date             | Workday |
+-------------------------+---------+
| 2020-03-19 00:00:00.000 |       1 |
| 2020-03-20 00:00:00.000 |       1 |
| 2020-03-21 00:00:00.000 |       0 |
| 2020-03-22 00:00:00.000 |       0 |
| 2020-03-23 00:00:00.000 |       1 |
+-------------------------+---------+
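To make the manual count reproducible, here is a minimal sketch in Python, using an in-memory SQLite database in place of the real server. The table names `date_master` and `attendance` and the snake_case column names are hypothetical stand-ins for [Date Master] and [MAIN]. Dates are stored as `YYYY-MM-DD` text. The sketch sums the [Workday] flag over the inclusive date range and, for comparison, computes the plain calendar difference. That calendar difference happens to be 4 for this row, but whether that is what the SELECT is actually doing is only a guess.

```python
import sqlite3


def build_db():
    """Create an in-memory database holding the sample rows from the post."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical names: date_master ~ [Date Master], attendance ~ [MAIN].
    conn.execute(
        "CREATE TABLE date_master (master_date TEXT PRIMARY KEY, workday INTEGER)"
    )
    conn.execute(
        "CREATE TABLE attendance (id INTEGER, start_date TEXT, end_date TEXT)"
    )
    # Dates flagged Workday = 0 in the posted sample.
    non_workdays = {"2020-03-21", "2020-03-22", "2020-03-28", "2020-03-29"}
    rows = []
    for day in range(16, 32):  # 2020-03-16 through 2020-03-31
        date_text = f"2020-03-{day:02d}"
        rows.append((date_text, 0 if date_text in non_workdays else 1))
    conn.executemany("INSERT INTO date_master VALUES (?, ?)", rows)
    conn.execute(
        "INSERT INTO attendance VALUES (528950, '2020-03-19', '2020-03-23')"
    )
    return conn


def business_days(conn):
    """Sum the workday flag over every calendar row in the inclusive range."""
    return conn.execute(
        """
        SELECT a.id, SUM(d.workday)
        FROM attendance AS a
        JOIN date_master AS d
          ON d.master_date BETWEEN a.start_date AND a.end_date
        GROUP BY a.id
        """
    ).fetchall()


def calendar_day_difference(conn):
    """Plain end-minus-start difference that ignores the workday flag."""
    return conn.execute(
        """
        SELECT id,
               CAST(julianday(end_date) - julianday(start_date) AS INTEGER)
        FROM attendance
        """
    ).fetchall()


conn = build_db()
print(business_days(conn))            # [(528950, 3)]
print(calendar_day_difference(conn))  # [(528950, 4)]
```

Joining against the calendar table and summing the flag gives 3, which matches the count by hand, while subtracting the two dates directly gives 4.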
What am I doing wrong? It is probably something simple, but I am stuck.
Thanks.