This is my input data frame:
DataPartition TimeStamp OrganizationId SegmentId GeographicSegment_geographyId IsSubtracted Sequence FFAction|!|
Japan 2018-05-29T09:17:18+00:00 4295876592 27 100002 false 1 O|!|
Japan 2018-05-29T07:52:45+00:00 4295876592 23 null null null D|!|
Japan 2018-05-29T09:17:18+00:00 4295876592 28 100025 false 1 O|!|
Japan 2018-05-29T08:05:17+00:00 4295876592 14 null null null D|!|
Japan 2018-05-29T09:17:18+00:00 4295876592 26 100105 false 1 O|!|
Japan 2018-05-29T09:17:18+00:00 4295876592 6 100131 false 2 O|!|
Japan 2018-05-29T09:17:18+00:00 4295876592 27 112018 false 2 O|!|
Japan 2018-05-29T09:17:18+00:00 4295876592 11 null null null D|!|
Japan 2018-05-29T09:17:18+00:00 4295876592 6 100023 false 1 O|!|
Japan 2018-05-29T08:05:17+00:00 4295876592 25 null null null D|!|
Japan 2018-05-29T09:17:18+00:00 4295876592 29 100029 false 1 O|!|
Japan 2018-05-29T08:05:17+00:00 4295876592 24 null null null D|!|
Japan 2018-05-29T07:52:45+00:00 4295876592 22 null null null D|!|
Japan 2018-05-29T09:11:00+00:00 4295876592 27 100020 false 2 O|!|
Japan 2018-05-29T08:05:17+00:00 4295876592 7 100148 false 1 O|!|
Japan 2018-05-29T08:05:17+00:00 4295876592 21 null null null D|!|
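The logic is: for rows with the same OrganizationId and SegmentId, I need to get the latest record, ordered by the TimeStamp column, with one condition. If all rows for the same OrganizationId and SegmentId share a single TimeStamp, I need to keep all of them. If they have more than one distinct TimeStamp, I only need the rows with the latest TimeStamp.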
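The rule described above can be sketched in plain Scala collections, without Spark, to make the intended behaviour concrete. This is only an illustration under stated assumptions: `SegmentRow`, `LatestPerKey`, and `latestPerKey` are hypothetical names, and only the columns the rule depends on are modelled.

```scala
import java.time.OffsetDateTime

// Hypothetical row type; only the columns the rule depends on are modelled.
final case class SegmentRow(
  timeStamp: String,
  organizationId: Long,
  segmentId: Int,
  geographyId: Option[Int]
)

object LatestPerKey {
  // Parse the ISO-8601 offset timestamp into epoch seconds so it can be compared.
  private def epochSeconds(row: SegmentRow): Long =
    OffsetDateTime.parse(row.timeStamp).toEpochSecond

  // For each (organizationId, segmentId) group, keep every row that carries
  // the group's latest timestamp. Rows that tie on that timestamp all survive.
  def latestPerKey(rows: Seq[SegmentRow]): Seq[SegmentRow] =
    rows
      .groupBy(r => (r.organizationId, r.segmentId))
      .values
      .flatMap { group =>
        val latest = group.map(epochSeconds).max
        group.filter(r => epochSeconds(r) == latest)
      }
      .toSeq
}
```

The order of the returned rows is not specified, because `groupBy` produces an unordered map. Compare results as sets, or sort them, before checking them.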
For example, for SegmentId 27:
Japan 2018-05-29T09:17:18+00:00 4295876592 27 100002 false 1 O|!|
Japan 2018-05-29T09:17:18+00:00 4295876592 27 112018 false 2 O|!|
Japan 2018-05-29T09:11:00+00:00 4295876592 27 100020 false 2 O|!|
In the example above, OrganizationId and SegmentId are the same, but there are two distinct TimeStamp values, so I need to keep only the rows with the latest TimeStamp. The expected output would be:
Japan 2018-05-29T09:17:18+00:00 4295876592 27 100002 false 1 O|!|
Japan 2018-05-29T09:17:18+00:00 4295876592 27 112018 false 2 O|!|
In another case, for SegmentId 6, we have:
Japan 2018-05-29T09:17:18+00:00 4295876592 6 100131 false 2 O|!|
Japan 2018-05-29T09:17:18+00:00 4295876592 6 100023 false 1 O|!|
In this case OrganizationId and SegmentId are also the same, but there is only one TimeStamp, so I need to keep both rows.
Finally, this is my expected output data frame:
DataPartition TimeStamp OrganizationId SegmentId GeographicSegment_geographyId IsSubtracted Sequence FFAction|!|
Japan 2018-05-29T09:17:18+00:00 4295876592 27 100002 false 1 O|!|
Japan 2018-05-29T07:52:45+00:00 4295876592 23 null null null D|!|
Japan 2018-05-29T09:17:18+00:00 4295876592 28 100025 false 1 O|!|
Japan 2018-05-29T08:05:17+00:00 4295876592 14 null null null D|!|
Japan 2018-05-29T09:17:18+00:00 4295876592 26 100105 false 1 O|!|
Japan 2018-05-29T09:17:18+00:00 4295876592 6 100131 false 2 O|!|
Japan 2018-05-29T09:17:18+00:00 4295876592 27 112018 false 2 O|!|
Japan 2018-05-29T09:17:18+00:00 4295876592 11 null null null D|!|
Japan 2018-05-29T09:17:18+00:00 4295876592 6 100023 false 1 O|!|
Japan 2018-05-29T08:05:17+00:00 4295876592 25 null null null D|!|
Japan 2018-05-29T09:17:18+00:00 4295876592 29 100029 false 1 O|!|
Japan 2018-05-29T08:05:17+00:00 4295876592 24 null null null D|!|
Japan 2018-05-29T07:52:45+00:00 4295876592 22 null null null D|!|
Japan 2018-05-29T08:05:17+00:00 4295876592 7 100148 false 1 O|!|
Japan 2018-05-29T08:05:17+00:00 4295876592 21 null null null D|!|
This is the code I tried, but with it I lose records that have the same SegmentId and the same TimeStamp:
val windowSpec3 = Window.partitionBy("OrganizationId", "SegmentId", "TimeStamp").orderBy(unix_timestamp($"TimeStamp", "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp").desc)
val latestForEachKey = latestForEachKey2.withColumn("rank", row_number().over(windowSpec3)).filter($"rank" === 1).drop("rank")
Answer 0 (score: 0)
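Remove TimeStamp from partitionBy and use dense_rank instead of row_number. dense_rank gives the same rank to rows that tie on TimeStamp, so every row with the latest TimeStamp in a group is kept:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{dense_rank, unix_timestamp}
// Partition by the key columns only; order each group by TimeStamp, newest first
val windowSpec3 = Window.partitionBy("OrganizationId", "SegmentId")
  .orderBy(unix_timestamp($"TimeStamp", "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp").desc)
val latestForEachKey = df.withColumn("rank", dense_rank().over(windowSpec3))
  .filter($"rank" === 1).drop("rank")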
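To see why the choice of ranking function matters here, the two numbering schemes can be imitated in plain Scala on the sort keys of one group. This is a rough sketch of their semantics, not Spark's implementation. `RankDemo`, `rowNumberDesc`, and `denseRankDesc` are hypothetical names, and the keys are simplified stand-ins for timestamps.

```scala
object RankDemo {
  private val newestFirst: Ordering[Long] = Ordering[Long].reverse

  // row_number style: consecutive numbers 1, 2, 3, ... so tied keys get different numbers.
  def rowNumberDesc(keys: Seq[Long]): Seq[(Long, Int)] =
    keys.sorted(newestFirst).zipWithIndex.map { case (k, i) => (k, i + 1) }

  // dense_rank style: equal keys share a rank, and ranks have no gaps.
  def denseRankDesc(keys: Seq[Long]): Seq[(Long, Int)] = {
    val rankOf = keys.distinct.sorted(newestFirst).zipWithIndex.toMap
    keys.sorted(newestFirst).map(k => (k, rankOf(k) + 1))
  }
}
```

For a group like SegmentId 27, with two rows at the latest time and one earlier row, filtering on rank 1 keeps a single row under the row_number scheme and both tied rows under the dense_rank scheme.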