Based on column

Time: 2018-05-29 13:14:19

Tags: scala apache-spark apache-spark-sql

This is my input DataFrame:

DataPartition   TimeStamp   OrganizationId  SegmentId   GeographicSegment_geographyId   IsSubtracted    Sequence    FFAction|!|
Japan   2018-05-29T09:17:18+00:00   4295876592  27  100002  false   1   O|!|
Japan   2018-05-29T07:52:45+00:00   4295876592  23  null    null    null    D|!|
Japan   2018-05-29T09:17:18+00:00   4295876592  28  100025  false   1   O|!|
Japan   2018-05-29T08:05:17+00:00   4295876592  14  null    null    null    D|!|
Japan   2018-05-29T09:17:18+00:00   4295876592  26  100105  false   1   O|!|
Japan   2018-05-29T09:17:18+00:00   4295876592  6   100131  false   2   O|!|
Japan   2018-05-29T09:17:18+00:00   4295876592  27  112018  false   2   O|!|
Japan   2018-05-29T09:17:18+00:00   4295876592  11  null    null    null    D|!|
Japan   2018-05-29T09:17:18+00:00   4295876592  6   100023  false   1   O|!|
Japan   2018-05-29T08:05:17+00:00   4295876592  25  null    null    null    D|!|
Japan   2018-05-29T09:17:18+00:00   4295876592  29  100029  false   1   O|!|
Japan   2018-05-29T08:05:17+00:00   4295876592  24  null    null    null    D|!|
Japan   2018-05-29T07:52:45+00:00   4295876592  22  null    null    null    D|!|
Japan   2018-05-29T09:11:00+00:00   4295876592  27  100020  false   2   O|!|
Japan   2018-05-29T08:05:17+00:00   4295876592  7   100148  false   1   O|!|
Japan   2018-05-29T08:05:17+00:00   4295876592  21  null    null    null    D|!|

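The logic is that, for rows with the same OrganizationId and SegmentId, I need to get the latest records ordered by the TimeStamp column, with one condition. If all rows for the same OrganizationId and SegmentId share a single TimeStamp, I need to keep all of them; but if they have more than one distinct TimeStamp, I need to keep only the rows with the latest TimeStamp.

For example, SegmentId 27 has three rows: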
Japan   2018-05-29T09:17:18+00:00   4295876592  27  100002  false   1   O|!|
Japan   2018-05-29T09:17:18+00:00   4295876592  27  112018  false   2   O|!|
Japan   2018-05-29T09:11:00+00:00   4295876592  27  100020  false   2   O|!|

So in the above example the OrganizationId and SegmentId are the same, but there are two distinct TimeStamp values, so I need to keep the two rows with the latest TimeStamp:

Japan   2018-05-29T09:17:18+00:00   4295876592  27  100002  false   1   O|!|
Japan   2018-05-29T09:17:18+00:00   4295876592  27  112018  false   2   O|!|

In another case, we have two records for SegmentId 6:

Japan   2018-05-29T09:17:18+00:00   4295876592  6   100131  false   2   O|!|
Japan   2018-05-29T09:17:18+00:00   4295876592  6   100023  false   1   O|!|

In this case the OrganizationId and SegmentId are also the same, but there is only one TimeStamp, so I need to keep both rows.

Finally, this is my expected output DataFrame:

DataPartition   TimeStamp   OrganizationId  SegmentId   GeographicSegment_geographyId   IsSubtracted    Sequence    FFAction|!|
Japan   2018-05-29T09:17:18+00:00   4295876592  27  100002  false   1   O|!|
Japan   2018-05-29T07:52:45+00:00   4295876592  23  null    null    null    D|!|
Japan   2018-05-29T09:17:18+00:00   4295876592  28  100025  false   1   O|!|
Japan   2018-05-29T08:05:17+00:00   4295876592  14  null    null    null    D|!|
Japan   2018-05-29T09:17:18+00:00   4295876592  26  100105  false   1   O|!|
Japan   2018-05-29T09:17:18+00:00   4295876592  6   100131  false   2   O|!|
Japan   2018-05-29T09:17:18+00:00   4295876592  27  112018  false   2   O|!|
Japan   2018-05-29T09:17:18+00:00   4295876592  11  null    null    null    D|!|
Japan   2018-05-29T09:17:18+00:00   4295876592  6   100023  false   1   O|!|
Japan   2018-05-29T08:05:17+00:00   4295876592  25  null    null    null    D|!|
Japan   2018-05-29T09:17:18+00:00   4295876592  29  100029  false   1   O|!|
Japan   2018-05-29T08:05:17+00:00   4295876592  24  null    null    null    D|!|
Japan   2018-05-29T07:52:45+00:00   4295876592  22  null    null    null    D|!|
Japan   2018-05-29T08:05:17+00:00   4295876592  7   100148  false   1   O|!|
Japan   2018-05-29T08:05:17+00:00   4295876592  21  null    null    null    D|!|

This is the code I tried to use, but with it I lose the records that have the same SegmentId and the same TimeStamp:

val windowSpec3 = Window.partitionBy("OrganizationId", "SegmentId", "TimeStamp").orderBy(unix_timestamp($"TimeStamp", "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp").desc)
val latestForEachKey = latestForEachKey2.withColumn("rank", row_number().over(windowSpec3)).filter($"rank" === 1).drop("rank")
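To make the requirement concrete, the rule described above can be sketched on plain Scala collections, without Spark. The `Rec` case class, the `latestPerKey` helper, and the `LatestPerKeySketch` object are hypothetical names introduced only for illustration, and the sketch assumes every TimeStamp is an ISO-8601 string with an offset, as in the sample data.

```scala
import java.time.OffsetDateTime

object LatestPerKeySketch {
  // Hypothetical record type holding only the columns relevant to the rule.
  final case class Rec(organizationId: Long, segmentId: Int, timeStamp: String, geographyId: Option[Int])

  // Assumption: the TimeStamp strings are ISO-8601 with an offset, e.g. "+00:00".
  private def epochSeconds(ts: String): Long = OffsetDateTime.parse(ts).toEpochSecond

  // For each (OrganizationId, SegmentId) key, keep every row whose TimeStamp
  // equals the latest TimeStamp seen for that key. Ties are all kept.
  // The order of the returned rows is not guaranteed.
  def latestPerKey(rows: Seq[Rec]): Seq[Rec] =
    rows
      .groupBy(r => (r.organizationId, r.segmentId))
      .values
      .flatMap { group =>
        val latest = group.map(r => epochSeconds(r.timeStamp)).max
        group.filter(r => epochSeconds(r.timeStamp) == latest)
      }
      .toSeq

  // The SegmentId 27 and SegmentId 6 rows from the question.
  val sample: Seq[Rec] = Seq(
    Rec(4295876592L, 27, "2018-05-29T09:17:18+00:00", Some(100002)),
    Rec(4295876592L, 27, "2018-05-29T09:17:18+00:00", Some(112018)),
    Rec(4295876592L, 27, "2018-05-29T09:11:00+00:00", Some(100020)),
    Rec(4295876592L, 6, "2018-05-29T09:17:18+00:00", Some(100131)),
    Rec(4295876592L, 6, "2018-05-29T09:17:18+00:00", Some(100023))
  )

  def main(args: Array[String]): Unit =
    latestPerKey(sample).foreach(println)
}
```

On these five rows the only one dropped is the 09:11:00 row for SegmentId 27; both rows for SegmentId 6 are kept because they share the same TimeStamp.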

1 answer:

Answer 0: (score: 0)

Remove TimeStamp from partitionBy and try:

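import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{dense_rank, unix_timestamp}

// Partition by the key only; `$` assumes `import spark.implicits._` is in scope.
val windowSpec3 = Window.partitionBy("OrganizationId", "SegmentId")
  .orderBy(unix_timestamp($"TimeStamp", "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp").desc)

// dense_rank gives tied TimeStamps the same rank, so every latest row is kept.
val latestForEachKey = df.withColumn("rank", dense_rank().over(windowSpec3))
  .filter($"rank" === 1).drop("rank")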
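
Two things in the question's window lose rows. Partitioning by TimeStamp puts each distinct timestamp in its own partition, and `row_number()` gives every row in a partition a different number, so only one of several tied rows can have rank 1. `dense_rank()` gives tied rows the same rank. Below is a minimal self-contained sketch of this on the SegmentId 27 and 6 rows. The object name, the `latestPerKey` helper, and the local SparkSession settings are assumptions made for illustration. The sketch also orders by a plain `cast("timestamp")` rather than the `unix_timestamp` pattern above, on the assumption that Spark's string-to-timestamp cast accepts the `+00:00` offset form.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank}

object DenseRankSketch {
  // Keep every row that carries the latest TimeStamp within its
  // (OrganizationId, SegmentId) group. Ties share rank 1 under dense_rank.
  def latestPerKey(df: DataFrame): DataFrame = {
    // Assumption: casting the ISO-8601 string to timestamp parses the offset.
    val w = Window
      .partitionBy("OrganizationId", "SegmentId")
      .orderBy(col("TimeStamp").cast("timestamp").desc)
    df.withColumn("rank", dense_rank().over(w))
      .filter(col("rank") === 1)
      .drop("rank")
  }

  def main(args: Array[String]): Unit = {
    // Local session for the sketch only; a real job would reuse its own session.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dense-rank-sketch")
      .getOrCreate()
    import spark.implicits._

    // Subset of the question's columns and rows, enough to show the tie handling.
    val df = Seq(
      (4295876592L, 27, "2018-05-29T09:17:18+00:00", "100002"),
      (4295876592L, 27, "2018-05-29T09:17:18+00:00", "112018"),
      (4295876592L, 27, "2018-05-29T09:11:00+00:00", "100020"),
      (4295876592L, 6, "2018-05-29T09:17:18+00:00", "100131"),
      (4295876592L, 6, "2018-05-29T09:17:18+00:00", "100023")
    ).toDF("OrganizationId", "SegmentId", "TimeStamp", "GeographicSegment_geographyId")

    latestPerKey(df).show(false)
    spark.stop()
  }
}
```

With this input the result should contain the two 09:17:18 rows for SegmentId 27 and both rows for SegmentId 6, and drop the 09:11:00 row.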