在我的工作中,经常出现的一件事是需要对数据集执行中位数,四分位数和百分位数运算。我也被迫使用excel(不是我的选择),并且出于功能原因我也被迫使用数据透视表('dem slicers,hot damn)。
Excel(至少Excel 2010)在其数据透视表中没有此功能。一些附加组件,例如power pivot apparently add this,但不是每个人都有工作环境,您可以随时安装。还有其他一些值得注意的解决方法:
这些变通方法很棒,但是当我有超过10万行的数据时,它开始花费一些时间来处理我的烤面包机 - 而且我想要数据透视表的全部原因是我可以有一个光滑的切片机驱动的仪表板那很漂亮,反应灵敏。同样重要的是,当您获得高于65k +数据点时,使用Excel的条件和公式会破坏它,即使工作表可以达到1M行数据。
基本上,我想要一个强大而可靠的方法来计算数据透视表中给定选定变量的百分位数据,我希望它不会以冰川速度移动。我还希望它在数据子类别> 65K行时工作。
答案 0 :(得分:1)
我的问题解决方案是用VBA编写的。这个答案将分为两部分:
在(几乎)一个句子中:我的方法是数据透视表的目标,在其旁边创建一个适当大小的范围,并使用目标值列向下钻取数据透视表该子类别的数据 - 从这里百分位数据通过QUICKSELECT算法计算并粘贴到命名范围并清理/删除深入数据表。然后洗;冲洗;重复。
在多个句子中:
我们的主要功能( FindPivotPercentiles )将:目标数据透视表,值列的标题字符串作为输入,以计算值,以及可选的百分位数来计算。
我们通过调用 PivotEdgeRange 函数找到结果输出范围来计算我们计算的百分位数,该函数采用目标数据透视表(用于计算输出范围的垂直范围)和百分位数(使用计算范围的水平延伸 - 因此计算0.25,.5和.75百分位数的10类长度数据透视表将返回数据透视表右侧的3 x 10目标范围。
我们还将目标列设置为数据透视表中的最后一个值列,这仅用于深入查看数据,只要选择了适当的类别,所选的实际值无关紧要。
然后我们遍历数据列范围内的行,调用函数 PivotPercentiles 。该函数是实际向下钻入数据透视表的位,获取从调用函数提供的目标单元格,以及我们希望计算的实际值的标题字符串。向下钻取是使用targetCell.ShowDetail = True
执行的,Excel创建的工作表由子程序 autoRenamePivotDetail 自动重命名(我想这不是绝对必要的,但我发现不是超显式关于这些东西后来叮咬我的屁股)。此钻取数据中的目标数据列从先前提供的标头字符串中选择,并循环以创建数组。
使用QUICKSELECT算法从数组( QuickSelect,QSPartition QSSwapElems )计算所需的百分位值,该算法类似QUICKSORT,但在确定时停止已找到第k个值。
返回数据透视表中该行的百分位数值数组后,不再需要向下钻取数据表并将其删除。百分位数返回到顶级调用函数,以粘贴到 FindPivotPercentiles 函数中确定的“结果范围”中,然后该过程将重复下一个可用的数据行,直到所有数据都用完为止。
最后,完成所有这些操作后,找到结果范围和实际数据透视表范围的并集,并给出一个名称,以便每次使用切片器更改显示的数据透视数据时,命名范围是动态更新的,因此任何依赖关系/图形也会更新。
输出看起来有点像这样:
在此示例中,红色范围是向下钻取的目标列(但不一定是用于计算的实际目标数据),绿色是结果范围,蓝色是输出命名范围,供用户以后使用< / p>
Sub FindPivotPercentiles(ByRef targetWorksheet As Worksheet, _
ByRef targetPivot As PivotTable, _
ByRef columnToCalc As String, _
Optional ByRef percentInputs As Variant = -1)
'**********************************************************
'** Finds all Percentile Data for a given pivot table **
'**********************************************************
Dim valueTitle As String ' detail sheet column title to calc values from
Dim targetColumn As Range ' Range of above column
Dim resultColumn As Range ' output range for percentile values
Dim wholePivotRange As Range ' selects entire pivot for naming range
Dim percentiles() As Single ' desired percentile levels to calculate
Dim j As Long ' iterator
Dim k As Long ' iterator
' ------------------------------------------------------------------------------------
' User selectable things, choose the column to calculate percentiles for (eg "DAP")
' ------------------------------------------------------------------------------------
valueTitle = columnToCalc
' ------------------------------------------------------------------------------------
' Initalize the percentile array with default values if no user specified option present
' ------------------------------------------------------------------------------------
If IsArray(percentInputs) = False Then
If percentInputs = -1 Then
ReDim percentiles(1 To 7) As Single
percentiles(1) = 0#
percentiles(2) = 0.1
percentiles(3) = 0.25
percentiles(4) = 0.5
percentiles(5) = 0.75
percentiles(6) = 0.9
percentiles(7) = 1#
End If
Else
' ------ otherwise make them the input values
ReDim percentiles(1 To UBound(percentInputs) - LBound(percentInputs) + 1) As Single
For j = 1 To UBound(percentInputs) - LBound(percentInputs) + 1
percentiles(j) = percentInputs(j - (1 - LBound(percentInputs)))
Next j
End If
Set resultColumn = PivotEdgeRange(targetWorksheet, targetPivot, percentiles)
Set targetColumn = targetPivot.DataBodyRange.Resize(, 1)
' ------------------------------------------------------------------------------------
' Clean up work area, create some percentile headers and appropriately format then,
' then calculate percentile values and paste in area to right of pivot table
' ------------------------------------------------------------------------------------
resultColumn.Resize(, 100).EntireColumn.Clear
targetColumn.Resize(1).Offset(-1, 0).Copy
resultColumn.Resize(1).Offset(-1, 0).PasteSpecial xlPasteFormats
resultColumn.Resize(1).Offset(-1, 0) = percentiles
For j = 1 To targetColumn.Rows.count
resultColumn.Rows(j) = PivotPercentiles(targetColumn.Rows(j), targetWorksheet, valueTitle, percentiles, True)
Debug.Print "calculating row: " & j
Next j
' ------ format nicely, name range for later use
With resultColumn
.NumberFormat = "0.00"
Set resultColumn = .Resize(.Rows.count + 1).Offset(-1, 0)
.Name = valueTitle & "_Percents"
End With
Set wholePivotRange = targetPivot.TableRange1
Set wholePivotRange = Union(wholePivotRange, resultColumn)
wholePivotRange.Name = valueTitle & "_WholeTable"
End Sub
Function PivotEdgeRange(ByRef targetWorksheet As Worksheet, _
ByRef targetPivot As PivotTable, _
Optional ByRef percentInputs As Variant = -1) As Range
'*************************************************************************
'** For a given pivot table, set a range to the right most empty column **
'*************************************************************************
Dim numDataCols As Long ' number of value fields in table
Dim howManyOutputColumns As Integer ' how many percentile calculation cols
If IsArray(percentInputs) = False Then
If percentInputs = -1 Then: howManyOutputColumns = 7
Else
howManyOutputColumns = UBound(percentInputs)
End If
numDataCols = targetPivot.DataBodyRange.Columns.count
Set PivotEdgeRange = targetPivot.DataBodyRange.Resize(, howManyOutputColumns).Offset(, numDataCols)
End Function
Function PivotPercentiles(ByRef targetCell As Range, _
ByRef pivotCacheSheet As Worksheet, _
ByRef valueTitle As String, _
Optional ByRef percentInputs As Variant = -1, _
Optional ByRef suppressErrors As Boolean = False) As Double()
'************************************************************************
'** Finds the Percentile Data for a given grouping in a pivot table **
'************************************************************************
Dim targetColumn As Range ' targetColumn for calculation
Dim numberOfVals As Long ' number of values, used to determine array size
Dim j As Long ' iterator
Dim badData As Boolean ' flags that non-numeric data was in target column
Dim range2Array() As Double ' coverts target range to array for faster calculation
Dim percentiles() As Single ' desired percentile levels to calculate
Dim percentOutputs() As Double ' holds calculated percentiles
Dim targetkth As Long ' kth smallest value to extract from quickselect
' ------------------------------------------------------------------------------------
' Initalize the percentile array with default values if no user specified option
' present. This is already done in FindPivotPercentiles, so is mostly redundant, but
' I have left this in so you may have the option of using this as a stand alone function
' ------------------------------------------------------------------------------------
If IsArray(percentInputs) = False Then
If percentInputs = -1 Then
ReDim percentiles(1 To 7) As Single
percentiles(1) = 0#
percentiles(2) = 0.1
percentiles(3) = 0.25
percentiles(4) = 0.5
percentiles(5) = 0.75
percentiles(6) = 0.9
percentiles(7) = 1#
End If
Else
' ------ otherwise make them the input values
ReDim percentiles(1 To UBound(percentInputs) - LBound(percentInputs) + 1) As Single
For j = 1 To UBound(percentInputs) - LBound(percentInputs) + 1
percentiles(j) = percentInputs(j - (1 - LBound(percentInputs)))
Next j
End If
ReDim percentOutputs(1 To UBound(percentiles)) As Double
' ------------------------------------------------------------------------------------
' Show detail in target cell so we can compute the percentile data values. Show detail
' creates a new sheet with name "Sheet*" - we switch to this, and compute the medians
' in this space, before deleting it.
' ------------------------------------------------------------------------------------
targetCell.ShowDetail = True
Call autoRenamePivotDetail
Set pivotCacheSheet = ThisWorkbook.Sheets("PivotDetail")
Set targetColumn = pivotCacheSheet.ListObjects(1).ListColumns(valueTitle).DataBodyRange
numberOfVals = targetColumn.Rows.count
' ----- convert range to array, quicker computation time
ReDim range2Array(1 To numberOfVals) As Double
For j = 1 To targetColumn.Rows.count
Select Case VarType(targetColumn(j))
Case 2 To 4
range2Array(j) = CDbl(targetColumn(j))
Case 5
range2Array(j) = targetColumn(j)
Case 8
range2Array(j) = CDbl(targetColumn(j))
Case Else
range2Array(j) = Empty
badData = True
End Select
Next j
' ------------------------------------------------------------------------------------
' Now we have an array to play with, loop over desired percentile values and calculate
' If arrays are no larger than 10, the .Percentile_Inc function can be used instead -
' this is neccessary as QuickSelect breaks when less than 3 array size.
' ------------------------------------------------------------------------------------
For j = 1 To UBound(percentiles)
If UBound(range2Array) < 10 Then
percentOutputs(j) = Application.WorksheetFunction.Percentile_Inc(range2Array, percentiles(j))
Else
targetkth = percentiles(j) * UBound(range2Array)
If targetkth = 0 Then: targetkth = 1
percentOutputs(j) = QuickSelect(range2Array, targetkth, 1, UBound(range2Array))
End If
Next j
' ------ clean up
Application.DisplayAlerts = False
pivotCacheSheet.Delete
Application.DisplayAlerts = True
If badData And suppressErrors <> True Then
MsgBox "Bad data (non-numeric) was found in the target range. Please ensure better cleaning of input data."
End If
' ----- output values to calling function
PivotPercentiles = percentOutputs
End Function
Sub autoRenamePivotDetail()
'*****************************************************************************
'** Automatically rename the newly created detailed pivot data. This **
'** relies on all worksheets otherwise having sensible names, it will ruin **
'** your sheet names otherwise **
'*****************************************************************************
Dim detailedPivotFound As Boolean ' True if pivot detail sheet found
Dim wSheet As Worksheet ' Worksheet iterator
Dim renameSheet As Worksheet ' Sheet to actually rename
detailedPivotFound = False
If CheckPresent("PivotDetail", "Worksheet") Then
Err.Raise Number:=2000, _
Source:="AlreadyPresent", _
Description:="Cannot rename worksheet; 'PivotDetail sheet already present"
End If
' ------ cycle through, change name to pivotsheet
For Each wSheet In ThisWorkbook.Worksheets
If wSheet.Name Like "Sheet*" And detailedPivotFound = True Then
Err.Raise Number:=2000, _
Source:="AlreadyPresent", _
Description:="Cannot expand pivot detail; spare 'sheet' name already present."
End If
If wSheet.Name Like "Sheet*" Then
detailedPivotFound = True
Set renameSheet = wSheet
End If
Next wSheet
renameSheet.Name = "PivotDetail"
' ------------------------------------------------------------------------------------
' Error handling for cases which do not have a CheckPresent datatype defined for them
' as of yet, or if multiple "sheet*" names are found
' ------------------------------------------------------------------------------------
AlreadyPresent_End:
Exit Sub
AlreadyPresent_Err:
MsgBox Prompt:="Error number " & Err.Number & " was raised. " & _
vbCrLf & "Source: " & Err.Source & vbCrLf & _
"Description: " & Err.Description
Resume AlreadyPresent_End
End Sub
Function QuickSelect(list() As Double, k As Long, startInterval As Long, endInterval As Long) As Double
'**********************************************************************
'** Uses a partial form of QUICKSELECT to find k'th percentile value **
'**********************************************************************
Dim pivotInterval As Long ' where the current pivotpoint is
Dim splitInterval As Long ' where to split the interval
Dim notFinished As Boolean: notFinished = True ' flips to true once kth val found
' ------------------------------------------------------------------------------------
' Adapted from:
' https://stackoverflow.com/questions/3779763/fast-algorithm-for-computing-percentiles-to-remove-outliers
' More information on the QuickSelect algorithm: https://en.wikipedia.org/wiki/Quickselect
' ------------------------------------------------------------------------------------
While notFinished
pivotInterval = CLng(Floor(startInterval + endInterval) / 2)
' ------ make sure min and max cases handled properly
If k = 1 Then pivotInterval = CLng(Floor(startInterval + endInterval) / 2)
If k = endInterval Then pivotInterval = CLng(Ceiling(startInterval + endInterval) / 2)
splitInterval = QSPartition(list, startInterval, endInterval, pivotInterval)
If k < splitInterval Then
endInterval = splitInterval
ElseIf k > splitInterval Then
startInterval = splitInterval + 1
Else
QuickSelect = list(k)
notFinished = False
End If
' ------ break loop for max case.
If k = startInterval And k = endInterval Then
QuickSelect = list(k)
notFinished = False
End If
Wend
End Function
Function QSPartition(list() As Double, startInterval As Long, endInterval As Long, _
pivotInterval As Long) As Long
'**********************************************************
'** Swaps higher and lower elements within the partition **
'**********************************************************
Dim pivotValue As Double ' the value of the pivot point
Dim newPivot As Double ' the new pivot point after partitioning
Dim storeInterval As Double ' temp value to help swapping pivot value
Dim i As Double ' iterator
pivotValue = list(pivotInterval)
list(pivotInterval) = list(startInterval)
list(startInterval) = pivotValue
storeInterval = startInterval + 1
While (storeInterval < endInterval) And list(storeInterval) <= pivotValue
storeInterval = storeInterval + 1
Wend
For i = storeInterval + 1 To endInterval
If list(i) <= pivotValue Then
Call QSSwapElems(list, i, storeInterval)
storeInterval = storeInterval + 1
End If
Next i
newPivot = storeInterval - 1
list(startInterval) = list(newPivot)
list(newPivot) = pivotValue
QSPartition = newPivot
End Function
Sub QSSwapElems(ByRef list() As Double, ByVal i As Long, ByVal j As Long)
'************************
'** Swap list elements **
'************************
Dim temp As Double
temp = list(i)
list(i) = list(j)
list(j) = temp
End Sub
Function CheckPresent(checkName As String, checkType As String) As Boolean
'*******************************************************************
'** Checks that <checkname> is present in object type <checktype> **
'*******************************************************************
Dim index As Long ' iterator
CheckPresent = False
index = 1
Select Case checkType
Case "Connections"
' - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
' Find whether a given named Connection exists
' - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Dim conName As Connections
Dim conParent As Variant
Set conParent = ThisWorkbook.Connections
While (CheckPresent <> True And index <= conParent.count)
If conParent(index).Name = checkName Then: CheckPresent = True
index = index + 1
Wend
Case "Worksheet"
' - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
' Find whether a given named worksheet exists
' - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Dim wName As Worksheet
Dim sheetParents As Sheets
Set sheetParents = ThisWorkbook.Worksheets
While (CheckPresent <> True And index <= sheetParents.count)
If sheetParents(index).Name = checkName Then: CheckPresent = True
index = index + 1
Wend
' - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
' handle cases where no data type exists,
' - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Case Else
Err.Raise Number:=vbObjectError + 1001, _
Source:="TypeNotPresent", _
Description:="This function does not handle variable type: " & checkType & _
". Please create a case to handle it in the CheckPresent function."
End Select
' ------------------------------------------------------------------------------------
' Error handling for cases which do not have a CheckPresent datatype defined for them
' as of yet.
' ------------------------------------------------------------------------------------
TypeNotPresent_End:
Exit Function
TypeNotPresent_Err:
MsgBox Prompt:="Error number " & Err.Number & " was raised. " & _
vbCrLf & "Source: " & Err.Source & vbCrLf & _
"Description: " & Err.Description
Resume TypeNotPresent_End
End Function