Matching forecast values to actual values

Asked: 2016-10-26 09:13:07

Tags: excel powerpivot powerquery

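I receive a daily file that forecasts values for a given set of categories. Where FileDate = FcstDate, the value FcstVal is in fact the true actual value. Right now I'm using Excel Power Query (XL'16: Get & Transform) to easily pull a few dozen files together into a table like the one below (400k+ rows, 18 levels in reality).

I need to be able to say that on 1-1 and 1-2, category AA | AC | null was forecast for 1-3 at 46 and 44 respectively, but the actual value was 43. Likewise for every other row. Most (but not all) unique row combinations are common across files. Eventually I'll have to worry about handling renamed levels...

The Table.Partition, Table.FillUp, and Table.FromPartitions Power Query functions express the logic perfectly, but Power Query is far too slow, because it seems to read each very large .xlsx file multiple times. It gets even worse because I need an index table of all the distinct category levels and forecast dates to partition on.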

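For now I've resorted to this formula in an Excel table: =SUMIFS([ActualVal], [Lvl1],[@[Lvl1]], [Lvl2],[@[Lvl2]], [Lvl3],[@[Lvl3]], [FileDt],[@[FcstDt]], [Eq],"Y") However, this requires setting all blanks to "null", altering values that begin with "=" or ">" and the like, and it takes hours to calculate.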

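I've been trying to learn PowerPivot / DAX because I understand it can filter & calculate large datasets efficiently. I'm hoping for a solution that sets the "context" of a DAX calculation to the same row I reference with the old-school Excel formula and moves that value into my "Wanted" column, but I haven't figured it out yet.

I'd strongly prefer a PowerPivot solution if possible, but failing that, I can sometimes make sense of python / pandas. Either way, we are stuck with Excel input from a third-party provider.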

    Lvl1 | Lvl2 | Lvl3 | FileDt | FcstDt | Eq | FcstVal | ActualVal | Wanted!
1-1 file: ___________________________________________________________________
     AA    AB     AD      1-1       1-1    Y     100        100          100
     AA    AC     AE      1-1       1-1    Y      50         50           50
     AA    AB    (null)   1-1       1-2          110                     105
     AA    AC    (null)   1-1       1-2         (null)                    45
     AA    AB    (null)   1-1       1-3          120                     105
     AA    AC    (null)   1-1       1-3           70                      43
1-2 file: ___________________________________________________________________
     AA    AB    (null)   1-2       1-2    Y     105        105          105
     AA    AC    (null)   1-2       1-2    Y      45         45           45
     AA    AB    (null)   1-2       1-3          113                   (null)
     AA    AC    (null)   1-2       1-3           44                      43
1-3 file: ___________________________________________________________________
 (missing row AA|AB!)     1-3       1-3    Y    (null)    (null)       (null)
     AA    AC    (null)   1-3       1-3    Y      43         43           43
     AA    AB    (null)   1-3       1-4          108                   (null)
     AA    AC    (null)   1-3       1-4           42                   (null)
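The lookup that the "Wanted" column describes can be sketched in plain Python, under the assumption that a row's actual is simply the FcstVal of the row with the same levels whose FileDt equals this row's FcstDt. The tuple layout and the name `add_wanted` are illustrative only and do not come from the workbook. `None` stands in for (null); because `None` is a valid dictionary key, blanks need no "null" placeholder here.

```python
# Each row: (lvl1, lvl2, lvl3, file_dt, fcst_dt, fcst_val), copied from the sample table.
rows = [
    ("AA", "AB", "AD", "1-1", "1-1", 100),
    ("AA", "AC", "AE", "1-1", "1-1", 50),
    ("AA", "AB", None, "1-1", "1-2", 110),
    ("AA", "AC", None, "1-1", "1-2", None),
    ("AA", "AB", None, "1-1", "1-3", 120),
    ("AA", "AC", None, "1-1", "1-3", 70),
    ("AA", "AB", None, "1-2", "1-2", 105),
    ("AA", "AC", None, "1-2", "1-2", 45),
    ("AA", "AB", None, "1-2", "1-3", 113),
    ("AA", "AC", None, "1-2", "1-3", 44),
    ("AA", "AC", None, "1-3", "1-3", 43),
    ("AA", "AB", None, "1-3", "1-4", 108),
    ("AA", "AC", None, "1-3", "1-4", 42),
]


def add_wanted(rows):
    """Append the matching actual (or None) to every row."""
    # Pass 1: a row is an actual when its file date equals its forecast date.
    actuals = {}
    for lvl1, lvl2, lvl3, file_dt, fcst_dt, value in rows:
        if file_dt == fcst_dt:
            actuals[(lvl1, lvl2, lvl3, fcst_dt)] = value
    # Pass 2: look up each row's actual by its levels and forecast date.
    return [
        row + (actuals.get((row[0], row[1], row[2], row[4])),)
        for row in rows
    ]


wanted = add_wanted(rows)
```

Each row is visited twice and each lookup is a single hash probe, so the work grows linearly with the row count instead of rescanning the whole table for every row the way a per-row SUMIFS does.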

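Edit:

I'll share my code, since some parts may be useful to others and my problem may lie in other parts.

My strategy is to load a set of workbooks based on a table in the open Excel workbook. I apply a simple function to extract the table I want from each workbook's contents, and then apply another function that does as much processing as possible on each table while they are still separate, on the theory that multi-threading can be put to better use while they are still independent (is that right?).

That ends the first query: Workbooks. I would rather stop here and use PowerPivot if it can do the rest (with a Table.Combine at the end if need be).

In Power Query, I have to combine the tables twice. The first combination has all the fields, while the second is the distinct set of grouping fields across all the tables (without the value or as-of date fields). A single table (i.e., the first one) can't be used for this, because a grouping combination may exist in a later table that isn't in the first, and vice versa. This distinct table gets an index.

I join the second to the first via Table.NestedJoin & extract only the index from the joined column. This lets me split the data into partitions that each hold only one forecast date & group. Here I can use FillDown, because the tables were pre-sorted in descending date order in the Prep_Data_Table function, so the actual value (if there is one) flows to the other rows in the same group and no further.

After that, it's just a matter of recombining the tables.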
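As a rough illustration of the index, partition, and fill-down strategy described above, here is a minimal Python sketch. It follows the prose and partitions on (category index, forecast period); the record layout, the field names, and the function `fill_down_actuals` are assumptions made for the example and are not taken from the queries.

```python
from itertools import groupby


def fill_down_actuals(records, category_fields):
    """Copy each partition's actual onto the forecast rows of that partition."""
    # 1. Index the distinct category combinations across ALL files.
    index = {}
    for rec in records:
        key = tuple(rec[f] for f in category_fields)
        index.setdefault(key, len(index))

    # 2. Partition key: one category combination and one forecast period.
    def part_key(rec):
        return (index[tuple(rec[f] for f in category_fields)], rec["period"])

    # 3. Sort so that, within a partition, the row holding the actual comes first.
    ordered = sorted(records, key=lambda r: (part_key(r), r["actual"] is None))

    # 4. Fill down inside each partition only, so values never leak across groups.
    out = []
    for _, group in groupby(ordered, key=part_key):
        last = None
        for rec in group:
            if rec["actual"] is not None:
                last = rec["actual"]
            out.append({**rec, "actual": last})
    return out


records = [
    {"lvl1": "AA", "lvl2": "AC", "period": "1-3", "as_of": "1-1", "forecast": 70, "actual": None},
    {"lvl1": "AA", "lvl2": "AC", "period": "1-3", "as_of": "1-2", "forecast": 44, "actual": None},
    {"lvl1": "AA", "lvl2": "AC", "period": "1-3", "as_of": "1-3", "forecast": 43, "actual": 43},
    {"lvl1": "AA", "lvl2": "AB", "period": "1-3", "as_of": "1-2", "forecast": 113, "actual": None},
]
filled = fill_down_actuals(records, ["lvl1", "lvl2"])
```

Resetting `last` at the start of every partition is what keeps an actual from flowing into a different group or forecast date.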

CODE:

FieldMetadata holds the data type & ordering information for the fields. Sources holds the path names & whether to load each listed file.

ImportParameters:

[ThisWB = Excel.CurrentWorkbook(),
Sources = ThisWB{[Name="Sources"]}[Content],
FieldMetadata = ThisWB{[Name="FieldMetadata"]}[Content],
FieldTypes = Table.ToRows(GetCfg({"Type"})),
CategoryFields = List.Buffer(List.Transform(List.Select(List.Transform(FieldTypes, each {List.First(_), TypeFromString(List.Last(_))}), each List.Last(_) = type text), each List.First(_))),
CategoryFieldTypes = List.Buffer(List.Transform(FieldTypes, (_) => {List.First(_), TypeFromString(List.Last(_))}))]

GetCfg:

let
    Cfg = (Columns as list) as table =>
let
    FilterList = List.Transform(Columns, each "[" & _ & "] <> null"),
    ExpressionText = Text.Combine(FilterList, " and "),
    Source = Excel.CurrentWorkbook(){[Name="FieldMetadata"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source, {{"Field", type text}, {"Type", type text}, {"Grouping", Int32.Type}, {"Presentation", Int32.Type}}),
    Custom1 = Table.SelectColumns(#"Changed Type", List.Combine({{"Field"}, Columns})),
    #"Filtered Rows" = Table.SelectRows(Custom1, each Expression.Evaluate(ExpressionText, [_=_]))
        /* The above line is a bit of a mind bender. It lets me apply filters without hard-coding column names. Very useful.
           Credit to http://www.thebiccountant.com/2016/03/08/select-rows-that-have-no-empty-fields-using-expression-evaluate-in-power-bi-and-power-query/
        */
in
    #"Filtered Rows"
in
    Cfg

FieldSortOrder

let
    SortOn = (SortOn as text) as list =>
let
    Source = ImportParameters[FieldMetadata],
    #"Changed Type" = Table.TransformColumnTypes(Source, {{"Field", type text}, {"Grouping", type number}}),
    SelectedSort = Table.SelectColumns(Source, {"Field", SortOn}),
    RenamedSortColumn = Table.RenameColumns(SelectedSort, {{SortOn, "Sort"}}),
    NoNulls = Table.SelectRows(RenamedSortColumn, each ([Sort] <> null)),
    SortedFields = Table.Sort(NoNulls, {{"Sort", Order.Ascending}})[Field]
in
    SortedFields
in
    SortOn

TypeFromString

let
    Type = (TypeName as text) as type =>
let
    TypeNameFix = if TypeName = "Table" then "_Table" else TypeName, // because Table is a reserved word
TypR = [Any=Any.Type,
        Binary=Binary.Type, // The whole list of types I could find.
        ...
        _Table=Table.Type,
        ...
        WebMethod=WebMethod.Type],
    TheType = try Record.Field(TypR, TypeNameFix) otherwise error [Reason="TypeName not found", Message="Parameter was not found among the list of types defined within the TypeFromString function."]
in
    TheType
in
    Type

Extract_Data_Table:

let
    Source = (Table as table) as table =>
let
    #"Filtered Rows" = Table.SelectRows(Table, each ([Kind] = "Table" and ([Item] = "Report Data" or [Item] = "Report_Data"))),
    #"Select Columns" = Table.SelectColumns(#"Filtered Rows", "Data"),
    DataTable = #"Select Columns"[Data]{0}
in
    DataTable
in
    Source

Prep_Data_Table:

let
    PrepParams = (HorizonEnd as date, CategoryFieldTypes as list) as function =>
let
    HorizonEnd = HorizonEnd,
    CategoryFieldTypes = List.Buffer(CategoryFieldTypes),
    Source = (InputTable as table, FileDate as date) as table =>
let
    EndFields = {"As-of Date", "PERIOD", "Actual", "Forecast"} as list,
    PeriodsAsDates = Table.TransformColumnTypes(InputTable, {{"PERIOD", type date}}),
    #"Remove Errors" = Table.RemoveRowsWithErrors(PeriodsAsDates, {"PERIOD"}),
    WithinHorizon = Table.SelectRows(#"Remove Errors", each ([PERIOD] <= HorizonEnd)),
    RenamedVAL = Table.RenameColumns(WithinHorizon, {"VAL", "Forecast"}), // Forecast was originally named VAL
    MovedActual = Table.AddColumn(RenamedVAL, "Actual", each if [PERIOD]=FileDate then (if [Forecast] is null then 0 else [Forecast]) else null),
    IncludeAsOfDate = Table.AddColumn(MovedActual, "As-of Date", each FileDate, Date.Type),
    AppliedCategoryFieldTypes = Table.TransformColumnTypes(IncludeAsOfDate, CategoryFieldTypes),
    TransformedColumns = Table.TransformColumns(AppliedCategoryFieldTypes, {{"{Values}", Text.Trim, type text}, {"Actual", Number.Abs, Currency.Type}, {"Forecast", Number.Abs, Currency.Type}}),
    Sorted = Table.Sort(TransformedColumns, {{"Actual", Order.Descending}}), // Descending order is important because Table.FillDown is more efficient than Table.FillUp
    OutputTable = Table.SelectColumns(Sorted, List.Distinct(List.Combine({List.Transform(CategoryFieldTypes, each List.First(_)), EndFields}))),
    Output = OutputTable
in
    Output
in
    Source
in
    PrepParams

Workbooks:

let
// Import Data
    Source = ImportParameters[Sources],
    #"Changed Type" = Table.TransformColumnTypes(Source, {{"As-of Date", type date}, {"Folder Path", type text}, {"Tab", type text}, {"Load", type logical}}),
    #"Filtered Rows"= Table.SelectRows(#"Changed Type", each ([Load] = true)),
    WorkbookPaths = Table.AddColumn(#"Filtered Rows", "File Path", each [Folder Path] & [File], type text),
    LoadWorkbooks = Table.AddColumn(WorkbookPaths, "Data", each Excel.Workbook(File.Contents([File Path])) meta [#"As-of Date" = [#"As-of Date"]]),
    LoadDataTables = Table.TransformColumns(LoadWorkbooks, {"Data", each Extract_Data_Table(_) meta [#"As-of Date" = Value.Metadata(_)[#"As-of Date"]]}),
    PrepFunc = Prep_Data_Table(List.Max(LoadDataTables[#"As-of Date"]), ImportParameters[CategoryFieldTypes]),
    // This TransformColumns step references the column's list, not the table, so the As-of Date field of the column is out of scope. Use metadata to bring the As-of Date value into the same scope

    PrepDataTables = Table.TransformColumns(LoadDataTables, {"Data", each Table.Buffer(PrepFunc(_, Value.Metadata(_)[#"As-of Date"]))}),
    Output = Table.SelectColumns(PrepDataTables, {"Data", "As-of Date"})
in
    Output

MakeComparison:

let
    CategoryFields = ImportParameters[CategoryFields],
    DataTableList = Workbooks[Data],
    CategoryIndex = Table.AddIndexColumn(Table.Distinct(Table.Combine(List.Transform(DataTableList, each Table.SelectColumns(_, CategoryFields)))), "Index"),
    ListOfDataTablesWithNestedIndexTable = List.Transform(DataTableList, each Table.NestedJoin(_, CategoryFields, CategoryIndex, CategoryFields, "Index", JoinKind.Inner)),
    ListOfIndexedDataTables = List.Transform(ListOfDataTablesWithNestedIndexTable, each Table.TransformColumns(_, {"Index", each List.Single(Table.Column(_, "Index")) as number, type number})),
    Appended = Table.Combine(ListOfIndexedDataTables),
    Merged = Table.Join(CategoryIndex, "Index", Table.SelectColumns(Appended, {"As-of Date", "Actual", "Forecast", "Index"}), "Index"),
    Partitioned = Table.Partition(Merged, "Index", Table.RowCount(CategoryIndex), each _),
    CopiedActuals = List.Transform(Partitioned, each Table.FillDown(_, {"Actual"})),
    ToUnpartition = List.Transform(CopiedActuals, each {List.First(_[Index]), Table.RemoveColumns(_, {"Index"})}),
    UnPartitioned = Table.FromPartitions("Index", ToUnpartition, type number),
    Output = UnPartitioned
in
    Output

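Question: Does this qualify as a closure?

Question: Does it matter whether I recombine the tables with Table.FromPartitions or just Table.Combine? What's the difference?

Question: What does Fast Data Load really do? When does it make a difference, and when doesn't it?

Question: Is there any performance benefit to specifying all the types (x as table, y as list, z as number, etc.)?

Question: I read in some documentation that let..in is just syntactic sugar for records. I've started to prefer records because all the intermediate values are available. Any performance implications?

Question: What's the difference between the number types? Int32.Type vs. Int64.Type?

2 answers:

Answer 0: (score: 0)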

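How big are the XLSX files? I agree with you that we are probably opening the file once per row. Given that XLSX is an archive format in which each sheet is one big file, seeking within the file is going to be very slow.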

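Especially if the total is less than half of your RAM and you are running 64-bit Office, you might be able to significantly improve Power Query performance by calling Table.Buffer on the tables coming from the XLSX files.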
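The effect of buffering can be illustrated with a small Python analogy. This is only a sketch of the general idea (read an expensive source once, then reuse the in-memory copy); it does not model how the Power Query engine actually evaluates queries. `CountingSource` and both lookup functions are hypothetical names invented for the example.

```python
class CountingSource:
    """Hypothetical stand-in for an expensive source such as a large .xlsx file."""

    def __init__(self, rows):
        self._rows = rows
        self.reads = 0  # how many times the source was fully read

    def load(self):
        self.reads += 1
        return list(self._rows)


def lookups_unbuffered(source, keys):
    # Re-reads the whole source for every single lookup.
    return [dict(source.load()).get(key) for key in keys]


def lookups_buffered(source, keys):
    # Reads the source once and keeps the result in memory for all lookups.
    buffered = dict(source.load())
    return [buffered.get(key) for key in keys]


slow = CountingSource([("a", 1), ("b", 2)])
fast = CountingSource([("a", 1), ("b", 2)])
slow_result = lookups_unbuffered(slow, ["a", "b", "c"])
fast_result = lookups_buffered(fast, ["a", "b", "c"])
```

Both calls return the same values, but `slow.reads` grows with the number of lookups while `fast.reads` stays at one. The trade-off is memory: the buffered copy has to fit in RAM, which is why the advice above is conditional on data size and 64-bit Office.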

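Alternatively, if you can somehow convert the XLSX data into a CSV source, you won't have to pay the price of unpacking the XLSX file every time. Or, if you can load the data into a source with column indexes, such as Sql Server, that should really speed up the query. (We usually "query fold" operations to Sql Server, which has more powerful performance heuristics in its query engine than we have built into Power Query.) You could use the Power Pivot engine instead, but I'm not very familiar with it.

Answer 1: (score: 0)

A separate performance optimization: we have implemented Table.FillUp like this: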

table => Reverse(FillDown(Reverse(table)))

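That is pretty bad from a performance perspective. If you can do the FillUp operation once, save the data, and then query the new data, that will help query performance.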
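To make the Reverse(FillDown(Reverse(table))) identity concrete, here is a small Python sketch on a single column, with `None` standing in for null. The function names are illustrative. The sketch only demonstrates that the two formulations give the same result; it makes no attempt to reproduce the performance characteristics of Power Query, where reversing a large table is the costly part.

```python
def fill_down(values):
    """Replace each None with the nearest non-None value above it."""
    out, last = [], None
    for value in values:
        if value is not None:
            last = value
        out.append(last)
    return out


def fill_up_via_reverse(values):
    """Fill up expressed as reverse, fill down, reverse (the form quoted above)."""
    return fill_down(values[::-1])[::-1]


def fill_up_direct(values):
    """Fill up in one backward pass, without building reversed copies."""
    out = list(values)
    nearest_below = None
    for i in range(len(out) - 1, -1, -1):
        if out[i] is not None:
            nearest_below = out[i]
        else:
            out[i] = nearest_below
    return out


column = [None, 105, None, None, 43, None]
filled_down = fill_down(column)
filled_up = fill_up_via_reverse(column)
```

Sorting the data so that a plain fill down does the job, as the question's Prep_Data_Table step does, avoids the two reversals altogether.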