Dynamically calculating the mean of float values using dask and groupby?

Time: 2019-04-04 10:49:53

Tags: python pandas group-by dask

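I am trying to dynamically calculate the mean of all float64 columns. The end user should be able to filter on any of the category columns from a chart and get the corresponding means for each instance. To achieve this, I wrote the Python script below using dask and its groupby function.

However...

When I perform the groupby on the columns listed below, the object columns disappear from the output CSV file, apparently because the columns of type float64 are being aggregated and averaged. In this example I read the dataframe with dask (pandas is not an option because of its high memory usage) and save the output as a CSV file.

The dtypes of the input CSV columns are: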

Dtype                   Columns
String (eg. 2018-P01/02) Time Period
Integer Journey Code
Object  Journey Name
Object  Route
Object  Route Name
Object  Area
Object  Area Name
Object  Fuel Type
Float64 Fuel Load
Float64 Mileage
Float64 Odometer Reading
Float64 Delay Minutes

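The code I use to read and save the CSV and to compute the means is:

import numpy as np
import dask.dataframe as dd
import pandas as pd

filename = "H:\\testing3.csv"           # output file
data = dd.read_csv("H:\\testing2.csv")  # input file, read lazily by dask

# Category columns to group by
cols = ['Time Period', 'Journey Code', 'Journey Name', 'Route',
        'Route Name', 'Area', 'Area Name', 'Fuel Type']

# Mean of every remaining (float64) column within each group
x = data.groupby(cols).aggregate(['mean'])
x.to_csv(filename, index=False)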

An example of the original dataset:

Time Period Journey Code    Route   Route Name      Area    Area Name
2018-P01    261803          High    France-Germany   WE       West
2018-P01-02 325429          High    France-Germany   EA      Eastern
2018-P01-02 359343          High    France-Germany   WCS    West Coast South
2018-P01-02 359343          High    France-Germany   WE     West
2018-P01-03 370697          High    France-Germany   WE     West
2018-P01-04 392535          High    France-Germany   EA     Eastern
2018-P01-04 394752          High    France-Germany   WCS    West Coast South
2018-P01-05 408713          High    France-Germany   WE     West

Fuel Type   Fuel Load   Mileage Odometer Reading    Delay Minutes
Diesel         165         6        14567.1               2
Diesel         210        12        98765.8               0
Diesel         210        5         23406.2               0
Diesel         130        10        54418.8               0
Diesel         152.5      37        58838.35              2
Diesel         142        140       63257.9              37.1194012
Diesel         131.5      120       67677.45              0
Diesel         121        13        72097                1.25

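Why do the object columns disappear from the generated CSV file, and how can I produce a result like the one shown below?

Desired output (example for rows 2 and 3): the starting row has no mean, but every later float64 value should contain a mean (of the current value and the previous value). I split out each instance separately to get dynamic results, but any ideas are welcome.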

    Time Period Journey Code    Route   Route Name  Area    Area Name

    2018-P01-02                 
                  325429                
                                 High           
                                        France-Germany      
                                                        EA  
                                                              Eastern
    …….                 

                   359343        High   France-Germany  WCS   West Coast South

Fuel Type   Fuel Load   Mileage Odometer Reading    Delay Minutes

Diesel         210        12        98765.8               0
Diesel         210        12        98765.8               0
Diesel         210        12        98765.8               0
Diesel         210        12        98765.8               0
Diesel         210        12        98765.8               0
......

Diesel         170        8.5       23406.2              NaN
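
One possible reading of "mean of the current value and the previous value" is a rolling mean over a window of two rows. The following is only a minimal sketch of that reading, written in plain pandas on the first four sample rows rather than in dask. The names `float_cols`, `pairwise_mean` and `result` are made up for illustration, and whether a rolling window is what the desired output actually calls for is an assumption.

```python
import pandas as pd

# First four sample rows, reduced to one integer and two float64 columns.
df = pd.DataFrame({
    "Journey Code": [261803, 325429, 359343, 359343],
    "Fuel Load": [165.0, 210.0, 170.0, 130.0],
    "Mileage": [6.0, 12.0, 8.5, 10.0],
})

# Pick out the float64 columns only; "Journey Code" is int64 and is skipped.
float_cols = df.select_dtypes(include="float64").columns

# Window of 2 = mean of the current and the previous row.
# The first row has no previous row, so it becomes NaN.
pairwise_mean = df[float_cols].rolling(window=2).mean()

# Put the untouched columns back next to the averaged ones.
result = df.drop(columns=float_cols).join(pairwise_mean)
print(result)
```

With these four rows, "Fuel Load" becomes NaN, 187.5, 190.0, 150.0. In this sketch the window runs over the whole frame in row order; it does not restart per category.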
  

Edit: added a sample dataset in df.head(10).to_dict() format

{'Time Period': {0: '2018-P01', 1: '2018-P01-02', 2: '2018-P01-02', 3: '2018-P01-02', 4: '2018-P01-03', 5: '2018-P01-04', 6: '2018-P01-04', 7: '2018-P01-05', 8: '2018-P01-06', 9: '2018-P01-07'}, 'Odometer Reading': {0: 14567.1, 1: 98765.8, 2: 23406.2, 3: 54418.8, 4: 58838.35, 5: 63257.9, 6: 67677.45, 7: 72097.0, 8: 89221.0, 9: 89221.0}, 'Journey Code': {0: 261803, 1: 325429, 2: 359343, 3: 359343, 4: 370697, 5: 392535, 6: 394752, 7: 408713, 8: 408714, 9: 408715}, 'Fuel Type': {0: 'Diesel', 1: 'Diesel', 2: 'Diesel', 3: 'Diesel', 4: 'Diesel', 5: 'Diesel', 6: 'Diesel', 7: 'Diesel', 8: 'Diesel', 9: 'Diesel'}, 'Route Name': {0: 'France-Germany', 1: 'France-Germany', 2: 'France-Germany', 3: 'France-Germany', 4: 'France-Germany', 5: 'France-Germany', 6: 'France-Germany', 7: 'France-Germany', 8: 'France-Germany', 9: 'France-Germany'}, 'Area': {0: 'WE', 1: 'EA', 2: 'WCS', 3: 'WE', 4: 'WE', 5: 'EA', 6: 'WCS', 7: 'WE', 8: 'WE', 9: 'WE'}, 'Route': {0: 'High', 1: 'High', 2: 'High', 3: 'High', 4: 'High', 5: 'High', 6: 'High', 7: 'High', 8: 'High', 9: 'High'}, 'Fuel Load': {0: 165.0, 1: 210.0, 2: 170.0, 3: 130.0, 4: 152.5, 5: 142.0, 6: 131.5, 7: 121.0, 8: 121.0, 9: 121.0}, 'Delay Minutes': {0: 2.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 2.0, 5: 37.119401200000006, 6: 0.0, 7: 1.25, 8: 2.56, 9: 2.56}, 'Mileage': {0: 6.0, 1: 12.0, 2: 8.5, 3: 10.0, 4: 37.0, 5: 140.0, 6: 120.0, 7: 13.0, 8: 13.0, 9: 13.0}, 'Area Name': {0: 'West', 1: 'Eastern', 2: 'West Coast South', 3: 'West', 4: 'West', 5: 'Eastern', 6: 'West Coast South', 7: 'West', 8: 'West', 9: 'West'}}
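
A possible explanation for the missing columns can be reproduced in plain pandas on a few of the sample rows. After a groupby, the grouping keys are moved into the index of the result, and `to_csv(..., index=False)` leaves the index out of the file. The sketch below shows this, plus `reset_index()` as one way to turn the keys back into ordinary columns before writing. It uses pandas rather than dask and only two of the eight grouping columns; the names `keys`, `means`, `without_keys` and `with_keys` are illustrative. That dask behaves the same way here is an assumption based on its groupby mirroring the pandas API.

```python
import pandas as pd

# A few sample rows, reduced to two key columns and two float64 columns.
df = pd.DataFrame({
    "Journey Code": [359343, 359343, 325429],
    "Area": ["WCS", "WE", "EA"],
    "Fuel Load": [170.0, 130.0, 210.0],
    "Mileage": [8.5, 10.0, 12.0],
})
keys = ["Journey Code", "Area"]

# After the groupby, the key columns live in the index of the result.
means = df.groupby(keys).mean()

# index=False drops the index, and with it the grouping keys.
without_keys = means.to_csv(index=False)

# reset_index() moves the keys back into regular columns first.
with_keys = means.reset_index().to_csv(index=False)

print(without_keys.splitlines()[0])  # Fuel Load,Mileage
print(with_keys.splitlines()[0])     # Journey Code,Area,Fuel Load,Mileage
```

Writing the file with the index kept (leaving out `index=False`) would be another way to retain the keys.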

0 answers:

No answers yet