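I am trying to dynamically compute the mean of all float64 columns. The end user should be able to filter on any of the categorical columns from a chart and get the corresponding means for each instance. To do this, I wrote the Python script below using dask and its groupby function.
However...
When I group by the columns below, the object columns disappear from the output CSV file, because the sum and mean are computed only on the float64 columns. In this example I read the dataframe with dask (pandas is not advisable because of its high memory usage) and save the output as a CSV file.
The dtypes of the input CSV columns are: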
Dtype Columns
String (eg. 2018-P01/02) Time Period
Integer Journey Code
Object Journey Name
Object Route
Object Route Name
Object Area
Object Area Name
Object Fuel Type
Float64 Fuel Load
Float64 Mileage
Float64 Odometer Reading
Float64 Delay Minutes
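Since the goal is to average every float64 column without naming each one, the columns can be split by dtype. This is a minimal sketch, assuming pandas on a small hypothetical sample that uses a subset of the columns listed above; dask.dataframe also provides select_dtypes, but the sample here is kept in pandas so that it runs on its own.

```python
import pandas as pd

# Hypothetical sample with a subset of the columns listed above.
df = pd.DataFrame({
    "Time Period": ["2018-P01", "2018-P01-02"],
    "Journey Code": [261803, 325429],
    "Area": ["WE", "EA"],
    "Fuel Load": [165.0, 210.0],
    "Mileage": [6.0, 12.0],
})

# Pick the float64 columns by dtype instead of hard-coding their names.
float_cols = df.select_dtypes(include="float64").columns.tolist()

# Everything else is a candidate grouping/filter column.
key_cols = [c for c in df.columns if c not in float_cols]

print(float_cols)  # ['Fuel Load', 'Mileage']
print(key_cols)    # ['Time Period', 'Journey Code', 'Area']
```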
The code I use to read and save the CSV and to compute the mean is:
import numpy as np
import dask.dataframe as dd
import pandas as pd
filename = "H:\\testing3.csv"
data = dd.read_csv("H:\\testing2.csv")
cols=['Time Period','Journey Code','Journey Name','Route',
'Route Name','Area','Area Name','Fuel Type']
x = data.groupby(cols).agg(['mean'])
x.to_csv(filename, index = False)
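One likely explanation, offered as an assumption rather than a confirmed diagnosis: after a groupby, the grouping columns become the index of the result, so writing with index=False drops them. The sketch below reproduces this in pandas on a small hypothetical sample and shows one way to keep the keys by calling reset_index before writing. The flattened names such as "Fuel Load_mean" are only an illustrative choice.

```python
import pandas as pd

# Hypothetical sample: two object columns and one float64 column.
df = pd.DataFrame({
    "Area": ["WE", "EA", "WE"],
    "Fuel Type": ["Diesel", "Diesel", "Diesel"],
    "Fuel Load": [165.0, 210.0, 130.0],
})

means = df.groupby(["Area", "Fuel Type"]).agg(["mean"])

# The grouping keys now live in the index, not in regular columns.
print(list(means.index.names))  # ['Area', 'Fuel Type']

# With index=False the index is not written, so the keys are lost.
without_keys = means.to_csv(index=False)

# Flatten the two-level column header, then move the keys back into columns.
flat = means.copy()
flat.columns = [f"{col}_{stat}" for col, stat in flat.columns]
with_keys = flat.reset_index().to_csv(index=False)

print(with_keys.splitlines()[0])  # Area,Fuel Type,Fuel Load_mean
```

dask.dataframe exposes groupby(...).agg and reset_index as well, so the same idea may carry over to the script above, though I have not verified it against the full dataset.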
A sample of the original dataset:
Time Period Journey Code Route Route Name Area Area Name
2018-P01 261803 High France-Germany WE West
2018-P01-02 325429 High France-Germany EA Eastern
2018-P01-02 359343 High France-Germany WCS West Coast South
2018-P01-02 359343 High France-Germany WE West
2018-P01-03 370697 High France-Germany WE West
2018-P01-04 392535 High France-Germany EA Eastern
2018-P01-04 394752 High France-Germany WCS West Coast South
2018-P01-05 408713 High France-Germany WE West
Fuel Type Fuel Load Mileage Odometer Reading Delay Minutes
Diesel 165 6 14567.1 2
Diesel 210 12 98765.8 0
Diesel 210 5 23406.2 0
Diesel 130 10 54418.8 0
Diesel 152.5 37 58838.35 2
Diesel 142 140 63257.9 37.1194012
Diesel 131.5 120 67677.45 0
Diesel 121 13 72097 1.25
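Why do the object columns disappear from the resulting CSV file, and how can I produce a result like the one shown below?
Desired output (rows 2 and 3 as an example): the first row has no mean, but every later float64 value should hold the mean of the current value and the previous value. I split each instance out separately to get dynamic results, but any ideas are welcome.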
Time Period Journey Code Route Route Name Area Area Name
2018-P01-02 325429 High France-Germany EA Eastern
…….
359343 High France-Germany WCS West Coast South
Fuel Type Fuel Load Mileage Odometer Reading Delay Minutes
Diesel 210 12 98765.8 0
Diesel 210 12 98765.8 0
Diesel 210 12 98765.8 0
Diesel 210 12 98765.8 0
Diesel 210 12 98765.8 0
......
Diesel 170 8.5 23406.2 NaN
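One reading of "the mean of the current value and the previous value" is a rolling mean over a window of two rows, applied only to the float64 columns so that the object columns stay in place. This is a minimal sketch under that assumption, using pandas on a small hypothetical sample; it is not meant to reproduce the exact figures in the table above.

```python
import pandas as pd

# Hypothetical sample with two key columns and two float64 columns.
df = pd.DataFrame({
    "Journey Code": [325429, 359343, 359343],
    "Area": ["EA", "WCS", "WE"],
    "Fuel Load": [210.0, 170.0, 130.0],
    "Mileage": [12.0, 8.5, 10.0],
})

float_cols = ["Fuel Load", "Mileage"]

# Average each value with the one in the previous row.
# The first row has no predecessor, so it becomes NaN.
averaged = df.copy()
averaged[float_cols] = df[float_cols].rolling(window=2).mean()

print(averaged)
```

Because only the float64 columns are overwritten, the key columns are still present when the frame is written to CSV.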
Edit: added a sample dataset in df.head(10).to_dict() format
{'Time Period': {0: '2018-P01', 1: '2018-P01-02', 2: '2018-P01-02', 3: '2018-P01-02', 4: '2018-P01-03', 5: '2018-P01-04', 6: '2018-P01-04', 7: '2018-P01-05', 8: '2018-P01-06', 9: '2018-P01-07'}, 'Odometer Reading': {0: 14567.1, 1: 98765.8, 2: 23406.2, 3: 54418.8, 4: 58838.35, 5: 63257.9, 6: 67677.45, 7: 72097.0, 8: 89221.0, 9: 89221.0}, 'Journey Code': {0: 261803, 1: 325429, 2: 359343, 3: 359343, 4: 370697, 5: 392535, 6: 394752, 7: 408713, 8: 408714, 9: 408715}, 'Fuel Type': {0: 'Diesel', 1: 'Diesel', 2: 'Diesel', 3: 'Diesel', 4: 'Diesel', 5: 'Diesel', 6: 'Diesel', 7: 'Diesel', 8: 'Diesel', 9: 'Diesel'}, 'Route Name': {0: 'France-Germany', 1: 'France-Germany', 2: 'France-Germany', 3: 'France-Germany', 4: 'France-Germany', 5: 'France-Germany', 6: 'France-Germany', 7: 'France-Germany', 8: 'France-Germany', 9: 'France-Germany'}, 'Area': {0: 'WE', 1: 'EA', 2: 'WCS', 3: 'WE', 4: 'WE', 5: 'EA', 6: 'WCS', 7: 'WE', 8: 'WE', 9: 'WE'}, 'Route': {0: 'High', 1: 'High', 2: 'High', 3: 'High', 4: 'High', 5: 'High', 6: 'High', 7: 'High', 8: 'High', 9: 'High'}, 'Fuel Load': {0: 165.0, 1: 210.0, 2: 170.0, 3: 130.0, 4: 152.5, 5: 142.0, 6: 131.5, 7: 121.0, 8: 121.0, 9: 121.0}, 'Delay Minutes': {0: 2.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 2.0, 5: 37.119401200000006, 6: 0.0, 7: 1.25, 8: 2.56, 9: 2.56}, 'Mileage': {0: 6.0, 1: 12.0, 2: 8.5, 3: 10.0, 4: 37.0, 5: 140.0, 6: 120.0, 7: 13.0, 8: 13.0, 9: 13.0}, 'Area Name': {0: 'West', 1: 'Eastern', 2: 'West Coast South', 3: 'West', 4: 'West', 5: 'Eastern', 6: 'West Coast South', 7: 'West', 8: 'West', 9: 'West'}}