Sum function not working correctly with a DataFrame (Python)

Time: 2018-05-01 17:15:20

Tags: python pandas date dataframe sum

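I am working with a large Excel file that has four columns, but I only need two of them: DATE and HPCP. The goal of this program is to convert the dates to date objects, remove the duplicate dates, and sum the HPCP values of the duplicates. I feel this code should work, but the output looks very wrong. The code successfully converts the dates to date objects and removes the duplicates, but it does not seem to sum correctly. Any help would be greatly appreciated.

Link to the Excel file: https://drive.google.com/open?id=1P5-k9Zyz8iFwx6Y-9yhnRozGGSvqpXLz

Some sample rows from the Excel file: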

      STATION           STATION_NAME         DATE        HPCP
COOP:305801 NY CITY  CENTRAL PARK NY US  20000101 01:00  0
COOP:305801 NY CITY  CENTRAL PARK NY US  20000104 15:00  0
COOP:305801 NY CITY  CENTRAL PARK NY US  20000104 16:00  0.01
COOP:305801 NY CITY  CENTRAL PARK NY US  20000104 17:00  0.03
COOP:305801 NY CITY  CENTRAL PARK NY US  20000104 18:00  0.04
COOP:305801 NY CITY  CENTRAL PARK NY US  20000104 19:00  0.12
COOP:305801 NY CITY  CENTRAL PARK NY US  20000104 20:00  0.17
COOP:305801 NY CITY  CENTRAL PARK NY US  20000104 21:00  0.13
COOP:305801 NY CITY  CENTRAL PARK NY US  20000104 22:00  0.04
COOP:305801 NY CITY  CENTRAL PARK NY US  20000104 23:00  0.09
COOP:305801 NY CITY  CENTRAL PARK NY US  20000105 00:00  0.07
COOP:305801 NY CITY  CENTRAL PARK NY US  20000105 01:00  0
COOP:305801 NY CITY  CENTRAL PARK NY US  20000109 21:00  0.01
COOP:305801 NY CITY  CENTRAL PARK NY US  20000109 22:00  0
COOP:305801 NY CITY  CENTRAL PARK NY US  20000110 00:00  0.01
COOP:305801 NY CITY  CENTRAL PARK NY US  20000110 13:00  0.15
COOP:305801 NY CITY  CENTRAL PARK NY US  20000110 14:00  0.29
COOP:305801 NY CITY  CENTRAL PARK NY US  20000110 15:00  0.24
COOP:305801 NY CITY  CENTRAL PARK NY US  20000110 16:00  0.15
COOP:305801 NY CITY  CENTRAL PARK NY US  20000110 17:00  0.01
COOP:305801 NY CITY  CENTRAL PARK NY US  20000113 08:00  0
COOP:305801 NY CITY  CENTRAL PARK NY US  20000113 09:00  0.01
COOP:305801 NY CITY  CENTRAL PARK NY US  20000113 10:00  0.02
COOP:305801 NY CITY  CENTRAL PARK NY US  20000113 15:00  0.01
COOP:305801 NY CITY  CENTRAL PARK NY US  20000113 16:00  0.01
COOP:305801 NY CITY  CENTRAL PARK NY US  20000113 17:00  0
COOP:305801 NY CITY  CENTRAL PARK NY US  20000120 07:00  0
COOP:305801 NY CITY  CENTRAL PARK NY US  20000120 08:00  0
COOP:305801 NY CITY  CENTRAL PARK NY US  20000120 09:00  0

Code:

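import sys
import pandas as pd

# Read the file named on the command line and keep only the two needed columns.
data = pd.read_csv(sys.argv[1])
data = data[['DATE', 'HPCP']]

# Parse the DATE strings into timestamps.
data['DATE'] = pd.to_datetime(data['DATE'])

# Reduce each timestamp to its calendar date, row by row.
for index, row in data.iterrows():
    print(index)
    data.loc[index, 'DATE'] = data.loc[index, 'DATE'].date()

# Sum HPCP over all rows that share the same date.
data = data.groupby(['DATE'], as_index=False).sum()

print(data)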

Output:

        DATE      HPCP
0    2000-01-01  11999.88
1    2000-01-03      0.00
2    2000-01-04   1002.97
3    2000-01-05      1.25
4    2000-01-09   1000.01
5    2000-01-10      4.72
6    2000-01-11      0.00
7    2000-01-13      0.17
8    2000-01-16      0.00
9    2000-01-20   1000.11
10   2000-01-21      0.12
        ...       ...
2871 2013-12-17      0.66
2872 2013-12-21      0.01
2873 2013-12-22      0.04
2874 2013-12-23      2.06
2875 2013-12-24      0.00
2876 2013-12-26      0.00
2877 2013-12-29      4.90
2878 2013-12-30      0.00
2879 2013-12-31      0.00
2880 2014-01-01   3999.96

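2 answers:

Answer 0: (score: 0)

No, those large values are correct. I imported your data file into Excel and created a pivot table with the date of each row as the row label and the sum of HPCP as the value. Here are the first few results: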

Row Labels  Sum of HPCP
1/1/2000    11999.88
1/3/2000    0
1/4/2000    1002.97
1/5/2000    1.25
1/9/2000    1000.01
...

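The good news is that your code is fine.

To head off a long discussion in the comments on your question, I'll just say that you need to distinguish "surprising" from "wrong". These results are "surprising", given that the values in the HPCP column are usually small, but they are not "wrong". Maybe you want a different metric (mean? max?), or maybe you want to do some pre-filtering, but for the data you have given and the description of what you want to do, your code and its results are correct, even if the output values are unexpected.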
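If a different metric is what you are after, here is a minimal sketch of computing several per-day statistics at once. The `sample` frame is a hypothetical hand-made subset in the same DATE/HPCP layout as the question, not the real file, and the explicit `format` string is an assumption based on the sample rows shown above.

```python
import pandas as pd

# Hypothetical subset in the same layout as the question's file.
sample = pd.DataFrame({
    "DATE": ["20000104 15:00", "20000104 16:00", "20000104 17:00",
             "20000105 00:00", "20000105 01:00"],
    "HPCP": [0.0, 0.01, 0.03, 0.07, 0.0],
})

# Parse the timestamps with an explicit format, then keep only the calendar date.
sample["DATE"] = pd.to_datetime(sample["DATE"], format="%Y%m%d %H:%M").dt.date

# Compute sum, mean and max of HPCP for each date, side by side.
daily = sample.groupby("DATE")["HPCP"].agg(["sum", "mean", "max"]).reset_index()
print(daily)
```

Using `.dt.date` on the whole column also avoids the row-by-row loop, which can be slow on a large file.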

Answer 1: (score: 0)

There are a number of rows in the .csv file you linked to with an HPCP of 999.99. Your sum is working correctly for this data.
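If those 999.99 entries turn out to be a placeholder for missing readings (that is an assumption on my part; check the documentation of the data source), one possible way to keep them out of the daily totals is sketched below. The `sample` frame and the `MISSING` constant are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Assumed placeholder value for a missing reading (not confirmed by the source).
MISSING = 999.99

# Hypothetical subset in the same layout as the question's file.
sample = pd.DataFrame({
    "DATE": ["20000104 15:00", "20000104 16:00", "20000104 17:00",
             "20000109 21:00", "20000109 22:00"],
    "HPCP": [999.99, 0.01, 0.03, 0.01, 999.99],
})
sample["DATE"] = pd.to_datetime(sample["DATE"], format="%Y%m%d %H:%M").dt.date

# Count how many rows carry the placeholder value.
n_placeholder = int((sample["HPCP"] == MISSING).sum())
print(n_placeholder)

# Drop the placeholder rows, then sum HPCP per date.
clean = sample[sample["HPCP"] != MISSING]
daily = clean.groupby("DATE", as_index=False)["HPCP"].sum()
print(daily)
```

Counting the placeholder rows first is a quick way to see how much of each large daily total comes from them.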