Multiple counts and medians in a DataFrame

Time: 2016-07-15 21:21:25

Tags: python datetime pandas median median-of-medians

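I am trying to perform several operations at once in a single program. I have a DataFrame of dates whose start and end I do not know, and I want to find: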

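  1. The total number of days in the dataset
  2. The total number of hours
  3. The median of count
  4. A separate output file with the median for each day/date
  5. If possible, the median of medians, computed in the simplest way

Input: a few lines from a large file (several GB in size):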

    2004-01-05,16:00:00,17:00:00,Mon,10766,656
    2004-01-05,17:00:00,18:00:00,Mon,12223,670
    2004-01-05,18:00:00,19:00:00,Mon,12646,710
    2004-01-05,19:00:00,20:00:00,Mon,19269,778
    2004-01-05,20:00:00,21:00:00,Mon,20504,792
    2004-01-05,21:00:00,22:00:00,Mon,16553,783
    2004-01-05,22:00:00,23:00:00,Mon,18944,790
    2004-01-05,23:00:00,00:00:00,Mon,17534,750
    2004-01-06,00:00:00,01:00:00,Tue,17262,747
    2004-01-06,01:00:00,02:00:00,Tue,19072,777
    2004-01-06,02:00:00,03:00:00,Tue,18275,785
    2004-01-06,03:00:00,04:00:00,Tue,13589,757
    2004-01-06,04:00:00,05:00:00,Tue,16053,735
    

The start and end dates are unknown.

Edit: Expected output 1, with only a single row of results:

    days,hours,median,median-of-median
    2,17262,13,17398
    

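Median-of-Median is the median of the median column in output 2.

Expected output 2, which holds the median for each date and is used to find the median of medians: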

    date,median
    2004-01-05,17534
    2004-01-06,17262
    

Code:

    import pandas as pd 
    from datetime import datetime
    
    df = pd.read_csv('one_hour.csv')
    df.columns = ['date', 'startTime', 'endTime', 'day', 'count', 'unique']
    
    date_count = df.count(['date'])
    all_median = df.median(['count'])
    all_hours = df.count(['startTime'])
    med_med = df.groupby(['date','count']).median()
    
    print date_count
    print all_median
    print all_hours
    
    stats = ['date_count', 'all_median', 'all_hours', 'median-of-median']
    stats.to_csv('stats_all.csv', index=False)
    
    med_med.to_csv('med_day.csv', index=False, header=False)
    

Clearly, the code does not produce the expected results.

The error is shown below.

Error:

    Traceback (most recent call last):
      File "day_median.py", line 8, in <module>
        all_median = df.median(['count'])
      File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 5310, in stat_func
        numeric_only=numeric_only)
      File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 4760, in _reduce
        axis = self._get_axis_number(axis)
      File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 308, in _get_axis_number
        axis = self._AXIS_ALIASES.get(axis, axis)
    TypeError: unhashable type: 'list'
    

1 Answer:

Answer 0 (score: 3)

IIUC, it may help to change:

    date_count = df.count(['date'])
    all_median = df.median(['count'])
    all_hours = df.count(['startTime'])

to:

    date_count = df['date'].count()
    all_median = df['count'].median()
    all_hours = df['startTime'].count()

    print(date_count)
    print(all_median)
    print(all_hours)
    13
    17262.0
    13

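This applies if you need statistics from the date, count, and startTime columns. The first positional argument of DataFrame.count and DataFrame.median is axis, not a list of column names, which is why df.median(['count']) fails with TypeError: unhashable type: 'list'.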
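A minimal, self-contained sketch of the corrected calls, using the thirteen sample rows from the question in place of one_hour.csv. It assumes the file has no header row (as the sample suggests), so it passes header=None together with names; with the default header handling, the first data row would be consumed as column labels. The SAMPLE string and the io.StringIO wrapper are illustrative stand-ins only.

```python
import io

import pandas as pd

# Sample rows copied from the question; a stand-in for 'one_hour.csv'.
SAMPLE = """\
2004-01-05,16:00:00,17:00:00,Mon,10766,656
2004-01-05,17:00:00,18:00:00,Mon,12223,670
2004-01-05,18:00:00,19:00:00,Mon,12646,710
2004-01-05,19:00:00,20:00:00,Mon,19269,778
2004-01-05,20:00:00,21:00:00,Mon,20504,792
2004-01-05,21:00:00,22:00:00,Mon,16553,783
2004-01-05,22:00:00,23:00:00,Mon,18944,790
2004-01-05,23:00:00,00:00:00,Mon,17534,750
2004-01-06,00:00:00,01:00:00,Tue,17262,747
2004-01-06,01:00:00,02:00:00,Tue,19072,777
2004-01-06,02:00:00,03:00:00,Tue,18275,785
2004-01-06,03:00:00,04:00:00,Tue,13589,757
2004-01-06,04:00:00,05:00:00,Tue,16053,735
"""

cols = ['date', 'startTime', 'endTime', 'day', 'count', 'unique']
# Assumption: the file has no header row, so the columns are named explicitly.
df = pd.read_csv(io.StringIO(SAMPLE), header=None, names=cols)

# Select the column first, then reduce it to a single value.
all_hours = df['startTime'].count()   # non-null rows, one per hourly record
all_median = df['count'].median()     # median of the count column
date_count = df['date'].nunique()     # number of distinct dates

print(date_count, all_median, all_hours)
```

For the real file, replacing io.StringIO(SAMPLE) with the file name should be the only change needed, provided the no-header assumption holds.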

Edit by comment:

If you need to count the unique values of a column, use nunique:

    date_count = df['date'].nunique()
    print(date_count)
    2

To collect the results in a DataFrame named stats:

    cols = ['date_count', 'all_median', 'all_hours']
    stats = pd.DataFrame([[date_count, all_median, all_hours]], columns=cols)
    print(stats)
       date_count  all_median  all_hours
    0           2     17262.0         13
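
The answer stops short of the per-day medians and the median of medians that the question asks for. One possible way to get them, offered as a sketch and not as part of the original answer, is to group by date only (the question's code groups by both date and count) and then take the median of the resulting per-day values. The sample rows again stand in for the real file, the no-header layout is an assumption, and usecols is an optional way to reduce memory use on a multi-GB input. On these thirteen rows, 2004-01-05 has eight records, so its median is 17043.5 and the median of medians is 17152.75; the figures in the question's expected output appear to be illustrative.

```python
import io

import pandas as pd

# Sample rows copied from the question; a stand-in for 'one_hour.csv'.
SAMPLE = """\
2004-01-05,16:00:00,17:00:00,Mon,10766,656
2004-01-05,17:00:00,18:00:00,Mon,12223,670
2004-01-05,18:00:00,19:00:00,Mon,12646,710
2004-01-05,19:00:00,20:00:00,Mon,19269,778
2004-01-05,20:00:00,21:00:00,Mon,20504,792
2004-01-05,21:00:00,22:00:00,Mon,16553,783
2004-01-05,22:00:00,23:00:00,Mon,18944,790
2004-01-05,23:00:00,00:00:00,Mon,17534,750
2004-01-06,00:00:00,01:00:00,Tue,17262,747
2004-01-06,01:00:00,02:00:00,Tue,19072,777
2004-01-06,02:00:00,03:00:00,Tue,18275,785
2004-01-06,03:00:00,04:00:00,Tue,13589,757
2004-01-06,04:00:00,05:00:00,Tue,16053,735
"""

cols = ['date', 'startTime', 'endTime', 'day', 'count', 'unique']
# Assumption: no header row. usecols keeps only the columns used below.
df = pd.read_csv(io.StringIO(SAMPLE), header=None, names=cols,
                 usecols=['date', 'startTime', 'count'])

# Output 2: one median per date (group by date only, not by count).
per_day = df.groupby('date')['count'].median().reset_index(name='median')

# Output 1: a single summary row.
stats = pd.DataFrame([{
    'days': df['date'].nunique(),
    'hours': df['startTime'].count(),
    'median': df['count'].median(),
    'median-of-median': per_day['median'].median(),
}])

# to_csv() without a path returns the CSV text; pass a file name such as
# 'med_day.csv' or 'stats_all.csv' to write to disk instead.
print(per_day.to_csv(index=False))
print(stats.to_csv(index=False))
```

Note that the median of per-day medians is generally not the same as the overall median (17152.75 versus 17262.0 here), so both are kept as separate columns.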