Separate by threshold

Time: 2016-08-14 00:39:07

Tags: python csv pandas numpy

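I am trying to use the c_med value from input 1 as a threshold and split the rows of input 2 into two separate outputs, those above the threshold and those below it. The comparison is on column c_total, and the results should be written to above.csv and below.csv.

I then want to read above.csv as input and categorize its rows by percentage, as the pure Python code under point 2 does.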

Input: 1

date_count,all_hours,c_min,c_max,c_med,c_med_med,u_min,u_max,u_med,u_med_med
2,12,2309,19072,12515,13131,254,785,686,751

Input: 2 ['date','startTime','endTime','day','c_total','u_total']

2004-01-05,22:00:00,23:00:00,Mon,18944,790
2004-01-05,23:00:00,00:00:00,Mon,17534,750
2004-01-06,00:00:00,01:00:00,Tue,17262,747
2004-01-06,01:00:00,02:00:00,Tue,19072,777
2004-01-06,02:00:00,03:00:00,Tue,18275,785
2004-01-06,03:00:00,04:00:00,Tue,13589,757
2004-01-06,04:00:00,05:00:00,Tue,16053,735
2004-01-06,05:00:00,06:00:00,Tue,11440,636
2004-01-06,06:00:00,07:00:00,Tue,5972,513
2004-01-06,07:00:00,08:00:00,Tue,3424,382
2004-01-06,08:00:00,09:00:00,Tue,2696,303
2004-01-06,09:00:00,10:00:00,Tue,2350,262
2004-01-06,10:00:00,11:00:00,Tue,2309,254
  1. I am trying to read the threshold value, c_med, from the other input CSV.

    I am getting the following error:

    Traceback (most recent call last):
      File "class_med.py", line 10, in <module>
        above_median = df_data['c_total'] > df_med['c_med']
      File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 735, in wrapper
        raise ValueError('Series lengths must match to compare')
    ValueError: Series lengths must match to compare
    
    1. Filter the separated data on column c_total using percentages. A pure Python solution is given below, but I am looking for a pandas solution, like Reference one.

      for row in csv.reader(inp):
          if int(row[1]) < (.20 * max_value):
              val = 'viewers'
          elif int(row[1]) >= (0.20 * max_value) and int(row[1]) < (0.40 * max_value):
              val = 'event based'
          elif int(row[1]) >= (0.40 * max_value) and int(row[1]) < (0.60 * max_value):
              val = 'situational'
          elif int(row[1]) >= (0.60 * max_value) and int(row[1]) < (0.80 * max_value):
              val = 'active'
          else:
              val = 'highly active'
          writer.writerow([row[0], row[1], val])

    2. Code:

      import pandas as pd 
      import numpy as np
      
      df_med = pd.read_csv('stat_result.csv')
      df_med.columns = ['date_count', 'all_hours', 'c_min', 'c_max', 'c_med', 'c_med_med', 'u_min', 'u_max', 'u_med', 'u_med_med']
      
      df_data = pd.read_csv('mini_out.csv')
      df_data.columns = ['date', 'startTime', 'endTime', 'day', 'c_total', 'u_total']
      
      above = df_data['c_total'] > df_med['c_med']
      
      #print above_median
      
      above.to_csv('above.csv', index=None, header=None)
      
      df_above = pd.read_csv('above.csv')
      df_above.columns = ['date', 'startTime', 'endTime', 'day', 'c_total', 'u_total']
      
      #Percentage block should come here
      

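      Edit: qcut is the simplest solution when a single column is involved. But how can this be done in pandas when the category depends on two values from two different columns?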

      for row in csv.reader(inp):
          if int(row[1]) > (0.80 * max_user) and int(row[2]) > (0.80 * max_key):
              val = 'highly active'
          elif int(row[1]) >= (0.60 * max_user) and int(row[2]) <= (0.60 * max_key):
              val = 'active'
          elif int(row[1]) <= (0.40 * max_user) and int(row[2]) >= (0.40 * max_key):
              val = 'event based'
          elif int(row[1]) < (0.20 * max_user) and int(row[2]) < (0.20 * max_key):
              val = 'situational'
          else:
              val = 'viewers'
      

1 Answer:

Answer 0: (score: 1)

Assuming you have the following DataFrames:

In [7]: df1
Out[7]:
   date_count  all_hours  c_min  c_max  c_med  c_med_med  u_min  u_max  u_med  u_med_med
0           2         12   2309  19072  12515      13131    254    785    686        751

In [8]: df2
Out[8]:
          date startTime   endTime  day  c_total  u_total
0   2004-01-05  22:00:00  23:00:00  Mon    18944      790
1   2004-01-05  23:00:00  00:00:00  Mon    17534      750
2   2004-01-06  00:00:00  01:00:00  Tue    17262      747
3   2004-01-06  01:00:00  02:00:00  Tue    19072      777
4   2004-01-06  02:00:00  03:00:00  Tue    18275      785
5   2004-01-06  03:00:00  04:00:00  Tue    13589      757
6   2004-01-06  04:00:00  05:00:00  Tue    16053      735
7   2004-01-06  05:00:00  06:00:00  Tue    11440      636
8   2004-01-06  06:00:00  07:00:00  Tue     5972      513
9   2004-01-06  07:00:00  08:00:00  Tue     3424      382
10  2004-01-06  08:00:00  09:00:00  Tue     2696      303
11  2004-01-06  09:00:00  10:00:00  Tue     2350      262
12  2004-01-06  10:00:00  11:00:00  Tue     2309      254

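Separate by threshold. A Series can be compared either with another Series of the same length or with a scalar value; I assume you want to compare the second data set against a scalar, the c_med value from the first data set: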

In [22]: above = df2[df2.c_total > df1.loc[0, 'c_med']]

In [23]: above
Out[23]:
         date startTime   endTime  day  c_total  u_total
0  2004-01-05  22:00:00  23:00:00  Mon    18944      790
1  2004-01-05  23:00:00  00:00:00  Mon    17534      750
2  2004-01-06  00:00:00  01:00:00  Tue    17262      747
3  2004-01-06  01:00:00  02:00:00  Tue    19072      777
4  2004-01-06  02:00:00  03:00:00  Tue    18275      785
5  2004-01-06  03:00:00  04:00:00  Tue    13589      757
6  2004-01-06  04:00:00  05:00:00  Tue    16053      735
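
The question also asks for the rows below the threshold and for both parts to be written to CSV. A minimal sketch of that step follows, under a few assumptions: the two files are replaced by in-memory stand-ins holding a handful of the question's rows, the data file is assumed to have no header row, and rows exactly equal to the threshold are sent to the "below" part. In the pandas version used in the question, comparing against the whole `df_med['c_med']` Series rather than a scalar is what produced the "Series lengths must match" error.

```python
import io
import pandas as pd

# In-memory stand-ins for the two CSV files; with real files, pass the
# paths (e.g. 'stat_result.csv', 'mini_out.csv') to read_csv instead.
stats_csv = io.StringIO(
    "date_count,all_hours,c_min,c_max,c_med,c_med_med,u_min,u_max,u_med,u_med_med\n"
    "2,12,2309,19072,12515,13131,254,785,686,751\n"
)
data_csv = io.StringIO(
    "2004-01-05,22:00:00,23:00:00,Mon,18944,790\n"
    "2004-01-06,03:00:00,04:00:00,Tue,13589,757\n"
    "2004-01-06,05:00:00,06:00:00,Tue,11440,636\n"
    "2004-01-06,06:00:00,07:00:00,Tue,5972,513\n"
    "2004-01-06,10:00:00,11:00:00,Tue,2309,254\n"
)

cols = ["date", "startTime", "endTime", "day", "c_total", "u_total"]
df1 = pd.read_csv(stats_csv)
# Assumption: the data file has no header row, so names are supplied here.
df2 = pd.read_csv(data_csv, header=None, names=cols)

# Pull out a single scalar instead of comparing against a whole Series.
threshold = df1.loc[0, "c_med"]

is_above = df2["c_total"] > threshold
above = df2[is_above]
below = df2[~is_above]  # rows equal to the threshold end up here

# Keep the header so the files can be read back without renaming columns.
above.to_csv("above.csv", index=False)
below.to_csv("below.csv", index=False)
```

Reading the result back is then just `pd.read_csv('above.csv')`, with the column names already in place.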

You can use the qcut() method to categorize the data into quantile-based bins:

In [29]: df2['cat'] = pd.qcut(df2.c_total,
   ....:                        q=[0, .2, .4, .6, .8, 1.],
   ....:                        labels=['viewers','event based','situational','active','highly active'])

In [30]: df2
Out[30]:
          date startTime   endTime  day  c_total  u_total            cat
0   2004-01-05  22:00:00  23:00:00  Mon    18944      790  highly active
1   2004-01-05  23:00:00  00:00:00  Mon    17534      750         active
2   2004-01-06  00:00:00  01:00:00  Tue    17262      747         active
3   2004-01-06  01:00:00  02:00:00  Tue    19072      777  highly active
4   2004-01-06  02:00:00  03:00:00  Tue    18275      785  highly active
5   2004-01-06  03:00:00  04:00:00  Tue    13589      757    situational
6   2004-01-06  04:00:00  05:00:00  Tue    16053      735    situational
7   2004-01-06  05:00:00  06:00:00  Tue    11440      636    situational
8   2004-01-06  06:00:00  07:00:00  Tue     5972      513    event based
9   2004-01-06  07:00:00  08:00:00  Tue     3424      382    event based
10  2004-01-06  08:00:00  09:00:00  Tue     2696      303        viewers
11  2004-01-06  09:00:00  10:00:00  Tue     2350      262        viewers
12  2004-01-06  10:00:00  11:00:00  Tue     2309      254        viewers
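
Note that qcut() splits on quantiles, so each label covers roughly the same number of rows, whereas the loop in the question splits on fixed fractions of the maximum value. If the second behaviour is what you need, one possible approach is pd.cut() with explicit bin edges. The sketch below assumes non-negative counts and uses left-closed bins to mirror the `>=` / `<` tests in the question; the names `pct` and `result` are only illustrative.

```python
import numpy as np
import pandas as pd

c_total = pd.Series(
    [18944, 17534, 17262, 19072, 18275, 13589, 16053,
     11440, 5972, 3424, 2696, 2350, 2309],
    name="c_total",
)
labels = ["viewers", "event based", "situational", "active", "highly active"]

# Each value as a fraction of the column maximum.
pct = c_total / c_total.max()

# right=False gives bins [0, .2), [.2, .4), [.4, .6), [.6, .8), [.8, inf),
# matching the if/elif chain from the question. The open upper edge keeps
# the maximum itself (pct == 1.0) inside the last bin.
cat = pd.cut(pct, bins=[0, 0.2, 0.4, 0.6, 0.8, np.inf],
             labels=labels, right=False)

result = pd.DataFrame({"c_total": c_total, "pct": pct, "cat": cat})
print(result)
```

With this data most of the high rows fall into 'highly active', because they are all above 80% of the maximum, which differs from the evenly sized groups that qcut() produces.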

Check the qcut() categories against each row's share of the maximum c_total:

In [32]: df2.assign(pct=df2.c_total/df2.c_total.max())[['c_total','pct','cat']]
Out[32]:
    c_total       pct            cat
0     18944  0.993289  highly active
1     17534  0.919358         active
2     17262  0.905096         active
3     19072  1.000000  highly active
4     18275  0.958211  highly active
5     13589  0.712510    situational
6     16053  0.841705    situational
7     11440  0.599832    situational
8      5972  0.313129    event based
9      3424  0.179530    event based
10     2696  0.141359        viewers
11     2350  0.123217        viewers
12     2309  0.121068        viewers
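
For the follow-up in the question's edit, where the category depends on two columns at once, qcut() no longer applies directly. One option is numpy.select(), which evaluates a list of boolean conditions in order and takes the first match, much like an if/elif chain. This is only a sketch: it assumes the two columns are c_total and u_total (the question refers to them only as row[1] and row[2] with max_user and max_key), and it copies the thresholds from the question as written.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "c_total": [18944, 17534, 17262, 19072, 18275, 13589, 16053,
                11440, 5972, 3424, 2696, 2350, 2309],
    "u_total": [790, 750, 747, 777, 785, 757, 735,
                636, 513, 382, 303, 262, 254],
})

# Hypothetical mapping: c_total plays the role of row[1], u_total of row[2].
c, u = df["c_total"], df["u_total"]
max_c, max_u = c.max(), u.max()

# Conditions are checked in order; the first True one decides the label.
conditions = [
    (c > 0.80 * max_c) & (u > 0.80 * max_u),
    (c >= 0.60 * max_c) & (u <= 0.60 * max_u),
    (c <= 0.40 * max_c) & (u >= 0.40 * max_u),
    (c < 0.20 * max_c) & (u < 0.20 * max_u),
]
choices = ["highly active", "active", "event based", "situational"]

# Rows matching none of the conditions get the default, like the final else.
df["cat"] = np.select(conditions, choices, default="viewers")
print(df)
```

Because the conditions from the question do not cover every combination of the two columns, a fair number of rows can end up in the default 'viewers' group.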