我试图将c_med
一个值的值作为输入的阈值:1并将输入的两个不同输出中的上下值分开:2。写above.csv
& below.csv
参考专栏c_total
。
阅读above.csv
作为输入,并按照纯python中编写的第2点中提到的百分比对它们进行分类。
输入:1
date_count,all_hours,c_min,c_max,c_med,c_med_med,u_min,u_max,u_med,u_med_med
2,12,2309,19072,12515,13131,254,785,686,751
输入:2 ['date','startTime','endTime','day','c_total','u_total']
2004-01-05,22:00:00,23:00:00,Mon,18944,790
2004-01-05,23:00:00,00:00:00,Mon,17534,750
2004-01-06,00:00:00,01:00:00,Tue,17262,747
2004-01-06,01:00:00,02:00:00,Tue,19072,777
2004-01-06,02:00:00,03:00:00,Tue,18275,785
2004-01-06,03:00:00,04:00:00,Tue,13589,757
2004-01-06,04:00:00,05:00:00,Tue,16053,735
2004-01-06,05:00:00,06:00:00,Tue,11440,636
2004-01-06,06:00:00,07:00:00,Tue,5972,513
2004-01-06,07:00:00,08:00:00,Tue,3424,382
2004-01-06,08:00:00,09:00:00,Tue,2696,303
2004-01-06,09:00:00,10:00:00,Tue,2350,262
2004-01-06,10:00:00,11:00:00,Tue,2309,254
c_med
我收到以下错误:
Traceback (most recent call last):
File "class_med.py", line 10, in <module>
above_median = df_data['c_total'] > df_med['c_med']
File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 735, in wrapper
raise ValueError('Series lengths must match to compare')
ValueError: Series lengths must match to compare
使用百分比过滤分隔的数据列c_total
。下面给出的纯python解决方案,但我正在寻找一个熊猫解决方案。比如Reference one
for row in csv.reader(inp):
if int(row[1])<(.20 * max_value):
val = 'viewers'
elif int(row[1])>=(0.20*max_value) and int(row[1])<(0.40*max_value):
val= 'event based'
elif int(row[1])>=(0.40*max_value) and int(row[1])<(0.60*max_value):
val= 'situational'
elif int(row[1])>=(0.60*max_value) and int(row[1])<(0.80*max_value):
val = 'active'
else:
val= 'highly active'
writer.writerow([row[0],row[1],val])
代码:
import pandas as pd
import numpy as np
df_med = pd.read_csv('stat_result.csv')
df_med.columns = ['date_count', 'all_hours', 'c_min', 'c_max', 'c_med', 'c_med_med', 'u_min', 'u_max', 'u_med', 'u_med_med']
df_data = pd.read_csv('mini_out.csv')
df_data.columns = ['date', 'startTime', 'endTime', 'day', 'c_total', 'u_total']
above = df_data['c_total'] > df_med['c_med']
#print above_median
above.to_csv('above.csv', index=None, header=None)
df_above = pd.readcsv('above_median.csv')
df_above.columns = ['date', 'startTime', 'endTime', 'day', 'c_total', 'u_total']
#Percentage block should come here
修改:如果是单列值,则qcut
是最简单的解决方案。但是当涉及到使用两个不同列中的两个值时,如何在熊猫中实现这一点?
for row in csv.reader(inp):
if int(row[1])>(0.80*max_user) and int(row[2])>(0.80*max_key):
val='highly active'
elif int(row[1])>=(0.60*max_user) and int(row[2])<=(0.60*max_key):
val='active'
elif int(row[1])<=(0.40*max_user) and int(row[2])>=(0.40*max_key):
val='event based'
elif int(row[1])<(0.20*max_user) and int(row[2])<(0.20*max_key):
val ='situational'
else:
val= 'viewers'
答案 0 :(得分:1)
假设您有以下DF:
In [7]: df1
Out[7]:
date_count all_hours c_min c_max c_med c_med_med u_min u_max u_med u_med_med
0 2 12 2309 19072 12515 13131 254 785 686 751
In [8]: df2
Out[8]:
date startTime endTime day c_total u_total
0 2004-01-05 22:00:00 23:00:00 Mon 18944 790
1 2004-01-05 23:00:00 00:00:00 Mon 17534 750
2 2004-01-06 00:00:00 01:00:00 Tue 17262 747
3 2004-01-06 01:00:00 02:00:00 Tue 19072 777
4 2004-01-06 02:00:00 03:00:00 Tue 18275 785
5 2004-01-06 03:00:00 04:00:00 Tue 13589 757
6 2004-01-06 04:00:00 05:00:00 Tue 16053 735
7 2004-01-06 05:00:00 06:00:00 Tue 11440 636
8 2004-01-06 06:00:00 07:00:00 Tue 5972 513
9 2004-01-06 07:00:00 08:00:00 Tue 3424 382
10 2004-01-06 08:00:00 09:00:00 Tue 2696 303
11 2004-01-06 09:00:00 10:00:00 Tue 2350 262
12 2004-01-06 10:00:00 11:00:00 Tue 2309 254
按阈值分隔(您可以比较两个具有相同长度或标量值的系列 - 我假设您将第二个数据集与第一个数据集中的标量值(c_med
列)进行比较你的第一个数据集:
In [22]: above = df2[df2.c_total > df1.ix[0, 'c_med']]
In [23]: above
Out[23]:
date startTime endTime day c_total u_total
0 2004-01-05 22:00:00 23:00:00 Mon 18944 790
1 2004-01-05 23:00:00 00:00:00 Mon 17534 750
2 2004-01-06 00:00:00 01:00:00 Tue 17262 747
3 2004-01-06 01:00:00 02:00:00 Tue 19072 777
4 2004-01-06 02:00:00 03:00:00 Tue 18275 785
5 2004-01-06 03:00:00 04:00:00 Tue 13589 757
6 2004-01-06 04:00:00 05:00:00 Tue 16053 735
您可以使用qcut()方法对数据进行分类:
In [29]: df2['cat'] = pd.qcut(df2.c_total,
....: q=[0, .2, .4, .6, .8, 1.],
....: labels=['viewers','event based','situational','active','highly active'])
In [30]: df2
Out[30]:
date startTime endTime day c_total u_total cat
0 2004-01-05 22:00:00 23:00:00 Mon 18944 790 highly active
1 2004-01-05 23:00:00 00:00:00 Mon 17534 750 active
2 2004-01-06 00:00:00 01:00:00 Tue 17262 747 active
3 2004-01-06 01:00:00 02:00:00 Tue 19072 777 highly active
4 2004-01-06 02:00:00 03:00:00 Tue 18275 785 highly active
5 2004-01-06 03:00:00 04:00:00 Tue 13589 757 situational
6 2004-01-06 04:00:00 05:00:00 Tue 16053 735 situational
7 2004-01-06 05:00:00 06:00:00 Tue 11440 636 situational
8 2004-01-06 06:00:00 07:00:00 Tue 5972 513 event based
9 2004-01-06 07:00:00 08:00:00 Tue 3424 382 event based
10 2004-01-06 08:00:00 09:00:00 Tue 2696 303 viewers
11 2004-01-06 09:00:00 10:00:00 Tue 2350 262 viewers
12 2004-01-06 10:00:00 11:00:00 Tue 2309 254 viewers
检查:
In [32]: df2.assign(pct=df2.c_total/df2.c_total.max())[['c_total','pct','cat']]
Out[32]:
c_total pct cat
0 18944 0.993289 highly active
1 17534 0.919358 active
2 17262 0.905096 active
3 19072 1.000000 highly active
4 18275 0.958211 highly active
5 13589 0.712510 situational
6 16053 0.841705 situational
7 11440 0.599832 situational
8 5972 0.313129 event based
9 3424 0.179530 event based
10 2696 0.141359 viewers
11 2350 0.123217 viewers
12 2309 0.121068 viewers