我将输入日志转换为输入数据集的方法非常糟糕。 我有一个SFrame sf,格式如下:
user_id int
timestamp datetime.datetime
action int
reasoncode str
操作列占用9个值,范围从1到9。
因此,每个user_id可以执行多次操作,不止一次。
我正在尝试从sf获取所有唯一的user_id,并按以下方式创建op_sf:
y = 225
def calc_class(a,x):
diffd = a['timestamp'].apply(lambda x: (dte - x).days)
g = 0
b = 0
for i in diffd:
if i > y:
g += 1
else:
b += 1
if b>= x:
return 4
elif b!= 0:
return 3
elif g>= 0:
return 2
else:
return 1
l1 = []
ids = z['user_id'].unique()
for idd in ids:
temp = sf[sf['user_id']== idd]
zero1 = temp[temp['action'] == 1]
zero2 = temp[temp['action'] == 2]
zero3 = temp[temp['action'] == 3]
zero4 = temp[temp['action'] == 4]
zero5 = temp[temp['action'] == 5]
zero6 = temp[temp['action'] == 6]
zero7 = temp[temp['action'] == 7]
zeroh8 = temp[temp['reasoncode'] == 'xyz']
zero9 = temp[temp['reasoncode'] == 'abc']
/* I'm getting clas1 to clas9 from function calc_class for each action
clas1 to clas9 are 4 integers ranging from 1 to 4
*/
clas1 = calc_class(zero1,2)
clas2 = calc_class(zero2,2)
clas3 = calc_class(zero3,2)
clas4 = calc_class(zero4,2)
clas5 = calc_class(zero5,2)
clas6 = calc_class(zero6,2)
clas7 = calc_class(zero7,2)
clas8 = calc_class(zero8,2)
clas9 = calc_class(zero9,2)
l1.append([idd,clas1,clas2,clas3,clas4,clas5*(-1),clas6*(-1),clas7*(-1),clas8*(-1),clas9])
我想知道这是否是最快的方法。具体来说,如果可以在不生成zero1到zero9 SFrame的情况下执行相同的操作。
示例sf:
user_id timestamp action reasoncode
574 23/09/15 12:43 1 None
574 23/09/15 11:15 2 None
574 06/10/15 11:20 2 None
574 06/10/15 11:21 3 None
588 04/11/15 10:00 1 None
588 05/11/15 10:00 1 None
555 15/12/15 13:00 1 None
585 22/12/15 17:30 1 None
585 15/01/16 07:44 7 xyz
588 06/01/16 08:10 7 abc
l1对应上面的sf:
574 1 2 2 0 0 0 0 0 0
588 3 0 0 0 0 0 0 0 3
555 3 0 0 0 0 0 0 0 0
585 3 0 0 0 0 0 0 3 0
答案 0 :(得分:1)
我认为您的逻辑相对复杂,但在整个数据集上使用逐列操作仍然更有效,而不是为每个用户提取行的子集。关键工具包括SFrame.groupby
,SFrame.apply
,SFrame.unstack
和SFrame.unpack
。 API文档在这里:
https://dato.com/products/create/docs/generated/graphlab.SFrame.html
这是一个使用比您的示例稍微简单的数据的解决方案,以及稍微简单的逻辑来编码旧操作和新操作。
# Set up and make the data
import graphlab as gl
import datetime as dt
sf = gl.SFrame({'user': [574, 574, 574, 588, 588, 588],
'timestamp': [dt.datetime(2015, 9, 23), dt.datetime(2015, 9, 23),
dt.datetime(2015, 10, 6), dt.datetime(2015, 11, 4),
dt.datetime(2015, 11, 5), dt.datetime(2016, 1, 6)],
'action': [1, 2, 3, 1, 1, 7]})
# Count old vs. new actions.
sf['days_elapsed'] = (dt.datetime.today() - sf['timestamp']) / (3600 * 24)
sf['old_threshold'] = sf['days_elapsed'] > 225
aggregator = {'total_count': gl.aggregate.COUNT('user'),
'old_count': gl.aggregate.SUM('old_threshold')}
grp = sf.groupby(['user', 'action'], aggregator)
# Code the actions according to old vs. new. Use your own logic here.
grp['action_code'] = grp.apply(
lambda x: 2 if x['total_count'] > x['old_count'] else 1)
grp = grp[['user', 'action', 'action_code']]
# Reshape the results into columns.
sf_new = (grp.unstack(['action', 'action_code'], new_column_name='action_code')
.unpack('action_code'))
# Fill in zeros for entries with no actions.
for c in sf_new.column_names():
sf_new[c] = sf_new[c].fillna(0)
print sf_new
+------+---------------+---------------+---------------+---------------+
| user | action_code.1 | action_code.2 | action_code.3 | action_code.7 |
+------+---------------+---------------+---------------+---------------+
| 588 | 2 | 0 | 0 | 2 |
| 574 | 1 | 1 | 1 | 0 |
+------+---------------+---------------+---------------+---------------+
[2 rows x 5 columns]