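I am processing a large amount of data for a machine learning model. Because the dataset is large, one of my functions takes too long to run. Is there a pandas function that can replace the following code?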
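import pandas as pd

df = pd.DataFrame({'Weight': [45, 88, 45, 88, 45, 88, 54, 45, 88],
                   'Name': ['Sam', 'Sia', 'Sam', 'Sia', 'Sam', 'Sia', 'Ryan', 'Sam', 'Sia'],
                   'Age': [100, 95, 93, 90, 10, 95, 92, 110, 33]})
my_group = df.groupby(['Name'])
difference_df = df.copy()  # assumed: difference_df was not defined in the original snippet
col_names = []
diff_range = 5
# Note: the group (pair) is never used inside the loop, so every pass
# repeats the same whole-frame computation.
for pair in my_group:
    for i in range(1, diff_range + 1):
        col_names.append(str(i))
        # Age[k + i] - Age[k] for every row k
        difference_df[str(i)] = df['Age'].diff(i).shift(periods=-i)
    difference_df['d_id_max'] = difference_df[col_names].idxmax(axis=1)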
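The code above takes each row of my DataFrame, group by group, computes the difference between that row's 'model_prediction' value ('Age' in this example) and each of the next 3 rows (5 in this example, set by diff_range), and finally returns the index of the row with the largest difference from that row.
Input DataFrame: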
   Weight  Name  Age
0      45   Sam  100
1      88   Sia   95
2      45   Sam   93
3      88   Sia   90
4      45   Sam   10
5      88   Sia   95
6      54  Ryan   92
7      45   Sam  110
8      88   Sia   33
Expected output:
   Weight  Name  Age      1      2      3      4      5  d_id_max
0      45   Sam  100   -5.0   -7.0  -10.0  -90.0   -5.0         1
1      88   Sia   95   -2.0   -5.0  -85.0    0.0   -3.0         4
2      45   Sam   93   -3.0  -83.0    2.0   -1.0   17.0         5
3      88   Sia   90  -80.0    5.0    2.0   20.0  -57.0         4
4      45   Sam   10   85.0   82.0  100.0   23.0    NaN         3
5      88   Sia   95   -3.0   15.0  -62.0    NaN    NaN         2
6      54  Ryan   92   18.0  -59.0    NaN    NaN    NaN         1
7      45   Sam  110  -77.0    NaN    NaN    NaN    NaN         1
8      88   Sia   33    NaN    NaN    NaN    NaN    NaN       NaN
Answer 0 (score: 1)
Use df.shift() to compute the differences between rows, then use df.idxmax() to get the column with the largest value.
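A minimal sketch of the shift-then-idxmax approach, assuming the question's sample data and its 'Age' column. The helper name add_forward_differences and its parameters are hypothetical, made up for illustration. It skips the groupby entirely, because the expected output in the question is computed over the whole frame rather than per group.

```python
import pandas as pd


def add_forward_differences(df, column="Age", diff_range=5):
    """Hypothetical helper: add columns '1'..str(diff_range) holding
    value[k + i] - value[k], plus 'd_id_max' naming the offset column
    that holds the largest difference for each row."""
    out = df.copy()
    col_names = [str(i) for i in range(1, diff_range + 1)]
    for i, name in enumerate(col_names, start=1):
        # shift(-i) moves the value found i rows below up to the current row
        out[name] = out[column].shift(-i) - out[column]

    diffs = out[col_names]
    # Rows near the end may have only NaN differences; leave them out of
    # idxmax and let index alignment fill them with NaN afterwards.
    has_value = diffs.notna().any(axis=1)
    best = diffs[has_value].idxmax(axis=1)
    out["d_id_max"] = best.reindex(out.index)
    return out


df = pd.DataFrame({
    "Weight": [45, 88, 45, 88, 45, 88, 54, 45, 88],
    "Name": ["Sam", "Sia", "Sam", "Sia", "Sam", "Sia", "Ryan", "Sam", "Sia"],
    "Age": [100, 95, 93, 90, 10, 95, 92, 110, 33],
})

result = add_forward_differences(df)
print(result)
```

The d_id_max values are column labels (strings such as '1'), and ties resolve to the first matching column. If the differences should instead stay within each name, one option is to replace out[column].shift(-i) with out.groupby("Name")[column].shift(-i), although that would no longer reproduce the expected output shown in the question.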
in-string
Output:
(module string-util typed/racket
  (provide (all-defined-out))

  (: empty-string? : (-> String Boolean))
  (define (empty-string? s)
    (string=? "" s))

  (: string-first : (-> String String))
  (define (string-first s)
    (substring s 0 1))

  (: string-last : (-> String String))
  (define (string-last s)
    (substring s (- (string-length s) 1) (string-length s)))

  (: string-rest : (-> String String))
  (define (string-rest s)
    (substring s 1 (string-length s))))

(require 'string-util)

(define (split-string-recur str)
  (cond [(or (empty-string? str) (empty-string? (string-rest str))) '()]
        [else (cons (string-append (string-first str) (string-first (string-rest str)))
                    (split-string-recur (string-rest (string-rest str))))]))