I have the following dataframe df1:
+------+----------+-----+
| Date | Location | Key |
+------+----------+-----+
| | a | 1 |
| | a | 2 |
| | b | 3 |
| | b | 3 |
| | b | 3 |
| | c | 4 |
| | c | 4 |
| | b | 5 |
| | b | 6 |
| | d | 7 |
| | b | 8 |
| | b | 8 |
| | b | 8 |
| | b | 9 |
+------+----------+-----+
and the following df2, which is sliced from it:
+------+----------+-----+
| Date | Location | Key |
+------+----------+-----+
| | b | 3 |
| | b | 3 |
| | b | 3 |
| | b | 5 |
| | b | 6 |
| | b | 8 |
| | b | 8 |
| | b | 9 |
| | b | 9 |
+------+----------+-----+
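The goal is to find the time differences between changes of Key in df2 (for example from the last 3 to 5, from 5 to 6, from 6 to the first 8, from the last 8 to the first 9, and so on), add them up, repeat this for each Location, and take the average.
Can this process be vectorized, or do we need to slice the dataframe for each machine and compute the average manually?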
[Edit]:
Traceback (most recent call last):
File "<ipython-input-1142-b85a122735aa>", line 1, in <module>
s = temp.groupby('SSCM_ Location').apply(lambda x: x[x['Key'].diff().ne(0)]['Execution Date'].diff().mean())
File "C:\Users\dbhadra\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\groupby\groupby.py", line 930, in apply
return self._python_apply_general(f)
File "C:\Users\dbhadra\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\groupby\groupby.py", line 936, in _python_apply_general
self.axis)
File "C:\Users\dbhadra\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\groupby\groupby.py", line 2273, in apply
res = f(group)
File "<ipython-input-1142-b85a122735aa>", line 1, in <lambda>
s = temp.groupby('SSCM_ Location').apply(lambda x: x[x['Key'].diff().ne(0)]['Execution Date'].diff().mean())
File "C:\Users\dbhadra\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\series.py", line 1995, in diff
result = algorithms.diff(com._values_from_object(self), periods)
File "C:\Users\dbhadra\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\algorithms.py", line 1823, in diff
out_arr[res_indexer] = arr[res_indexer] - arr[lag_indexer]
TypeError: unsupported operand type(s) for -: 'str' and 'str'
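The last line of the traceback suggests that the 'Execution Date' column holds strings rather than datetimes, so diff() has nothing it can subtract. A minimal sketch of one possible fix, assuming the strings are parseable by pd.to_datetime; the sample values and the helper name mean_gap are made up for illustration, only the column names come from the traceback:

```python
import pandas as pd

# Hypothetical sample data; only the column names mirror the traceback.
temp = pd.DataFrame({
    "SSCM_ Location": ["b", "b", "b", "b"],
    "Key": [3, 3, 5, 6],
    "Execution Date": ["2019-01-01", "2019-01-02", "2019-01-04", "2019-01-07"],
})

# Strings cannot be subtracted, so convert the column to datetime64 first.
temp["Execution Date"] = pd.to_datetime(temp["Execution Date"])

def mean_gap(group):
    # Keep the first row of each run of equal keys, then average the date gaps.
    starts = group.loc[group["Key"].diff().ne(0), "Execution Date"]
    return starts.diff().mean()

# Selecting the columns explicitly keeps the grouping column out of apply().
s = temp.groupby("SSCM_ Location")[["Key", "Execution Date"]].apply(mean_gap)
print(s)
```

With the column converted, the result is a Series of timedeltas indexed by location.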
Answer 0 (score: 0)
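You could try:
g=df.groupby(['Location','Key'])
# first date of each key minus last date of the previous key, averaged per Location
# (mean(level=0) was removed in pandas 2.0, so group on the index level instead)
(g.first()-g.last().groupby('Location').shift()).groupby(level=0).mean()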
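To make the idea concrete, here is a small sketch of the same approach on made-up dates (the question does not show any, so the values below are assumptions). It also assumes that each Key forms a single contiguous run per Location and that keys increase over time, since groupby sorts by Key:

```python
import pandas as pd

# Hypothetical dates for a single location.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-04",
                            "2019-01-07", "2019-01-08"]),
    "Location": ["b"] * 5,
    "Key": [3, 3, 5, 6, 6],
})

g = df.groupby(["Location", "Key"])["Date"]
first, last = g.first(), g.last()

# Gap from the last row of the previous key to the first row of the current key.
gaps = first - last.groupby(level="Location").shift()

# Average gap per location.
result = gaps.groupby(level="Location").mean()
print(result)
```

Here the gaps are 2 days (last 3 to 5) and 3 days (5 to first 6), so the average for b is 2 days 12 hours.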
Answer 1 (score: 0)
s = df.groupby('Location').apply(lambda x: x[x['Key'].diff().ne(0)]['Date'].diff().mean())
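Is this what you meant? It averages the datetime deltas between the rows where the key value changes, for each location. If you meant averaging the change in Key instead, just replace 'Date' with 'Key'.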
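Since the question asks about vectorization, the same computation can probably be written without apply. The following is a sketch under assumptions: the dates are invented, the frame is already sorted by date within each location, and the gap is measured from the first row of one key to the first row of the next:

```python
import pandas as pd

# Hypothetical sample with two locations; the dates are made up.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-01-01", "2019-01-03", "2019-01-02",
                            "2019-01-03", "2019-01-05", "2019-01-09"]),
    "Location": ["a", "a", "b", "b", "b", "b"],
    "Key": [1, 2, 3, 3, 5, 6],
})

# Rows where Key differs from the previous row of the same location
# (the first row of each location counts as a change because its diff is NaN).
is_change = df.groupby("Location")["Key"].diff().ne(0)
starts = df[is_change]

# Date gaps between consecutive change rows, then the mean per location.
gaps = starts.groupby("Location")["Date"].diff()
result = gaps.groupby(starts["Location"]).mean()
print(result)
```

Every step is a grouped column operation, so no per-location slicing or Python-level lambda is involved.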
Answer 2 (score: 0)
You could try:
import numpy as np

# obviously we will group by Location
groups = df1.groupby('Location')
# we record the changes and mark the unchanged with nan
df1['changes'] = groups.Key.diff().replace({0:np.nan})
# average the changes by location
# ignore all the nan's (unchanges)
groups.changes.mean()
Output:
Location
a 1.0
b 1.5
c NaN
d NaN
Name: changes, dtype: float64