Question

I have the following dataframe df1:

+------+----------+-----+
| Date | Location | Key |
+------+----------+-----+
|      | a        |   1 |
|      | a        |   2 |
|      | b        |   3 |
|      | b        |   3 |
|      | b        |   3 |
|      | c        |   4 |
|      | c        |   4 |
|      | b        |   5 |
|      | b        |   6 |
|      | d        |   7 |
|      | b        |   8 |
|      | b        |   8 |
|      | b        |   8 |
|      | b        |   9 |
+------+----------+-----+

and the following df2, which is sliced from it:

+------+----------+-----+
| Date | Location | Key |
+------+----------+-----+
|      | b        |   3 |
|      | b        |   3 |
|      | b        |   3 |
|      | b        |   5 |
|      | b        |   6 |
|      | b        |   8 |
|      | b        |   8 |
|      | b        |   9 |
|      | b        |   9 |
+------+----------+-----+

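The goal is to find the time differences between changes of Key in df2 (for example from the last 3 to 5, from 5 to 6, from 6 to the first 8, from the last 8 to the first 9, and so on), add them up, repeat this for every Location item, and take the average.

Can this process be vectorized, or do we need to slice the dataframe for each machine and compute the average manually?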

[Edit]:

Traceback (most recent call last):

  File "<ipython-input-1142-b85a122735aa>", line 1, in <module>
    s = temp.groupby('SSCM_ Location').apply(lambda x: x[x['Key'].diff().ne(0)]['Execution Date'].diff().mean())

  File "C:\Users\dbhadra\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\groupby\groupby.py", line 930, in apply
    return self._python_apply_general(f)

  File "C:\Users\dbhadra\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\groupby\groupby.py", line 936, in _python_apply_general
    self.axis)

  File "C:\Users\dbhadra\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\groupby\groupby.py", line 2273, in apply
    res = f(group)

  File "<ipython-input-1142-b85a122735aa>", line 1, in <lambda>
    s = temp.groupby('SSCM_ Location').apply(lambda x: x[x['Key'].diff().ne(0)]['Execution Date'].diff().mean())

  File "C:\Users\dbhadra\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\series.py", line 1995, in diff
    result = algorithms.diff(com._values_from_object(self), periods)

  File "C:\Users\dbhadra\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\algorithms.py", line 1823, in diff
    out_arr[res_indexer] = arr[res_indexer] - arr[lag_indexer]

TypeError: unsupported operand type(s) for -: 'str' and 'str'

Answer 1

You can try:

g = df.groupby(['Location', 'Key'])
# mean(level=0) was removed in pandas 2.0; group on the index level instead
(g.first() - g.last().groupby('Location').shift()).groupby(level=0).mean()
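To make the first/last idea concrete, here is a minimal sketch on a small frame. The dates are invented for illustration (the question's Date column is blank), and the sketch assumes that each Key forms one contiguous run per Location and that Key values increase over time, because `groupby` sorts by Key. The mean is taken over seconds and converted back, which is only a choice made here to sidestep version differences in averaging timedeltas.

```python
import pandas as pd

# Hypothetical sample data: the dates are invented for illustration only.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2019-01-01", "2019-01-02", "2019-01-03",  # Key 3
        "2019-01-05",                               # Key 5
        "2019-01-06",                               # Key 6
        "2019-01-09", "2019-01-10",                 # Key 8
        "2019-01-12",                               # Key 9
    ]),
    "Location": ["b"] * 8,
    "Key": [3, 3, 3, 5, 6, 8, 8, 9],
})

g = df.groupby(["Location", "Key"])
first_seen = g["Date"].first()  # first timestamp of each (Location, Key) run
last_seen = g["Date"].last()    # last timestamp of each (Location, Key) run

# Gap from the last row of the previous Key to the first row of the next Key,
# shifted within each Location so gaps never cross Location boundaries.
gaps = first_seen - last_seen.groupby(level="Location").shift()

# Average the gaps per Location (in seconds, then back to Timedelta).
mean_seconds = gaps.dt.total_seconds().groupby(level="Location").mean()
mean_gap = pd.to_timedelta(mean_seconds, unit="s")
print(mean_gap)
```

With these made-up dates the gaps for location b are 2, 1, 3 and 2 days, so the printed average is 2 days.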

Answer 2

s = df.groupby('Location').apply(lambda x: x[x['Key'].diff().ne(0)]['Date'].diff().mean())

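Is this what you mean? It averages the datetime deltas between the points where the Key value changes within each Location. If you meant averaging the change in Key instead, just replace 'Date' with 'Key'.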
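The `TypeError: unsupported operand type(s) for -: 'str' and 'str'` in the question's traceback is raised inside `Series.diff`, which suggests the date column holds strings rather than datetimes. A minimal sketch of the likely fix is to convert the column with `pd.to_datetime` before grouping. The column names are taken from the traceback; the sample values and the helper name `mean_change_gap` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: dates stored as strings, as the 'str' - 'str' error suggests.
temp = pd.DataFrame({
    "Execution Date": ["2019-01-01", "2019-01-02", "2019-01-04",
                       "2019-01-07", "2019-01-08"],
    "SSCM_ Location": ["b", "b", "b", "b", "b"],
    "Key": [3, 3, 5, 6, 6],
})

# Convert once, up front, so that subtraction yields Timedelta values.
temp["Execution Date"] = pd.to_datetime(temp["Execution Date"])


def mean_change_gap(group):
    """Average time between the first rows of consecutive Key runs."""
    changed = group["Key"].diff().ne(0)  # True on the first row of every Key run
    return group.loc[changed, "Execution Date"].diff().mean()


# Selecting the needed columns keeps the grouping column out of the applied frame.
s = (
    temp.groupby("SSCM_ Location")[["Key", "Execution Date"]]
        .apply(mean_change_gap)
)
print(s)
```

Note that this measures from the first row of one Key run to the first row of the next, which is slightly different from the "last 3 to 5" gap described in the question.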

Answer 3

You can try:

import numpy as np

# obviously we will group by Location
groups = df1.groupby('Location')

# record the Key changes and mark unchanged rows with NaN
df1['changes'] = groups['Key'].diff().replace({0: np.nan})

# average the changes by location
# NaN values (unchanged rows) are ignored
df1.groupby('Location')['changes'].mean()

Output:

Location
a    1.0
b    1.5
c    NaN
d    NaN
Name: changes, dtype: float64
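The output above averages the change in Key, not the elapsed time. A time-based analogue of the same diff-and-mask idea might look like the sketch below, which measures the gap from the last row of one Key to the first row of the next, as the question describes. The dates are invented for illustration, and the rows are assumed to be in chronological order within each Location (the sort enforces that).

```python
import pandas as pd

# Hypothetical sample data; the dates are invented for illustration.
df1 = pd.DataFrame({
    "Date": pd.to_datetime([
        "2019-01-01", "2019-01-03",
        "2019-01-01", "2019-01-02", "2019-01-05", "2019-01-06", "2019-01-10",
    ]),
    "Location": ["a", "a", "b", "b", "b", "b", "b"],
    "Key": [1, 2, 3, 3, 5, 5, 6],
})
df1 = df1.sort_values(["Location", "Date"])

by_loc = df1.groupby("Location")
key_step = by_loc["Key"].diff()    # NaN on the first row of each Location
date_step = by_loc["Date"].diff()  # time since the previous row in the same Location

# Keep only the time steps on rows where the Key actually changed.
changed = key_step.notna() & key_step.ne(0)
gap_seconds = date_step.dt.total_seconds().where(changed)

# NaN entries (unchanged rows) are skipped by mean().
mean_gap = pd.to_timedelta(
    gap_seconds.groupby(df1["Location"]).mean(), unit="s"
)
print(mean_gap)
```

Because each row is compared only with the row directly before it in the same Location, this variant does not need `apply` and does not require Key values to be unique or increasing.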

Pandas dataframe slicing and manipulation
