Question

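I'm looking for pointers to the appropriate documentation for accomplishing the analysis task described below with pandas in pylab. I've previously written python + matplotlib functions that do much of this, but the resulting code is slow and cumbersome to maintain. It seems pandas has the capabilities I need, but I'm getting bogged down trying to find the right approach and functions.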

In [1]: import pandas as pd

In [6]: df = pd.read_csv("tinyexample.csv", parse_dates=2)

In [7]: df
Out[7]: 
   I                  t       A      B        C     D        E
0  1  08/06/13 02:34 PM  109.40  105.50  124.30  1.00  1930.95
1  1  08/06/13 02:35 PM  110.61  106.21  124.30  0.90  1964.89
2  1  08/06/13 02:37 PM  114.35  108.84  124.30  0.98  2654.33
3  1  08/06/13 02:38 PM  115.38  109.81  124.30  1.01  2780.63
4  1  08/06/13 02:40 PM  116.08  110.94  124.30  0.99  2521.28
5  4  08/06/13 02:34 PM  105.03  100.96  127.43  1.12  2254.51
6  4  08/06/13 02:35 PM  106.73  101.72  127.43  1.08  2661.76
7  4  08/06/13 02:38 PM  111.21  105.17  127.38  1.06  3163.07
8  4  08/06/13 02:40 PM  111.69  106.28  127.38  1.09  2898.73

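The above is a tiny subset of minute-by-minute readings from a network of radio-connected data loggers. The sample shows the output from 2 loggers over 10 minutes. The actual data files contain output from dozens of loggers over several days.

Column 'I' is the logger ID, 't' is a timestamp, 'A' through 'C' are temperatures, 'D' is a flow rate, and 'E' is an energy rate computed from A, B, and D.

Because of poor radio connectivity, readings are missing at random times for all loggers.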
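To make the gaps concrete, here is a minimal sketch that lists the missing one-minute timestamps for each logger. It uses a few rows copied from the sample above. The timestamp format string (month/day/two-digit year, 12-hour clock) is an assumption about the file, and `missing_minutes` is a hypothetical helper name, not part of pandas.

```python
import io
import pandas as pd

# A few columns of the sample rows shown above
raw = io.StringIO(
    "I,t,E\n"
    "1,08/06/13 02:34 PM,1930.95\n"
    "1,08/06/13 02:35 PM,1964.89\n"
    "1,08/06/13 02:37 PM,2654.33\n"
    "1,08/06/13 02:38 PM,2780.63\n"
    "1,08/06/13 02:40 PM,2521.28\n"
    "4,08/06/13 02:34 PM,2254.51\n"
    "4,08/06/13 02:35 PM,2661.76\n"
    "4,08/06/13 02:38 PM,3163.07\n"
    "4,08/06/13 02:40 PM,2898.73\n"
)
df = pd.read_csv(raw)
# Assumption: month/day/2-digit year with a 12-hour clock and AM/PM
df["t"] = pd.to_datetime(df["t"], format="%m/%d/%y %I:%M %p")

def missing_minutes(grp):
    """Return the one-minute timestamps absent from one logger's readings."""
    full = pd.date_range(grp["t"].min(), grp["t"].max(), freq="1min")
    return full.difference(pd.DatetimeIndex(grp["t"]))

# Map each logger ID to the timestamps it is missing
gaps = {int(logger_id): missing_minutes(grp) for logger_id, grp in df.groupby("I")}
for logger_id, idx in gaps.items():
    print(logger_id, [ts.strftime("%H:%M") for ts in idx])
```

For these sample rows, logger 1 is missing 14:36 and 14:39, and logger 4 is missing 14:36, 14:37, and 14:39.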

Specifically, I want to do something like the following:

for i in I:
    ## Insert rows for all missing timestamps with interpolated values for A through E
    ## Update a new column 'F' with a cumulative sum of 'E' (actually E/60)

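I would then like to define a plotting function that lets me output vertically aligned strip chart plots similar to those shown in the documentation at http://pandas.pydata.org/pandas-docs/dev/visualization.html. I have tried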

df.plot(subplots=True, sharex=True)

which almost does what I want, except that:

It plots by index number rather than by date.
It does not create a separate plot line for each logger ID.

[Figure: plot results]

Finally, I want to be able to select subsets of the logger IDs and data columns to plot, e.g.

def myplot(df, ilist, clist):
    """
    ilist is of the form [ n, m, p, ...] where n, m, and p are logger id's in column 'I'
    clist is a list of column labels.

    Produces a stack of strip chart plots, one for each column, each containing plot lines for each id.
    """

Solution (using the accepted answer from Dan Allan - thanks, Dan)

import pandas as pd
import matplotlib.pyplot as plt 

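def myinterpolator(grp, cols = ['I', 'A', 'B', 'C', 'D', 'E']):
    # Regular one-minute grid spanning this logger's first and last readings
    index = pd.date_range(freq='1min', 
            start=grp.first_valid_index(), 
            end=grp.last_valid_index())
    # Add the missing timestamps as empty rows; Index.union returns a sorted index
    g1  = grp.reindex(grp.index.union(index))
    for col in cols:
        # .loc replaces .ix, which was removed in pandas 1.0
        g1[col] = g1[col].interpolate('time').loc[index]
    g1['F'] = g1['E'].cumsum()    
    return g1 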


def myplot(df, ilist, clist):
    df1 = df[df['I'].isin(ilist)][clist + ['I']]
    fig, ax = plt.subplots(len(clist))
    for I, grp in df1.groupby('I'):
        for j, col in enumerate(clist):
            grp[col].plot(ax=ax[j], sharex=True)


df = pd.read_csv("tinyexample.csv", parse_dates=True, index_col=1)

df_interpolated = pd.concat([myinterpolator(grp) for I, grp in df.groupby('I')])
myplot(df_interpolated, ilist=[1,4], clist=['F', 'A', 'C'])
plt.tight_layout()

Answer 1

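Two things are tricky here: interpolation (see Tom's comment) and your desire to plot different sensors in the same subplot. The subplots=True keyword is not sufficient for that; you have to use a loop. This works.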

import matplotlib.pyplot as plt

def myplot(df, ilist, clist):
    df1 = df[df['I'].isin(ilist)][clist + ['t', 'I']].set_index('t')
    fig, ax = plt.subplots(len(clist))
    for I, grp in df1.groupby('I'):
        for j, col in enumerate(clist):
            grp[col].plot(ax=ax[j], sharex=True)

Usage:

df['t'] = pd.to_datetime(df['t']) # Make sure pandas treats t as times.
myplot(df, [1, 4], ['A', 'B', 'C'])
plt.tight_layout() # cleans up the spacing of the plots
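As a possible variation on the loop above, the data for each column can first be pivoted so that every logger ID becomes its own column, which makes the "one line per logger" structure explicit. The sketch below is only an illustration: `myplot_wide` is a hypothetical name, the small DataFrame is built inline from a few sample rows, and the `Agg` backend is selected only so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# A few sample rows (columns A and C only), with 't' already parsed as datetimes
df = pd.DataFrame({
    "I": [1, 1, 1, 4, 4, 4],
    "t": pd.to_datetime([
        "2013-08-06 14:34", "2013-08-06 14:35", "2013-08-06 14:37",
        "2013-08-06 14:34", "2013-08-06 14:35", "2013-08-06 14:38",
    ]),
    "A": [109.40, 110.61, 114.35, 105.03, 106.73, 111.21],
    "C": [124.30, 124.30, 124.30, 127.43, 127.43, 127.38],
})

def myplot_wide(df, ilist, clist):
    """Hypothetical variant of myplot: one subplot per column, one line per logger."""
    sub = df[df["I"].isin(ilist)]
    # squeeze=False keeps `axes` two-dimensional even when clist has one entry
    fig, axes = plt.subplots(len(clist), sharex=True, squeeze=False)
    for ax, col in zip(axes[:, 0], clist):
        # Rows are timestamps, columns are logger IDs, NaN where a reading is missing
        wide = sub.pivot(index="t", columns="I", values=col)
        for logger_id in wide.columns:
            s = wide[logger_id].dropna()  # drop gaps so the line stays connected
            ax.plot(s.index, s.values, label=str(logger_id))
        ax.set_ylabel(col)
        ax.legend(title="I")
    return fig, axes

fig, axes = myplot_wide(df, [1, 4], ["A", "C"])
fig.tight_layout()
```

Passing `sharex=True` to `plt.subplots` ties the time axes together once, rather than on every plot call, and the per-subplot legend identifies the loggers.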

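You may not actually need to interpolate. The code above runs even if some data is missing, and the plot lines visually interpolate the data linearly. But if you want actual interpolation, say for additional analysis, see this answer.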
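If actual interpolated rows are needed, one possible approach is sketched below. It uses `resample` rather than the reindexing approach in the question's solution, and it assumes every reading falls exactly on a minute boundary. The timestamp format string and the helper name `fill_logger` are assumptions made for illustration.

```python
import io
import pandas as pd

# A few columns of the sample rows from the question
raw = io.StringIO(
    "I,t,A,E\n"
    "1,08/06/13 02:34 PM,109.40,1930.95\n"
    "1,08/06/13 02:35 PM,110.61,1964.89\n"
    "1,08/06/13 02:37 PM,114.35,2654.33\n"
    "1,08/06/13 02:38 PM,115.38,2780.63\n"
    "1,08/06/13 02:40 PM,116.08,2521.28\n"
    "4,08/06/13 02:34 PM,105.03,2254.51\n"
    "4,08/06/13 02:35 PM,106.73,2661.76\n"
    "4,08/06/13 02:38 PM,111.21,3163.07\n"
    "4,08/06/13 02:40 PM,111.69,2898.73\n"
)
df = pd.read_csv(raw)
# Assumption: month/day/2-digit year with a 12-hour clock and AM/PM
df["t"] = pd.to_datetime(df["t"], format="%m/%d/%y %I:%M %p")

def fill_logger(grp):
    """Put one logger on a regular one-minute grid and interpolate the gaps."""
    g = grp.set_index("t").sort_index()
    g = g.resample("1min").mean()     # missing minutes become all-NaN rows
    g = g.interpolate(method="time")  # linear in time between neighbouring readings
    g["F"] = (g["E"] / 60).cumsum()   # running total of E/60, as the question describes
    return g

# One filled frame per logger, stacked under the logger ID as the outer index level
filled = pd.concat(
    {int(logger_id): fill_logger(grp.drop(columns="I"))
     for logger_id, grp in df.groupby("I")}
)
print(filled.loc[1])
```

Each logger ends up with one row per minute between its first and last reading, so `filled.loc[1]` and `filled.loc[4]` can be compared or combined row by row.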

Pandas: inserting missing rows and plotting multiple series in a DataFrame
