Question

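This sounds complicated, but a simple plot will make it easy to understand: I have three curves of cumulative values over time, shown as the blue lines.

I want to average these three curves (or combine them in some statistically sound way) into one smooth curve and add a confidence interval.

I tried one simple solution: combining all the data into a single curve, averaging it with the pandas "rolling" function, and computing the rolling standard deviation. I plotted the result as the purple curve, with the confidence interval around it.

The problem with my real data, as the plot above also illustrates, is that the curve is not smooth at all and the confidence interval has sharp jumps. That is not a good representation of the three separate curves, since none of them has jumps.

Is there a better way to represent the three different curves as one smooth curve with a reasonable confidence interval?

I have supplied test code, tested on Python 3.5.1 with numpy and pandas (do not change the seed, so that you get the same curves).

There are some constraints: increasing the number of points for the "rolling" function is not a solution for me, because some of my data is too short for that.

Test code: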

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
np.random.seed(seed=42)


## data generation - cumulative analysis over time
df1_time = pd.DataFrame(np.random.uniform(0,1000,size=50), columns=['time'])
df1_values = pd.DataFrame(np.random.randint(0,10000,size=100), columns=['vals'])
df1_combined_sorted =  pd.concat([df1_time, df1_values], axis = 1).sort_values(by=['time'])
df1_combined_sorted_cumulative = np.cumsum(df1_combined_sorted['vals'])

df2_time = pd.DataFrame(np.random.uniform(0,1000,size=50), columns=['time'])
df2_values = pd.DataFrame(np.random.randint(1000,13000,size=100), columns=['vals'])
df2_combined_sorted =  pd.concat([df2_time, df2_values], axis = 1).sort_values(by=['time'])
df2_combined_sorted_cumulative = np.cumsum(df2_combined_sorted['vals'])

df3_time = pd.DataFrame(np.random.uniform(0,1000,size=50), columns=['time'])
df3_values = pd.DataFrame(np.random.randint(0,4000,size=100), columns=['vals'])
df3_combined_sorted =  pd.concat([df3_time, df3_values], axis = 1).sort_values(by=['time'])
df3_combined_sorted_cumulative = np.cumsum(df3_combined_sorted['vals'])


## combining the three curves
df_all_vals_cumulative = pd.concat([df1_combined_sorted_cumulative,
    df2_combined_sorted_cumulative, df3_combined_sorted_cumulative]).reset_index(drop=True)
df_all_time =  pd.concat([df1_combined_sorted['time'],
    df2_combined_sorted['time'], df3_combined_sorted['time']]).reset_index(drop=True)
df_all = pd.concat([df_all_time, df_all_vals_cumulative], axis = 1)


## creating confidence intervals 
df_all_sorted = df_all.sort_values(by=['time'])
ma = df_all_sorted.rolling(10).mean()
mstd = df_all_sorted.rolling(10).std()


## plotting
plt.fill_between(df_all_sorted['time'], ma['vals'] - 2 * mstd['vals'],
        ma['vals'] + 2 * mstd['vals'],color='b', alpha=0.2)
plt.plot(df_all_sorted['time'],ma['vals'], c='purple')
plt.plot(df1_combined_sorted['time'], df1_combined_sorted_cumulative, c='blue')
plt.plot(df2_combined_sorted['time'], df2_combined_sorted_cumulative, c='blue')
plt.plot(df3_combined_sorted['time'], df3_combined_sorted_cumulative, c='blue')
plt.show()

Answer 1

First of all, your sample code could be rewritten to make better use of pd (pandas). For example:

np.random.seed(seed=42)

## data generation - cumulative analysis over time
def get_data(max_val, max_time=1000):
    times = pd.DataFrame(np.random.uniform(0,max_time,size=50), columns=['time'])
    vals = pd.DataFrame(np.random.randint(0,max_val,size=100), columns=['vals'])
    df = pd.concat([times, vals], axis = 1).sort_values(by=['time']).\
            reset_index().drop('index', axis=1)
    df['cumulative'] = df.vals.cumsum()
    return df

# generate the dataframes
df1,df2,df3 = (df for df in map(get_data, [10000, 13000, 4000]))
dfs = (df1, df2, df3)

# join
df_all = pd.concat(dfs, ignore_index=True).sort_values(by=['time'])

# render function
def render(window=10):
    # compute rolling means and confidence intervals
    mean_val = df_all.cumulative.rolling(window).mean()
    std_val = df_all.cumulative.rolling(window).std()
    min_val = mean_val - 2*std_val
    max_val = mean_val + 2*std_val

    plt.figure(figsize=(16,9))
    for df in dfs:
        plt.plot(df.time, df.cumulative, c='blue')

    plt.plot(df_all.time, mean_val, c='r')
    plt.fill_between(df_all.time, min_val, max_val, color='blue', alpha=.2)
    plt.show()

The reason your curve is not smooth may be that the rolling window is not large enough. You can increase the window size to get a smoother graph. For example, render(20) gives:

while render(30) gives:

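A better approach, though, might be to interpolate each of the df['cumulative'] series over the entire time window and compute the mean and confidence interval across those series. With that in mind, we can modify the code as follows: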
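A minimal sketch of this interpolation idea, which is not necessarily the answerer's exact code: each curve is resampled onto a shared time grid, and the mean and spread are then computed across curves at each grid point. The helper name interpolate_and_average, the n_points parameter, the default column names, and the use of linear interpolation via np.interp are all assumptions made for illustration. With only three curves, mean ± 2 standard deviations is a rough band rather than a formal confidence interval.

```python
import numpy as np
import pandas as pd


def interpolate_and_average(dfs, n_points=200, time_col="time", value_col="cumulative"):
    """Resample each curve onto a shared time grid, then average across curves.

    Hypothetical helper: the name, parameters, and linear interpolation are
    illustrative assumptions, not taken from the original answer.
    """
    # Drop rows without a time or value, and make sure time is increasing.
    cleaned = [
        df.dropna(subset=[time_col, value_col]).sort_values(time_col)
        for df in dfs
    ]

    # Restrict the grid to the time range covered by every curve,
    # so that np.interp never has to extrapolate.
    start = max(df[time_col].min() for df in cleaned)
    stop = min(df[time_col].max() for df in cleaned)
    grid = np.linspace(start, stop, n_points)

    # One row per curve, one column per grid point.
    resampled = np.vstack([
        np.interp(grid, df[time_col].to_numpy(), df[value_col].to_numpy())
        for df in cleaned
    ])

    mean = resampled.mean(axis=0)
    std = resampled.std(axis=0, ddof=1)  # sample std across the curves
    return pd.DataFrame({
        "time": grid,
        "mean": mean,
        "lower": mean - 2 * std,
        "upper": mean + 2 * std,
    })
```

Assuming dfs holds the three data frames, band = interpolate_and_average(dfs) could then be drawn with plt.plot(band['time'], band['mean']) and plt.fill_between(band['time'], band['lower'], band['upper'], alpha=.2). Because every curve contributes exactly one value at each grid point, the band no longer jumps whenever the rolling window happens to mix points from different curves.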

and we get:

Average of several time series along with confidence intervals (with test code)
