Bokeh plot of resampled, hierarchical, categorical + time data

Asked: 2019-07-07 23:27:50

Tags: python dataframe bokeh hierarchical-data

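I know I'm close on this, but I just can't get Bokeh to do what I want. I need to resample time data into 15-minute intervals, group it by a hierarchical, categorical type, and plot the results by time group. Any help would be appreciated.

I have data like this: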

    basket_id   food_type               classified_time             dipped_time                 slot_number
0   185261      CHICKEN FILLETS         2019-07-07 11:38:23.153858  2019-07-07 11:38:40.271070  8
1   185263      CHICKEN FILLETS         2019-07-07 11:38:25.831668  2019-07-07 11:38:53.265553  4
2   185273      CRISPY CHICKEN TENDERS  2019-07-07 11:39:26.184932  2019-07-07 11:39:58.164302  5
3   185276      CRISPY CHICKEN TENDERS  2019-07-07 11:39:30.178273  2019-07-07 11:39:46.076617  1
...

I can resample this data to get the following result, which looks correct:

agg_15m = df[['dipped_time', 'food_type']] \
            .set_index('dipped_time') \
            .groupby('food_type') \
            .resample('15Min') \
            .agg({'food_type': 'count'}) \
            .rename(columns={'food_type':'COUNT'}) \
            .reset_index()
display(agg_15m)

[Image: resampled data]
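For comparison, the same kind of 15-minute count can be sketched with `pd.Grouper` instead of `set_index` plus `resample`. This is only a sketch on hypothetical sample rows shaped like the data above, not the asker's code. As far as I can tell, this variant only returns the (time bin, food type) combinations that actually occur, so it omits the zero-count bins that the `resample` version produces.

```python
import pandas as pd

# Hypothetical sample rows shaped like the question's data
df = pd.DataFrame({
    "food_type": ["CHICKEN FILLETS", "CHICKEN FILLETS",
                  "CRISPY CHICKEN TENDERS", "CRISPY CHICKEN TENDERS"],
    "dipped_time": pd.to_datetime([
        "2019-07-07 11:38:40", "2019-07-07 11:38:53",
        "2019-07-07 11:39:58", "2019-07-07 11:46:10",
    ]),
})

# Bin dipped_time into 15-minute intervals and count rows per (bin, food_type)
agg_15m = (
    df.groupby([pd.Grouper(key="dipped_time", freq="15min"), "food_type"])
      .size()
      .reset_index(name="COUNT")
)
print(agg_15m)
```

The result has the same three columns (`dipped_time`, `food_type`, `COUNT`) that the rest of the question works with.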

Then I can use groupby to get what I believe is the right structure:

group = agg_15m.groupby(['dipped_time', 'food_type'])
display(group.sum())

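[Image: grouped by time and food type]

Just getting this far took a lot of fiddling with the DataFrame, since I'm not really familiar with the concepts behind multi-index data.

Now for the fun part: getting Bokeh to plot this data. This instruction from bokeh seems to point in the right direction; however, it only uses a single groupby. This instruction from bokeh gives some guidance on hierarchical categorical data, but the example only uses literals.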

This is what I tried.

    p = figure(
        title="Baskets Cooked per 15min",
        y_axis_label="Count",
        plot_width=plot_width,
        plot_height=plot_height,
        toolbar_location=toolbar_loc,
    )
    p.vbar(x='dipped_time_food_type', top='COUNT', width=1e3*60*15, source=self.group.sum() )

This gives an empty plot. [Image: empty graph]

If I try to pass the group object as x_range, as per these instructions:

self.p = figure(
            title="Baskets Cooked per 15min",
            y_axis_label="Count",
            plot_width=plot_width,
            plot_height=plot_height,
            toolbar_location=toolbar_loc,
            x_range=group
        )

When setting up the figure, I get the following error, even though the data seems to match the format explained here:

ValueError: expected an element of either Seq(String), Seq(Tuple(String, String)) or Seq(Tuple(String, String, String)), got [(Timestamp('2019-07-07 11:30:00'), 'CHICKEN FILLETS'), (Timestamp('2019-07-07 11:30:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 11:30:00'), 'POPCORN CHICKEN'), (Timestamp('2019-07-07 11:30:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 11:45:00'), 'CHICKEN FILLETS'), (Timestamp('2019-07-07 11:45:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 11:45:00'), 'POPCORN CHICKEN'), (Timestamp('2019-07-07 11:45:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 12:00:00'), 'CHICKEN FILLETS'), (Timestamp('2019-07-07 12:00:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 12:00:00'), 'POPCORN CHICKEN'), (Timestamp('2019-07-07 12:00:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 12:15:00'), 'CHICKEN FILLETS'), (Timestamp('2019-07-07 12:15:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 12:15:00'), 'POPCORN CHICKEN'), (Timestamp('2019-07-07 12:15:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 12:30:00'), 'CHICKEN FILLETS'), (Timestamp('2019-07-07 12:30:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 12:30:00'), 'POPCORN CHICKEN'), (Timestamp('2019-07-07 12:30:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 12:45:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 12:45:00'), 'POPCORN CHICKEN'), (Timestamp('2019-07-07 12:45:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 13:00:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 13:00:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 13:15:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 13:15:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 13:30:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 13:30:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 13:45:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 14:00:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 14:15:00'), 'POTATO FRIES')]
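The error message says each factor must be a string or a tuple of strings, while the tuples here start with a `Timestamp`. A minimal sketch of building all-string factors is shown below; the frame is hypothetical sample data with the same columns as `agg_15m`, and the `"%H:%M"` label format is just one possible choice.

```python
import pandas as pd

# Hypothetical aggregated frame with the same columns as agg_15m
agg_15m = pd.DataFrame({
    "dipped_time": pd.to_datetime([
        "2019-07-07 11:30:00", "2019-07-07 11:30:00", "2019-07-07 11:45:00",
    ]),
    "food_type": ["CHICKEN FILLETS", "POTATO FRIES", "POTATO FRIES"],
    "COUNT": [2, 5, 3],
})

# Format every timestamp as text so each factor is a (str, str) tuple
time_labels = agg_15m["dipped_time"].dt.strftime("%H:%M")
factors = list(zip(time_labels, agg_15m["food_type"]))
print(factors)
# [('11:30', 'CHICKEN FILLETS'), ('11:30', 'POTATO FRIES'), ('11:45', 'POTATO FRIES')]
```

A list like this should be usable as `x_range=FactorRange(*factors)` (with `FactorRange` from `bokeh.models`), with the same tuples supplied as the `x` column of the data source.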

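I've tried several other things as well, but this seems to be the closest I've gotten. I'd appreciate any insight into how the DataFrame should be structured, or into any other silly mistake I'm missing.

Thanks for your help!

Edit: I noticed that the last error is not about the data structure but about the data type. So I converted the datetimes to strings: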

agg_15m = df[['dipped_time', 'food_type']] \
                .set_index('dipped_time') \
                .groupby('food_type') \
                .resample('15Min') \
                .agg({'food_type': 'count'}) \
                .rename(columns={'food_type':'COUNT'}) \
                .reset_index()
agg_15m['dipped_time'] = agg_15m['dipped_time'].to_string()
self.group = agg_15m.groupby(['dipped_time', 'food_type'])
self.p = figure(
            title="Baskets Cooked per 15min",
            y_axis_label="Count",
            plot_width=plot_width,
            plot_height=plot_height,
            toolbar_location=toolbar_loc,
            x_range=self.group
        )
self.p.vbar(x='dipped_time_food_type', top='COUNT_std', width=1, source=ColumnDataSource(self.group))

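Now this gives me a rather ugly plot that doesn't appear to represent the underlying data.

[Image: ugly graph]

This is what I'm trying to achieve: [Image: pretty graph]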

Edit

The string conversion in the previous version was incorrect. Updated to:

agg_15m = df[['dipped_time', 'food_type']] \
                .set_index('dipped_time') \
                .groupby('food_type') \
                .resample('15Min') \
                .agg({'food_type': 'count'}) \
                .rename(columns={'food_type':'COUNT'}) \
                .reset_index()
agg_15m['dipped_time'] = agg_15m['dipped_time'].astype(str)
self.group = agg_15m.groupby(['dipped_time', 'food_type'])
self.p = figure(
            title="Baskets Cooked per 15min",
            y_axis_label="Count",
            plot_width=plot_width,
            plot_height=plot_height,
            toolbar_location=toolbar_loc,
            x_range=self.group
        )
self.p.vbar(x='dipped_time_food_type', top='COUNT_std', width=1, source=ColumnDataSource(self.group))

This gives the correct data, but now the plot is empty, with some artifacts in the corner. [Image: empty graph with artifacts in corner]
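The difference between the two string conversions in the edits above can be seen on a small Series. This is a sketch on made-up timestamps: `Series.to_string()` returns one string rendering the whole Series, whereas `astype(str)` converts each element.

```python
import pandas as pd

times = pd.Series(pd.to_datetime(["2019-07-07 11:30:00", "2019-07-07 11:45:00"]))

whole = times.to_string()   # a single str describing the entire Series
each = times.astype(str)    # a Series holding one str per row

print(type(whole).__name__)
print(each.tolist())
```

Assigning the single string from `to_string()` to a column broadcasts it to every row, so every row ends up with the same label, which would explain why the earlier plot did not reflect the data.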

Edit

I couldn't get that to work, so I went with a manual approach. This code works:

    # convert to datetime so we can resample
    df['dipped_time'] = pd.to_datetime(df['dipped_time'], errors='coerce')
    # group by food type and resample to 15-minute intervals
    self.agg_15m = df[['dipped_time', 'food_type']] \
                .set_index('dipped_time') \
                .groupby('food_type') \
                .resample('15Min') \
                .agg({'food_type': 'count'}) \
                .rename(columns={'food_type':'COUNT'}) \
                .reset_index()
    # Bokeh categorical ranges need strings, not Timestamps
    self.agg_15m['dipped_time'] = self.agg_15m['dipped_time'].astype(str)
    plot_width  = 800
    plot_height = 600
    toolbar_loc = 'above'

    self.p = figure(
            title="Baskets Cooked per 15min",
            y_axis_label="Count",
            plot_width=plot_width,
            plot_height=plot_height,
            toolbar_location=toolbar_loc,
            x_range=sorted(self.agg_15m.dipped_time.unique())
        )
    self.food_types = self.agg_15m.food_type.unique()
    self.data_source = dict(
            x=sorted(self.agg_15m.dipped_time.unique())
        )
    df = self.agg_15m
    for food_type in self.food_types:
            arr = []
            for time in sorted(self.agg_15m.dipped_time.unique()):
                if df.loc[(df["dipped_time"]==time) & (df["food_type"]==food_type), "COUNT"].empty:
                    arr.append(0)
                else:
                    arr.append(df.loc[(df["dipped_time"]==time) & (df["food_type"]==food_type), "COUNT"].values[0])
            self.data_source[food_type] = arr

    fill_colors=[
            Spectral5[i]
            for i in range(len(self.food_types))]

    self.p.vbar_stack(self.food_types, \
                          x='x', \
                          width=0.9, alpha=0.5, \
                          source=ColumnDataSource(self.data_source), \
                          fill_color=fill_colors,
                          legend=[value(x) for x in self.food_types])

[Image: successful graph]

More idiomatic solutions are still welcome.
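One possibly more idiomatic way to build the per-food-type arrays from the loop above is to pivot the long table into a wide one. The following is a sketch on hypothetical sample data with the same columns as `agg_15m`; it fills missing (time, food type) combinations with 0, as the loop does.

```python
import pandas as pd

# Hypothetical aggregated frame; dipped_time is already converted to str
agg_15m = pd.DataFrame({
    "dipped_time": ["2019-07-07 11:30:00", "2019-07-07 11:30:00",
                    "2019-07-07 11:45:00"],
    "food_type": ["CHICKEN FILLETS", "POTATO FRIES", "POTATO FRIES"],
    "COUNT": [2, 5, 3],
})

# One row per time bin, one column per food type, 0 where nothing was cooked
wide = (
    agg_15m.pivot_table(index="dipped_time", columns="food_type",
                        values="COUNT", aggfunc="sum", fill_value=0)
           .sort_index()
)

# Same dict-of-lists layout that the manual loop builds
data_source = {"x": list(wide.index)}
for food_type in wide.columns:
    data_source[food_type] = wide[food_type].tolist()
print(data_source)
```

The resulting dict has the same shape as `self.data_source` in the working code, so it could be passed to `ColumnDataSource` and `vbar_stack` in the same way.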

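1 Answer:

Answer 0 (score: 0)

You are trying to plot COUNT_std as the top of the bars, but if you actually look at the data in the ColumnDataSource, you will see that it contains nothing but NaN values: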

 'COUNT_std': array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]),

Indeed, if you go back to the group and look at the output of group.describe(), you can see that the NaNs come from there:

In [40]: group.describe()
Out[40]:
                                           COUNT
                                           count mean std  min  25%  50%  75%  max
dipped_time         food_type
2019-07-07 12:30:00 POTATO FRIES             1.0  5.0 NaN  5.0  5.0  5.0  5.0  5.0
2019-07-07 12:45:00 CRISPY CHICKEN TENDERS   1.0  3.0 NaN  3.0  3.0  3.0  3.0  3.0
                    POPCORN CHICKEN          1.0  3.0 NaN  3.0  3.0  3.0  3.0  3.0
                    POTATO FRIES             1.0  4.0 NaN  4.0  4.0  4.0  4.0  4.0
2019-07-07 13:00:00 CRISPY CHICKEN TENDERS   1.0  6.0 NaN  6.0  6.0  6.0  6.0  6.0
                    POTATO FRIES             1.0  3.0 NaN  3.0  3.0  3.0  3.0  3.0
2019-07-07 13:15:00 CRISPY CHICKEN TENDERS   1.0  0.0 NaN  0.0  0.0  0.0  0.0  0.0
                    POTATO FRIES             1.0  5.0 NaN  5.0  5.0  5.0  5.0  5.0
2019-07-07 13:30:00 CRISPY CHICKEN TENDERS   1.0  6.0 NaN  6.0  6.0  6.0  6.0  6.0
                    POTATO FRIES             1.0  1.0 NaN  1.0  1.0  1.0  1.0  1.0
2019-07-07 13:45:00 POTATO FRIES             1.0  6.0 NaN  6.0  6.0  6.0  6.0  6.0
2019-07-07 14:00:00 POTATO FRIES             1.0  0.0 NaN  0.0  0.0  0.0  0.0  0.0
2019-07-07 14:15:00 POTATO FRIES             1.0  3.0 NaN  3.0  3.0  3.0  3.0  3.0

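I'm not sure why that column is full of NaNs, but it is the immediate cause of the problem with the last plot. If you instead use a column with valid numeric values, e.g. COUNT_max: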

p.vbar(x='dipped_time_food_type', top='COUNT_max', width=0.9, source=group)

Then you will see a plot similar to the one you are after, modulo visual styling.

Note that I set the bar width to 0.9 so that there is actually some space between the bars.
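A likely explanation for the NaN column noted above is that every (time, food type) group contains exactly one row (the `count` column in the `describe()` output is 1.0 throughout), and the sample standard deviation of a single value is undefined. The sketch below reproduces this on hypothetical data; joining the two-level column names with an underscore is my assumption about how names such as `COUNT_std` arise.

```python
import pandas as pd

# Hypothetical aggregated data: one row per (dipped_time, food_type) pair
agg = pd.DataFrame({
    "dipped_time": ["2019-07-07 12:45:00", "2019-07-07 12:45:00",
                    "2019-07-07 13:00:00"],
    "food_type": ["POPCORN CHICKEN", "POTATO FRIES", "POTATO FRIES"],
    "COUNT": [3, 4, 3],
})

group = agg.groupby(["dipped_time", "food_type"])
stats = group.describe()

# Flatten the two-level column names, e.g. ("COUNT", "std") -> "COUNT_std"
stats.columns = ["_".join(col) for col in stats.columns]
print(stats[["COUNT_count", "COUNT_std", "COUNT_max"]])
```

With one row per group, `COUNT_max`, `COUNT_min` and `COUNT_mean` all equal the original count, so any of them would serve as the bar height.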