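I know I'm close, but I just can't get Bokeh to do what I want. I need to resample time data into 15-minute intervals, then group it by hierarchical categorical types, and plot the results in time groups. Any help would be appreciated.
I have data like the following: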
basket_id food_type classified_time dipped_time slot_number
0 185261 CHICKEN FILLETS 2019-07-07 11:38:23.153858 2019-07-07 11:38:40.271070 8
1 185263 CHICKEN FILLETS 2019-07-07 11:38:25.831668 2019-07-07 11:38:53.265553 4
2 185273 CRISPY CHICKEN TENDERS 2019-07-07 11:39:26.184932 2019-07-07 11:39:58.164302 5
3 185276 CRISPY CHICKEN TENDERS 2019-07-07 11:39:30.178273 2019-07-07 11:39:46.076617 1
...
I can resample this data to get the following result, which looks correct:
agg_15m = df[['dipped_time', 'food_type']] \
.set_index('dipped_time') \
.groupby('food_type') \
.resample('15Min') \
.agg({'food_type': 'count'}) \
.rename(columns={'food_type':'COUNT'}) \
.reset_index()
display(agg_15m)
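For comparison, the same kind of 15-minute count can be sketched with `pd.Grouper` instead of `set_index` plus `resample`. This is only an illustration under assumptions: the sample rows below are made up, and unlike `resample`, this form lists only the (time bin, food type) pairs that actually occur rather than filling empty bins with 0.

```python
import pandas as pd

# Tiny made-up sample in the same shape as the question's data.
df = pd.DataFrame({
    "food_type": ["CHICKEN FILLETS", "CHICKEN FILLETS", "POTATO FRIES", "POTATO FRIES"],
    "dipped_time": pd.to_datetime([
        "2019-07-07 11:38:40", "2019-07-07 11:38:53",
        "2019-07-07 11:39:58", "2019-07-07 12:01:10",
    ]),
})

# Bucket dipped_time into 15-minute bins and count rows per (bin, food_type).
counts = (
    df.groupby([pd.Grouper(key="dipped_time", freq="15min"), "food_type"])
      .size()
      .reset_index(name="COUNT")
)
print(counts)
```

The result is a flat frame with `dipped_time`, `food_type` and `COUNT` columns, so no further `groupby` is needed just to inspect it.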
Then I can use groupby to get what I believe is the correct structure:
group = agg_15m.groupby(['dipped_time', 'food_type'])
display(group.sum())
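This alone took a lot of wrangling with the DataFrame, since I'm not really familiar with the concepts of working with multi-index data.
Now for the fun part: trying to get Bokeh to do something with this data. This instruction from bokeh seems to point in the right direction; however, it only uses a single groupby. This instruction from bokeh gives some guidance on hierarchical categorical data, but the example only uses literals.
Here is what I tried.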
p = figure(
title="Baskets Cooked per 15min",
y_axis_label="Count",
plot_width=plot_width,
plot_height=plot_height,
toolbar_location=toolbar_loc,
)
p.vbar(x='dipped_time_food_type', top='COUNT', width=1e3*60*15, source=self.group.sum() )
If I try to pass the group object as x_range, as per these instructions:
self.p = figure(
title="Baskets Cooked per 15min",
y_axis_label="Count",
plot_width=plot_width,
plot_height=plot_height,
toolbar_location=toolbar_loc,
x_range=group
)
When setting up the figure, I get the following error, even though the data seems to follow the format explained here:
ValueError: expected an element of either Seq(String), Seq(Tuple(String, String)) or Seq(Tuple(String, String, String)), got [(Timestamp('2019-07-07 11:30:00'), 'CHICKEN FILLETS'), (Timestamp('2019-07-07 11:30:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 11:30:00'), 'POPCORN CHICKEN'), (Timestamp('2019-07-07 11:30:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 11:45:00'), 'CHICKEN FILLETS'), (Timestamp('2019-07-07 11:45:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 11:45:00'), 'POPCORN CHICKEN'), (Timestamp('2019-07-07 11:45:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 12:00:00'), 'CHICKEN FILLETS'), (Timestamp('2019-07-07 12:00:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 12:00:00'), 'POPCORN CHICKEN'), (Timestamp('2019-07-07 12:00:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 12:15:00'), 'CHICKEN FILLETS'), (Timestamp('2019-07-07 12:15:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 12:15:00'), 'POPCORN CHICKEN'), (Timestamp('2019-07-07 12:15:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 12:30:00'), 'CHICKEN FILLETS'), (Timestamp('2019-07-07 12:30:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 12:30:00'), 'POPCORN CHICKEN'), (Timestamp('2019-07-07 12:30:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 12:45:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 12:45:00'), 'POPCORN CHICKEN'), (Timestamp('2019-07-07 12:45:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 13:00:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 13:00:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 13:15:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 13:15:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 13:30:00'), 'CRISPY CHICKEN TENDERS'), (Timestamp('2019-07-07 13:30:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 13:45:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 14:00:00'), 'POTATO FRIES'), (Timestamp('2019-07-07 14:15:00'), 'POTATO FRIES')]
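Reading the message, every accepted form is a sequence of strings or of tuples of strings, while the list that was passed contains `Timestamp` objects. A minimal sketch of turning such pairs into string pairs follows; the index values are a made-up subset of the ones in the error, and the `strftime` format is just one possible choice.

```python
import pandas as pd

# Hypothetical index values in the same shape as the list in the error message.
index = pd.MultiIndex.from_tuples(
    [
        (pd.Timestamp("2019-07-07 11:30:00"), "CHICKEN FILLETS"),
        (pd.Timestamp("2019-07-07 11:30:00"), "POTATO FRIES"),
        (pd.Timestamp("2019-07-07 11:45:00"), "POTATO FRIES"),
    ],
    names=["dipped_time", "food_type"],
)

# Turn every (Timestamp, str) pair into a (str, str) pair.
factors = [(ts.strftime("%Y-%m-%d %H:%M"), food) for ts, food in index]
print(factors)
```

A list like `factors` matches the `Seq(Tuple(String, String))` form named in the error.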
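I have tried several other things as well, but this seems to be the closest I have come. Any insight into how the DataFrame should be structured, or into any silly mistake I am making, would be appreciated.
Thanks for your help!
Edit: I noticed that the last error is not about the data structure but about the data type. So I converted the datetimes to strings: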
agg_15m = df[['dipped_time', 'food_type']] \
.set_index('dipped_time') \
.groupby('food_type') \
.resample('15Min') \
.agg({'food_type': 'count'}) \
.rename(columns={'food_type':'COUNT'}) \
.reset_index()
agg_15m['dipped_time'] = agg_15m['dipped_time'].to_string()
self.group = agg_15m.groupby(['dipped_time', 'food_type'])
self.p = figure(
title="Baskets Cooked per 15min",
y_axis_label="Count",
plot_width=plot_width,
plot_height=plot_height,
toolbar_location=toolbar_loc,
x_range=self.group
)
self.p.vbar(x='dipped_time_food_type', top='COUNT_std', width=1, source=ColumnDataSource(self.group))
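This now gives me a rather ugly plot that does not seem to represent the underlying data.
Edit
The string conversion in the previous version was incorrect. Updated to: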
agg_15m = df[['dipped_time', 'food_type']] \
.set_index('dipped_time') \
.groupby('food_type') \
.resample('15Min') \
.agg({'food_type': 'count'}) \
.rename(columns={'food_type':'COUNT'}) \
.reset_index()
agg_15m['dipped_time'] = agg_15m['dipped_time'].astype(str)
self.group = agg_15m.groupby(['dipped_time', 'food_type'])
self.p = figure(
title="Baskets Cooked per 15min",
y_axis_label="Count",
plot_width=plot_width,
plot_height=plot_height,
toolbar_location=toolbar_loc,
x_range=self.group
)
self.p.vbar(x='dipped_time_food_type', top='COUNT_std', width=1, source=ColumnDataSource(self.group))
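The difference between the two conversions can be seen in a small sketch (the two timestamps are made up). `Series.to_string()` renders the whole Series into one single string, so assigning it back to a column would put that same long string in every row, which may explain the odd plot. `astype(str)` converts each element separately.

```python
import pandas as pd

times = pd.Series(pd.to_datetime(["2019-07-07 11:30:00", "2019-07-07 11:45:00"]))

as_one_string = times.to_string()  # a single str rendering the whole Series
per_element = times.astype(str)    # a Series holding one str per row

print(type(as_one_string).__name__)
print(per_element.tolist())
```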
Edit
I couldn't get this to work properly, so I went with a manual approach. This code works:
df['dipped_time'] = pd.to_datetime(df['dipped_time'], errors='coerce') #convert to datetime so we can resample
#group by food and resample to 15min intervals
agg_15m = df[['dipped_time', 'food_type']] \
.set_index('dipped_time') \
.groupby('food_type') \
.resample('15Min') \
.agg({'food_type': 'count'}) \
.rename(columns={'food_type':'COUNT'}) \
.reset_index()
agg_15m['dipped_time'] = agg_15m['dipped_time'].astype(str)
self.agg_15m = agg_15m  # the lines below read the frame from self
plot_width = 800
plot_height = 600
toolbar_loc = 'above'
self.p = figure(
title="Baskets Cooked per 15min",
y_axis_label="Count",
plot_width=plot_width,
plot_height=plot_height,
toolbar_location=toolbar_loc,
x_range=sorted(self.agg_15m.dipped_time.unique())
)
self.food_types = self.agg_15m.food_type.unique()
self.data_source = dict(
x=sorted(self.agg_15m.dipped_time.unique())
)
df = self.agg_15m
for food_type in self.food_types:
    arr = []
    for time in sorted(self.agg_15m.dipped_time.unique()):
        if df.loc[(df["dipped_time"]==time) & (df["food_type"]==food_type), "COUNT"].empty:
            arr.append(0)
        else:
            arr.append(df.loc[(df["dipped_time"]==time) & (df["food_type"]==food_type), "COUNT"].values[0])
    self.data_source[food_type] = arr
fill_colors=[
Spectral5[i]
for i in range(len(self.food_types))]
self.p.vbar_stack(self.food_types, \
x='x', \
width=0.9, alpha=0.5, \
source=ColumnDataSource(self.data_source), \
fill_color=fill_colors,
legend=[value(x) for x in self.food_types])
More idiomatic solutions are still welcome.
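As one possible direction, the nested loop that builds the per-food-type arrays could probably be replaced by a pivot. The sketch below assumes a long-form frame shaped like `agg_15m` with the times already converted to strings; the three rows are made up, and the `wide` and `data_source` names are only illustrative.

```python
import pandas as pd

# Made-up long-form counts shaped like agg_15m (times already converted to str).
agg_15m = pd.DataFrame({
    "dipped_time": ["2019-07-07 11:30:00", "2019-07-07 11:30:00", "2019-07-07 11:45:00"],
    "food_type": ["CHICKEN FILLETS", "POTATO FRIES", "POTATO FRIES"],
    "COUNT": [2, 1, 4],
})

# One row per time bin, one column per food type, 0 where a pair is missing.
wide = agg_15m.pivot_table(
    index="dipped_time", columns="food_type", values="COUNT",
    aggfunc="sum", fill_value=0,
)

# Same dict layout as the manual loop: an "x" list plus one list per food type.
food_types = list(wide.columns)
data_source = {"x": list(wide.index)}
for food_type in food_types:
    data_source[food_type] = wide[food_type].tolist()
print(data_source)
```

The resulting dict has the same layout as the one built by hand, so it should be usable with `vbar_stack` in the same way.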
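Answer 0 (score: 0)
You are trying to plot COUNT_std as the top of the bars, but if you actually look at the data in the ColumnDataSource, you will see that it contains nothing but NaN values: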
'COUNT_std': array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]),
Indeed, if you go back to the group and look at the output of group.describe(), you can see that the NaNs come from there:
In [40]: group.describe()
Out[40]:
                                           COUNT
                                           count mean std  min  25%  50%  75%  max
dipped_time         food_type
2019-07-07 12:30:00 POTATO FRIES             1.0  5.0 NaN  5.0  5.0  5.0  5.0  5.0
2019-07-07 12:45:00 CRISPY CHICKEN TENDERS   1.0  3.0 NaN  3.0  3.0  3.0  3.0  3.0
                    POPCORN CHICKEN          1.0  3.0 NaN  3.0  3.0  3.0  3.0  3.0
                    POTATO FRIES             1.0  4.0 NaN  4.0  4.0  4.0  4.0  4.0
2019-07-07 13:00:00 CRISPY CHICKEN TENDERS   1.0  6.0 NaN  6.0  6.0  6.0  6.0  6.0
                    POTATO FRIES             1.0  3.0 NaN  3.0  3.0  3.0  3.0  3.0
2019-07-07 13:15:00 CRISPY CHICKEN TENDERS   1.0  0.0 NaN  0.0  0.0  0.0  0.0  0.0
                    POTATO FRIES             1.0  5.0 NaN  5.0  5.0  5.0  5.0  5.0
2019-07-07 13:30:00 CRISPY CHICKEN TENDERS   1.0  6.0 NaN  6.0  6.0  6.0  6.0  6.0
                    POTATO FRIES             1.0  1.0 NaN  1.0  1.0  1.0  1.0  1.0
2019-07-07 13:45:00 POTATO FRIES             1.0  6.0 NaN  6.0  6.0  6.0  6.0  6.0
2019-07-07 14:00:00 POTATO FRIES             1.0  0.0 NaN  0.0  0.0  0.0  0.0  0.0
2019-07-07 14:15:00 POTATO FRIES             1.0  3.0 NaN  3.0  3.0  3.0  3.0  3.0
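I'm not sure why that column is full of NaNs (most likely because every group contains a single row, as the count of 1.0 suggests, and the sample standard deviation of a single value is undefined), but it is the immediate cause of the problem with the last plot. If you instead use a column with valid numerical values, for example COUNT_max: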
p.vbar(x='dipped_time_food_type', top='COUNT_max', width=0.9, source=group)
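Then you will see a plot similar to what you are after, modulo any visual styling.
Note that I set the bar width to 0.9 so that there is actually some space between the bars.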
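As for why the std column is all NaN: a likely explanation is that each (dipped_time, food_type) group holds exactly one row, and pandas computes the sample standard deviation (ddof=1), which is undefined for a single value. A small sketch with made-up counts shows the effect:

```python
import pandas as pd

# Made-up counts: one row per (dipped_time, food_type), like the aggregated frame.
agg = pd.DataFrame({
    "dipped_time": ["12:30", "12:45", "12:45"],
    "food_type": ["POTATO FRIES", "POPCORN CHICKEN", "POTATO FRIES"],
    "COUNT": [5, 3, 4],
})

# Per-group statistics; every group has exactly one row.
stats = agg.groupby(["dipped_time", "food_type"])["COUNT"].agg(["count", "std", "max"])
print(stats)
```

With one row per group, max, min and mean all equal the count itself, so any of them can serve as the bar height.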