I have a DataFrame named stroke_data_complete, and I bin one of its variables with the following code:
#Cut into 4 bins of equal frequency counts
stroke_data_complete['glucose_level_quartile'] = pd.qcut(stroke_data_complete['avg_glucose_level'], q=4)
stroke_data_complete['glucose_level_quartile'].value_counts();
When I check the dtype of this new column:
stroke_data_complete['glucose_level_quartile'].dtypes
I get:
CategoricalDtype(categories=[(55.119, 77.245], (77.245, 91.885], (91.885, 114.09], (114.09, 271.74]],
ordered=True)
Next, I need to filter on one of the values of this new variable. This is my code:
stroke_data_complete.loc[stroke_data_complete.glucose_level_quartile==(114.09, 271.74]]
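But I get the following error:
SyntaxError: closing parenthesis ']' does not match opening parenthesis '('
If I wrap the interval in quotes when filtering, I get empty output. Could I get some help with how to filter on this newly defined binned variable? Thanks.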
Answer 0 (score: 0)
Try this:
stroke_data_complete['glucose_level_quartile'] = pd.qcut(stroke_data_complete['avg_glucose_level'], q=4, labels=False)
stroke_data_complete.loc[stroke_data_complete.glucose_level_quartile==3]
labels=False makes the column hold the integer index of each quartile (0 to 3) rather than the interval itself.
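To see what labels=False produces, here is a minimal sketch on synthetic data. The DataFrame df and its random values are stand-ins for stroke_data_complete, not the real dataset; only the column names are taken from the question.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for stroke_data_complete (values are made up for illustration).
rng = np.random.default_rng(0)
df = pd.DataFrame({"avg_glucose_level": rng.uniform(55, 272, size=200)})

# With labels=False, qcut returns integer bin indices instead of Interval labels.
df["glucose_level_quartile"] = pd.qcut(df["avg_glucose_level"], q=4, labels=False)

# Index 3 is the highest quartile, so a plain integer comparison is enough.
top = df.loc[df["glucose_level_quartile"] == 3]

codes_present = sorted(df["glucose_level_quartile"].unique())
print(codes_present)
print(len(top))
```

The trade-off is that the interval boundaries are no longer stored in the column, so this fits best when only the quartile rank matters.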
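Without labels=False, qcut returns a categorical Series. The underlying array is a pandas Categorical, which you can access through the Series.array attribute; its API is documented in the pandas reference for Categorical.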
In your example:
quartiles = pd.qcut(stroke_data_complete['avg_glucose_level'], q=4)
quartiles = quartiles.array
stroke_data_q_3 = stroke_data_complete.loc[quartiles.codes == 3]
avg_glucose_level_interval_q_3 = quartiles.categories[3]
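If the interval column should stay as it is, the same filter can be written against the existing column without re-binning. The sketch below again uses a synthetic df as a stand-in for stroke_data_complete. It also shows why the original attempts failed: (114.09, 271.74] is not valid Python syntax, and a quoted string is not one of the categories, because each category is a pd.Interval object.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for stroke_data_complete (values are made up for illustration).
rng = np.random.default_rng(0)
df = pd.DataFrame({"avg_glucose_level": rng.uniform(55, 272, size=200)})
df["glucose_level_quartile"] = pd.qcut(df["avg_glucose_level"], q=4)

quartile = df["glucose_level_quartile"]

# Option 1: compare the integer codes exposed by the .cat accessor.
by_code = df.loc[quartile.cat.codes == 3]

# Option 2: compare against the Interval object itself, read from the categories
# so the endpoints match the stored ones exactly.
top_interval = quartile.cat.categories[-1]
by_interval = df.loc[quartile == top_interval]

# A quoted interval is just a string, not a category, so it matches no rows.
by_string = df.loc[quartile == str(top_interval)]

print(type(top_interval).__name__)
print(len(by_code), len(by_interval), len(by_string))
```

An interval can also be built by hand with pd.Interval(left, right, closed='right'), but it only matches if its endpoints equal the stored ones exactly, so reading it from the categories is usually the safer choice.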
Hope this helps.