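I would like a function, for example get_cluster(df, numspan), that takes a pandas DataFrame df and an integer numspan as input and returns a DataFrame df_cluster of labels (numbers). Each label indicates which subset a value belongs to, where the subset width is the difference between the DataFrame's max and min divided by numspan.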
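In other words: given a df with values (not necessarily ordered, possibly real) 1, 2, 3, 4, 5, with max 5 and min 1, the main set width is 5 - 1 = 4. With numspan 2, the subset unit width is 2 and the resulting labels are 1, 1, 2, 2, 2 (by rule, the last label includes the value at the maximum upper bound).

My code (another example, see the picture below):

    import pandas as pd

    df = pd.DataFrame({'A': pd.Series([4, 8, 2, 3])})

    def get_cluster(df, numspan):
        vmin = df.min()               # e.g. 2 (renamed to avoid shadowing the builtin min)
        vmax = df.max()               # e.g. 8 (renamed to avoid shadowing the builtin max)
        span = vmax - vmin            # e.g. 6
        subset_unit = span / numspan  # e.g. 6/3 = 2 -> every subset is 2 wide
        # code I need...
        return df_cluster

    df['Cluster'] = get_cluster(df, 3)
    df

In the picture:

       A  Cluster
    0  4        2
    1  8        3  <= included by rule
    2  2        1
    3  3        1

Thank you very much for your help and time,
Gilberto

Thanks to @Boud, the quick and elegant solution is pd.cut, as described in the answer below.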
Answer 0 (score: 1)

This is called pd.cut, where the bins= argument lets you set the number you call numspan in the question. It returns bin ranges by default; labels=False is a parameter you can use to get a bin number instead.
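As a minimal sketch of how this could look for the question's get_cluster, the block below wraps pd.cut with bins=numspan and labels=False. Two things here are assumptions rather than part of the answer: right=False is added so that the intervals are closed on the left, which is what the question's expected labels imply (4 falls into the second subset, and the maximum is still included in the last one), and 1 is added because pd.cut numbers bins from 0. The function also assumes the values live in the first column of df.

```python
import pandas as pd

def get_cluster(df, numspan):
    """Return 1-based subset labels for the first column of df."""
    values = df.iloc[:, 0]
    # bins=numspan splits [min, max] into numspan equal-width intervals.
    # labels=False returns integer bin numbers instead of interval ranges.
    # right=False (an assumption) makes intervals left-closed: [2, 4), [4, 6), ...
    codes = pd.cut(values, bins=numspan, right=False, labels=False)
    return codes + 1  # shift from 0-based bin numbers to 1-based labels

df = pd.DataFrame({'A': [4, 8, 2, 3]})
df['Cluster'] = get_cluster(df, 3)
print(df)
```

With the default right=True the intervals are closed on the right instead, so the value 4 would land in the first subset rather than the second; which convention is wanted depends on the rule in the question.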