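The pandas cut() documentation states: "Out of bounds values will be NA in the resulting Categorical object." This makes things difficult when the upper bound is not necessarily clear or important. For example:
cut (weight, bins=[10,50,100,200])
will produce the bins:
[(10, 50] < (50, 100] < (100, 200]]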
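So cut (250, bins=[10,50,100,200]) will produce NaN, as will cut (5, bins=[10,50,100,200]). What I'm trying to do is produce something like > 200 for the first example and < 10 for the second.
I realize I could do cut (weight, bins=[-float("inf"),10,50,100,200,float("inf")]) or the equivalent, but the report style I am following doesn't allow things like (200, inf]. I also realize that I could specify custom labels via the labels parameter on cut(), but that means remembering to adjust them every time I adjust bins, which could be often.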
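To make the two situations above concrete, here is a minimal sketch. The sample weights and the label strings are illustrative assumptions, not taken from any real report.

```python
import pandas as pd

weights = pd.Series([5, 30, 250])  # illustrative sample values
bins = [10, 50, 100, 200]

# With finite edges only, 5 and 250 fall in no bin and come back as NaN.
finite = pd.cut(weights, bins=bins)

# With infinite outer edges and hand-written labels nothing is lost,
# but the label list has to be edited whenever `bins` changes.
edges = [float("-inf")] + bins + [float("inf")]
labels = ["<= 10", "(10, 50]", "(50, 100]", "(100, 200]", "> 200"]
labeled = pd.cut(weights, bins=edges, labels=labels)

print(finite.isna().tolist())
print(labeled.tolist())
```

Note that with the default right=True the lowest bin is (-inf, 10], so it contains 10 itself; that is why the sketch labels it "<= 10" rather than "< 10".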
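Have I exhausted all the possibilities, or is there something in cut() or elsewhere in pandas that would help me do this? I'm thinking about writing a wrapper function for cut() that would automatically generate labels in the desired format from the bins, but I wanted to check here first.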
Answer 0 (score: 16)
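You can use float("inf") as the upper bound and -float("inf") as the lower bound in your bins list. Out-of-range values then land in the outermost bins instead of becoming NaN.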
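A small sketch of this suggestion, using made-up sample values. The outer categories come back as open-ended intervals, which is the (200, inf] style the question says the report format does not allow.

```python
import pandas as pd

# Infinite outer edges, as suggested above.
bins = [-float("inf"), 10, 50, 100, 200, float("inf")]
values = pd.Series([5, 30, 250])  # illustrative sample values

result = pd.cut(values, bins=bins)

# No NaN values remain; each element is a pandas Interval.
print(result.isna().any())
print(result.iloc[0], result.iloc[2])
```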
Answer 1 (score: 9)
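After waiting a few days, still no answers had been posted. I think this is probably because there is really no way around this other than writing a cut() wrapper function. I'm posting my version of it here and marking the question as answered. I'll change that if new answers come along.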
import pandas as pd

def my_cut(x, bins,
           lower_infinite=True, upper_infinite=True,
           **kwargs):
    r"""Wrapper around pandas cut() to create infinite lower/upper bounds with proper labeling.

    Takes all the same arguments as pandas cut(), plus two more.

    Args :
        lower_infinite (bool, optional) : set whether the lower bound is infinite
            Default is True. If true, and your first bin element is something like 20, the
            first bin label will be '<= 20' (depending on other cut() parameters)
        upper_infinite (bool, optional) : set whether the upper bound is infinite
            Default is True. If true, and your last bin element is something like 20, the
            last bin label will be '> 20' (depending on other cut() parameters)
        **kwargs : any standard pandas cut() keyword parameters (except labels,
            which this wrapper generates itself)
    Returns :
        out : same as pandas cut() return value
        bins : same as pandas cut() return value (only when retbins=True)
    """
    # Quick passthru if no infinite bounds
    if not lower_infinite and not upper_infinite:
        return pd.cut(x, bins, **kwargs)

    # Setup
    num_labels = len(bins) - 1
    include_lowest = kwargs.get("include_lowest", False)
    right = kwargs.get("right", True)

    # Prepend/append infinities where indicated (work on a copy of the caller's bins)
    bins_final = list(bins)
    if upper_infinite:
        bins_final.append(float("inf"))
        num_labels += 1
    if lower_infinite:
        bins_final.insert(0, float("-inf"))
        num_labels += 1

    # Decide all boundary symbols based on traditional cut() parameters.
    # With right=True the lowest bin is (-inf, b], which contains b, so it is "<= b".
    symbol_lower = "<=" if right else "<"
    left_bracket = "(" if right else "["
    right_bracket = "]" if right else ")"
    symbol_upper = ">" if right else ">="

    # Inner function reused in multiple clauses for labeling
    def make_label(i, lb=left_bracket, rb=right_bracket):
        return "{0}{1}, {2}{3}".format(lb, bins_final[i], bins_final[i+1], rb)

    # Create custom labels
    labels = []
    for i in range(num_labels):
        if i == 0:
            if lower_infinite:
                new_label = "{0} {1}".format(symbol_lower, bins_final[i+1])
            elif include_lowest:
                new_label = make_label(i, lb="[")
            else:
                new_label = make_label(i)
        elif upper_infinite and i == (num_labels - 1):
            new_label = "{0} {1}".format(symbol_upper, bins_final[i])
        else:
            new_label = make_label(i)
        labels.append(new_label)

    # Pass thru to pandas cut()
    return pd.cut(x, bins_final, labels=labels, **kwargs)
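The core idea of the wrapper, deriving the labels from the bin edges so the two cannot drift apart, can also be sketched in isolation. The version below is a simplified illustration that only covers the default right=True case; the helper name open_ended_labels and the sample values are assumptions made for this example and are not part of pandas.

```python
import pandas as pd

def open_ended_labels(bins):
    """Hypothetical helper: labels for (-inf, b0], (b0, b1], ..., (bn, inf]."""
    inner = ["({0}, {1}]".format(lo, hi) for lo, hi in zip(bins[:-1], bins[1:])]
    return ["<= {0}".format(bins[0])] + inner + ["> {0}".format(bins[-1])]

bins = [10, 50, 100, 200]
edges = [float("-inf")] + bins + [float("inf")]

# Labels are rebuilt from `bins` on every call, so editing `bins` is enough.
result = pd.cut(pd.Series([5, 10, 75, 250]), bins=edges,
                labels=open_ended_labels(bins))

print(result.tolist())  # ['<= 10', '<= 10', '(50, 100]', '> 200']
```

The value 10 lands in the "<= 10" category because right-closed bins include their upper edge.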