Question

Hello, I am new to pandas and come from a SAS background. I am trying to segment a continuous variable into bands using the following code.

var_range = df['BILL_AMT1'].max() - df['BILL_AMT1'].min()
a= 10
for i in range(1,a):
    inc = var_range/a
    lower_bound = df['BILL_AMT1'].min() + (i-1)*inc
    print('Lower bound is '+str(lower_bound))
    upper_bound = df['BILL_AMT1'].max() + (i)*inc
    print('Upper bound is '+str(upper_bound))
    if (lower_bound <= df['BILL_AMT1'] < upper_bound):
        df['bill_class'] = i
    i+=1

I want the code to check whether each value of df['BILL_AMT1'] falls within the current loop's bounds and set df['bill_class'] accordingly.

I get the following error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

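I think the if condition is being evaluated correctly, and that the error comes from assigning the for loop counter's value to the new column.

Can anyone explain what is going wrong or suggest an alternative?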

Answer 1

To avoid the ValueError, change

if (lower_bound <= df['BILL_AMT1'] < upper_bound):
    df['bill_class'] = i

to

mask = (lower_bound <= df['BILL_AMT1']) & (df['BILL_AMT1'] < upper_bound)
df.loc[mask, 'bill_class'] = i

The chained comparison (lower_bound <= df['BILL_AMT1'] < upper_bound) is equivalent to

(lower_bound <= df['BILL_AMT1']) and (df['BILL_AMT1'] < upper_bound)

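The and operator causes the two boolean Series, (lower_bound <= df['BILL_AMT1']) and (df['BILL_AMT1'] < upper_bound), to be evaluated in a boolean context, i.e. reduced to a single boolean value. Pandas refuses to reduce a Series to a single boolean value, which is what raises the ValueError.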
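To see the failure in isolation, here is a minimal sketch on a made-up three-row Series (the values, the bounds, and the helper `raises_value_error` are illustrative assumptions, not the asker's data or code). Both the chained comparison and the explicit `and` force a Series into a boolean context, so both raise the same ValueError:

```python
import pandas as pd

# Hypothetical sample data, for illustration only
s = pd.Series([5, 15, 25])
lower_bound, upper_bound = 10, 20


def raises_value_error(func):
    """Return True if calling func() raises ValueError."""
    try:
        func()
    except ValueError:
        return True
    return False


# Chained comparison: Python expands it to `(a <= s) and (s < b)`
chained_fails = raises_value_error(lambda: lower_bound <= s < upper_bound)

# Explicit `and`: Python asks the left-hand Series for a single truth value
and_fails = raises_value_error(lambda: (lower_bound <= s) and (s < upper_bound))

print(chained_fails, and_fails)  # True True
```

The error therefore comes from the if condition itself, before any assignment to the new column is attempted.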

Instead, to get back a boolean Series, use the & operator in place of and:

mask = (lower_bound <= df['BILL_AMT1']) & (df['BILL_AMT1'] < upper_bound)

Then, to assign values to the bill_class column only in the rows where mask is True, use df.loc:

df.loc[mask, 'bill_class'] = i
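As a rough sketch of how the mask and the df.loc assignment work together, the following uses a made-up three-row frame and a fixed `i` standing in for the loop counter (both are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({'BILL_AMT1': [5, 15, 25]})
lower_bound, upper_bound = 10, 20
i = 2  # stand-in for the loop counter

# `&` combines the two comparisons element by element
mask = (lower_bound <= df['BILL_AMT1']) & (df['BILL_AMT1'] < upper_bound)

# Only rows where mask is True receive the value; the others stay NaN
df.loc[mask, 'bill_class'] = i

print(mask.tolist())  # [False, True, False]
print(df)
```

Rows that no band ever matches keep NaN in bill_class, which is a handy way to spot values that fell outside every interval.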

To bin the data in df['BILL_AMT1'], you could remove the Python for-loop entirely and, as DSM suggests, use pd.cut:

df['bill_class'] = pd.cut(df['BILL_AMT1'], bins=10, labels=False)+1
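A small sketch of the pd.cut approach on made-up values (the data is an assumption chosen so the band edges are easy to read). With `labels=False`, pd.cut returns integer bin codes starting at 0, so adding 1 yields bands 1 to 10:

```python
import pandas as pd

# Hypothetical sample data spanning 0..100, so each band is 10 wide
df = pd.DataFrame({'BILL_AMT1': [0, 5, 10, 50, 99, 100]})

# Ten equal-width bins over [min, max]; codes 0..9 shifted to 1..10
df['bill_class'] = pd.cut(df['BILL_AMT1'], bins=10, labels=False) + 1

print(df['bill_class'].tolist())  # [1, 1, 1, 5, 10, 10]
```

Note that pd.cut closes each interval on the right by default, so the value 10 lands in band 1 here, whereas the loop's `lower_bound <= x < upper_bound` test would place it in band 2. Passing `right=False` changes which side is closed, if the left-closed convention matters.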

Answer 2

IIUC, this should be a fix for your code:

mx, mn = df['BILL_AMT1'].max(), df['BILL_AMT1'].min()
rng = mx - mn
a = 10

for i in range(a):
    inc = rng / a
    lower_bound = mn + i * inc
    print('Lower bound is ' + str(lower_bound))
    upper_bound = mn + (i + 1) * inc if i + 1 < a else mx
    print('Upper bound is ' + str(upper_bound))
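    ge = df['BILL_AMT1'].ge(lower_bound)
    # Close the last band on the right; with a strict < the row holding the maximum would stay NaN
    lt = df['BILL_AMT1'].lt(upper_bound) if i + 1 < a else df['BILL_AMT1'].le(upper_bound)
    df.loc[ge & lt, 'bill_class'] = i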

However, I would do this:

df['bill_class'] = pd.qcut(df['BILL_AMT1'], 10, list(range(10)))
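One caveat worth illustrating: pd.qcut builds quantile-based bands (roughly equal row counts per band), while the loop and pd.cut build equal-width bands. The sketch below uses made-up, deliberately skewed data and only two bins to make the difference visible:

```python
import pandas as pd

# Hypothetical skewed data: nine small values and one very large one
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])

equal_width = pd.cut(s, bins=2, labels=False)  # split at the midpoint of the value range
equal_count = pd.qcut(s, q=2, labels=False)    # split at the median

# Number of rows that fall in each band
width_counts = equal_width.value_counts().sort_index().tolist()
count_counts = equal_count.value_counts().sort_index().tolist()

print(width_counts)  # [9, 1]
print(count_counts)  # [5, 5]
```

Which of the two is appropriate depends on whether the bands should cover equal ranges of BILL_AMT1 or hold similar numbers of rows.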
