Create new pandas columns based on conditions on existing columns

Time: 2020-10-22 07:02:38

Tags: python pandas

I have a DataFrame as shown below:

import numpy as np
import pandas as pd

col1 = ['a','b','c','a','c','a','b','c','a']
col2 = [1,1,0,1,1,0,1,1,0]
df2 = pd.DataFrame(zip(col1,col2),columns=['name','count'])

    name    count
0   a       1
1   b       1
2   c       0
3   a       1
4   c       1
5   a       0
6   b       1
7   c       1
8   a       0

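For each element in the "name" column, I am trying to find the ratio of the number of zeros to the total number of zeros and ones. First, I aggregate the counts as follows: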

for j in df2.name.unique():
    print(j)
    zero_ct = zero_one_frequencies[zero_one_frequencies['name'] == j][0]
    full_ct = zero_one_frequencies[zero_one_frequencies['name'] == j][0] + zero_one_frequencies[zero_one_frequencies['name'] == j][1]
    zero_pb = zero_ct / full_ct
    one_pb = 1 - zero_pb
    print(f"ZERO ratios for {j} = {zero_pb}")
    print(f"One ratios for {j} = {one_pb}")
    print("="*30)

The output looks like:

a
ZERO ratios for a = 0    0.5
dtype: float64
One ratios for a = 0    0.5
dtype: float64
==============================
b
ZERO ratios for b = 1    0.0
dtype: float64
One ratios for b = 1    1.0
dtype: float64
==============================
c
ZERO ratios for c = 2    0.333333
dtype: float64
One ratios for c = 2    0.666667
dtype: float64
==============================
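
The loop above reads from zero_one_frequencies, whose construction is not shown in the post. As a minimal sketch, one table that would be consistent with the printed output (one row per name, integer columns 0 and 1, default row index) could be built with pd.crosstab. This is an assumption about the original code, not a reconstruction of it:

```python
import pandas as pd

col1 = ['a', 'b', 'c', 'a', 'c', 'a', 'b', 'c', 'a']
col2 = [1, 1, 0, 1, 1, 0, 1, 1, 0]
df2 = pd.DataFrame(zip(col1, col2), columns=['name', 'count'])

# Assumption: one row per name, with columns 0 and 1 holding the counts
# of zeros and ones; reset_index() turns 'name' back into a regular column.
zero_one_frequencies = pd.crosstab(df2['name'], df2['count']).reset_index()

# Ratio of zeros per name, computed on the whole table at once
zero_ratio = zero_one_frequencies[0] / (zero_one_frequencies[0] + zero_one_frequencies[1])
ratios = dict(zip(zero_one_frequencies['name'], zero_ratio))
print(ratios)
```

With such a table, the filter zero_one_frequencies['name'] == j selects a single row, which is why each printed ratio appears as a one-element Series rather than a plain number.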

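My goal is to add 2 new columns to the DataFrame, "name_0" and "name_1", holding the ratio values for each element in the "name" column. I tried the following, but it does not give the expected result: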

for j in df2.name.unique():
    print(j)
    zero_ct = zero_one_frequencies[zero_one_frequencies['name'] == j][0]
    full_ct = zero_one_frequencies[zero_one_frequencies['name'] == j][0] + zero_one_frequencies[zero_one_frequencies['name'] == j][1]
    zero_pb = zero_ct / full_ct
    one_pb = 1 - zero_pb
    print(f"ZERO Probability for {j} = {zero_pb}")
    print(f"One Probability for {j} = {one_pb}")
    print("="*30)
    
    condition1 = [ df2['name'].eq(j) & df2['count'].eq(0)]
    condition2 = [ df2['name'].eq(j) & df2['count'].eq(1)]
    choice1 = zero_pb.tolist()
    choice2 = one_pb.tolist()

    print(f'choice1 = {choice1}, choice2 = {choice2}')
    df2["name"+str("_0")] = np.select(condition1, choice1, default=0)
    df2["name"+str("_1")] = np.select(condition2, choice2, default=0)

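The columns end up holding only the values for the name element 'c'. That is expected, because each iteration overwrites both columns entirely, so only the values computed in the last iteration survive.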

Could you help me understand whether there is another way to use np.select effectively here?

Expected output:

    name    count   name_0      name_1
0   a       1       0.000000    0.500000
1   b       1       0.000000    1.000000
2   c       0       0.333333    0.000000
3   a       1       0.000000    0.500000
4   c       1       0.000000    0.666667
5   a       0       0.500000    0.000000
6   b       1       0.000000    1.000000
7   c       1       0.000000    0.666667
8   a       0       0.500000    0.000000

2 Answers:

Answer 0: (score: 1)

I don't have access to the zero_one_frequencies DataFrame, so I took the liberty of solving the problem in my own way.

import pandas as pd
import numpy as np
col1 = ['a','b','c','a','c','a','b','c','a']
col2 = [1,1,0,1,1,0,1,1,0]
df2 = pd.DataFrame(zip(col1,col2),columns=['name','count'])

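# Start from float columns so the ratios are not written into integer columns
df2["name_0"] = 0.0
df2["name_1"] = 0.0

for name in df2['name'].unique():
  df_name = df2[df2['name'] == name]
  prob_1 = df_name['count'].mean()  # share of ones for this name
  for count in df2['count'].unique():
    mask = (df2['name'] == name) & (df2['count'] == count)
    # A single .loc call avoids chained assignment.
    # The expression gives prob_1 for count 1 and 1 - prob_1 for count 0.
    df2.loc[mask, "name_" + str(count)] = np.abs(((count + 1) % 2) - prob_1)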

Output:

    name    count   name_0      name_1
0   a       1       0.000000    0.500000
1   b       1       0.000000    1.000000
2   c       0       0.333333    0.000000
3   a       1       0.000000    0.500000
4   c       1       0.000000    0.666667
5   a       0       0.500000    0.000000
6   b       1       0.000000    1.000000
7   c       1       0.000000    0.666667
8   a       0       0.500000    0.000000

To understand np.select, I suggest you refer to this post.
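
As a rough sketch of how np.select could be applied to this problem, the conditions for all names can be collected into one list and passed in a single call, so that no iteration overwrites the result of another. The variable names (names, one_ratio, zero_ratio, cond_zero, cond_one) are illustrative and not taken from the question:

```python
import numpy as np
import pandas as pd

col1 = ['a', 'b', 'c', 'a', 'c', 'a', 'b', 'c', 'a']
col2 = [1, 1, 0, 1, 1, 0, 1, 1, 0]
df2 = pd.DataFrame(zip(col1, col2), columns=['name', 'count'])

names = df2['name'].unique()
# count only holds 0 and 1, so the group mean is the ratio of ones
one_ratio = df2.groupby('name')['count'].mean()
zero_ratio = 1 - one_ratio

# One condition per name; the matching choice is that name's ratio.
cond_zero = [df2['name'].eq(n) & df2['count'].eq(0) for n in names]
cond_one = [df2['name'].eq(n) & df2['count'].eq(1) for n in names]

# Rows that match no condition receive the default value.
df2['name_0'] = np.select(cond_zero, [zero_ratio[n] for n in names], default=0.0)
df2['name_1'] = np.select(cond_one, [one_ratio[n] for n in names], default=0.0)
print(df2)
```

The point of the design is that np.select is called once per output column rather than once per name, which is what caused only the values for 'c' to survive in the question's attempt.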

Answer 1: (score: 0)

The following code solves the problem. However, I could not find a way to use numpy.select to get the same effect.

df2["name_0"] = 0.0
df2["name_1"] = 0.0
for j in df2.name.unique():
    print(j)
    # zero_one_frequencies is the count table used in the question (its construction is not shown)
    row = zero_one_frequencies[zero_one_frequencies['name'] == j]
    zero_ct = row[0]
    full_ct = row[0] + row[1]
    zero_pb = zero_ct / full_ct
    one_pb = 1 - zero_pb
    print(f"ZERO Probability for {j} = {zero_pb.tolist()[0]}")
    print(f"One Probability for {j} = {one_pb.tolist()[0]}")
    print("="*30)
    for idx in df2[df2['name'] == j].index:
        print("Index:::", idx)
        count = df2.at[idx, 'count']  # label-based lookup, consistent with the .at writes below
        if count == 0:
            df2.at[idx, "name_0"] = zero_pb.tolist()[0]
            print(f'Count for {j} at index {idx} is {count}')
            print('printing name_0: ', df2.at[idx, "name_0"])
            print("*"*30)
        elif count == 1:
            df2.at[idx, "name_1"] = one_pb.tolist()[0]
            print(f'Count for {j} at index {idx} is {count}')
            print('printing name_1: ', df2.at[idx, "name_1"])
            print("*"*30)
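
For comparison, the row-by-row loop above could probably be replaced by a loop-free version. The sketch below assumes that "count" only ever holds 0 and 1, so the per-name mean equals the ratio of ones. It uses groupby().transform to broadcast that ratio back to every row and Series.where to zero out the rows that do not apply:

```python
import pandas as pd

col1 = ['a', 'b', 'c', 'a', 'c', 'a', 'b', 'c', 'a']
col2 = [1, 1, 0, 1, 1, 0, 1, 1, 0]
df2 = pd.DataFrame(zip(col1, col2), columns=['name', 'count'])

# Ratio of ones within each name group, aligned with the original rows
one_pb = df2.groupby('name')['count'].transform('mean')

# Keep the ratio only on rows whose count matches the column; use 0.0 elsewhere
df2['name_0'] = (1 - one_pb).where(df2['count'].eq(0), 0.0)
df2['name_1'] = one_pb.where(df2['count'].eq(1), 0.0)
print(df2)
```

This version does not need the zero_one_frequencies table at all, since the ratios are derived directly from df2.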