Pandas: add quantile categories in a new column

Asked: 2018-08-14 16:50:29

Tags: python pandas dataframe quantile

I have a data frame with three columns:

| A | B | C |

I computed the quantiles:

df.quantile(.25)
df.quantile(.75)

I want to add a new column that uses these quartiles to classify values by a simple rule: if a value is below the first quartile it is small, if it is above the third quartile it is large, and everything in between is medium.

I tried using qcut, but it only accepts one-dimensional input.

Thanks

2 Answers:

Answer 0 (score: 3)

pd.qcut is your friend.

pd.qcut(s, q=[0, .25, .75, 1], labels=['small', 'medium', 'large'])

MWE

print(s)
0     1
1     1
2     2
3     3
4     4
5     2
6     4
7     6
8     4
9     6
10    5
11    4
12    6
13    7
14    3
15    2
16    1
17    1
18    2
dtype: int64

print(pd.qcut(s, q=[0, .25, .75, 1], labels=['small', 'medium', 'large']))
0      small
1      small
2      small
3     medium
4     medium
5      small
6     medium
7      large
8     medium
9      large
10     large
11    medium
12     large
13     large
14    medium
15     small
16     small
17     small
18     small
dtype: category
Categories (3, object): [small < medium < large]
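
To see where the cut points actually fall, `retbins=True` makes `pd.qcut` return the bin edges along with the labels. This is a minimal sketch that rebuilds the series printed above; the variable names `labels` and `edges` are my own. Note that the bins are closed on the right, so a value exactly equal to the first quartile is labelled small, which is slightly looser than the strict "less than" rule in the question.

```python
import pandas as pd

# Same values as the series printed above.
s = pd.Series([1, 1, 2, 3, 4, 2, 4, 6, 4, 6, 5, 4, 6, 7, 3, 2, 1, 1, 2])

# retbins=True also returns the computed edges: min, Q1, Q3, max.
labels, edges = pd.qcut(
    s, q=[0, .25, .75, 1], labels=['small', 'medium', 'large'], retbins=True
)

print(edges)                  # bin edges used for the three categories
print(labels.value_counts())  # how many values landed in each category
```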

For a DataFrame, use apply to repeat this for each column:

df.apply(pd.qcut, q=[0, .25, .75, 1], labels=['small', 'medium', 'large'], axis=0)
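
Since the question asks for new columns rather than a replaced frame, the labels can be joined back onto the original data. The following is only a sketch under a few assumptions: the sample data is made up, the helper `quartile_label` is hypothetical, and the `_Q` suffix is just an illustrative naming choice. Be aware that `pd.qcut` can raise a `ValueError` about non-unique bin edges when a column has many repeated values, which is why continuous sample data is used here.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; continuous values avoid duplicate bin edges.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(20, 3)), columns=list('ABC'))


def quartile_label(col):
    """Label one column by its own 25th and 75th percentiles."""
    return pd.qcut(col, q=[0, .25, .75, 1], labels=['small', 'medium', 'large'])


# Label every column, then rename A -> A_Q, B -> B_Q, C -> C_Q (illustrative).
labels = df.apply(quartile_label).add_suffix('_Q')

# Keep the original values and put the labels next to them.
result = df.join(labels)
print(result.head())
```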

Answer 1 (score: 1)

Setup

import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame(
    np.random.randint(10, size=(10, 3)),
    columns=list('ABC')
)

pandas.DataFrame.mask

Pandas only, and intuitive

is_small = df < df.quantile(.25)
is_large = df > df.quantile(.75)
is_medium = ~(is_small | is_large)

df.mask(is_small, 'small').mask(is_large, 'large').mask(is_medium, 'medium')

        A       B       C
0   small   small  medium
1  medium   large  medium
2   small   large   large
3  medium   small   small
4   small  medium   large
5   large  medium   small
6  medium  medium  medium
7  medium   large  medium
8  medium  medium  medium
9   large  medium   large

Nested numpy.where

is_small = df < df.quantile(.25)
is_large = df > df.quantile(.75)

pd.DataFrame(
    np.where(is_small, 'small', np.where(is_large, 'large', 'medium')),
    df.index, df.columns
)

        A       B       C
0   small   small  medium
1  medium   large  medium
2   small   large   large
3  medium   small   small
4   small  medium   large
5   large  medium   small
6  medium  medium  medium
7  medium   large  medium
8  medium  medium  medium
9   large  medium   large
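
If only one column needs classifying, the same strict comparisons can probably be written more flatly with `numpy.select`, which this answer does not use. A minimal sketch, assuming made-up data, that column `A` is the one of interest, and that the new column is called `Q` (all three are my assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical single-column data.
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

# First and third quartiles of column A.
q1, q3 = df['A'].quantile([.25, .75])

# Strict rule from the question: below Q1 is small, above Q3 is large,
# everything else (including values equal to Q1 or Q3) is medium.
df['Q'] = np.select(
    [df['A'] < q1, df['A'] > q3],
    ['small', 'large'],
    default='medium',
)
print(df)
```

Unlike `pd.qcut`, values that sit exactly on a quartile end up as medium here, which matches the wording of the question.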