Question

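I am trying to randomly assign values from a column in one dataframe to another dataframe, within 12 different categories (by age range and gender). For example, I have two dataframes; let's call one d1 and the other d2.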

  d1:
index agerange gender income
 0     2        1      56700
 1     2        0      25600
 2     4        0      3000
 3     4        0      106000
 4     3        0      200
 5     3        0      43000
 6     4        0      10000000

d2:
index agerange gender 
 0     3        0      
 1     2        0      
 2     4        0      
 3     4        0

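I want to group both dataframes by age range and gender, i.e. 0-1,2,3,4,5,6 & 1-1,2,3,4,5,6 (each gender crossed with age ranges 1 through 6), then randomly select one of the incomes from the matching group in d1 and assign it to d2.

That is: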

d1:
index agerange gender income
 0     2        1      56700
 1     2        0      25600
 2     4        0      3000
 3     4        0      106000
 4     3        0      200
 5     3        0      43000
 6     4        0      10000000

d2:
index agerange gender  income
 0     3        0      200  
 1     2        0      25600 
 2     4        0      10000000
 3     4        0      3000

Answer 1

Option 1
An approach using pd.DataFrame.query and np.random.choice. I am implicitly assuming that a value is drawn at random, with replacement, for each row.

def take_one(x):
    # Select the rows of d1 that belong to the same group as row x
    q = 'agerange == {agerange} and gender == {gender}'.format(**x)
    return np.random.choice(d1.query(q).income)

d2.assign(income=d2.apply(take_one, axis=1))

       agerange  gender  income
index                          
0             3       0     200
1             2       0   25600
2             4       0  106000
3             4       0  106000

Option 2
An attempt to improve efficiency by calling np.random.choice once per group.

# List of incomes for each (agerange, gender) group in d1
g = d1.groupby(['agerange', 'gender']).income.apply(list)
# Draw len(x) incomes for the group x in a single call
f = lambda x: pd.Series(np.random.choice(g.loc[x.name], len(x)), x.index)
d2.assign(income=d2.groupby(['agerange', 'gender'], group_keys=False).apply(f))

       agerange  gender    income
index                            
0             3       0       200
1             2       0     25600
2             4       0  10000000
3             4       0    106000

Debugging and setup

import pandas as pd
import numpy as np

d1 = pd.DataFrame({
        'agerange': [2, 2, 4, 4, 3, 3, 4],
        'gender': [1, 0, 0, 0, 0, 0, 0],
        'income': [56700, 25600, 3000, 106000, 200, 43000, 10000000]
    }, pd.Index([0, 1, 2, 3, 4, 5, 6], name='index')
)

d2 = pd.DataFrame(
    {'agerange': [3, 2, 4, 4], 'gender': [0, 0, 0, 0]},
    pd.Index([0, 1, 2, 3], name='index')
)

# Variant that falls back to zeros when a group of d2 has no match in d1
g = d1.groupby(['agerange', 'gender']).income.apply(list)
f = lambda x: pd.Series(np.random.choice(g.get(x.name, [0] * len(x)), len(x)), x.index)
d2.groupby(['agerange', 'gender'], group_keys=False).apply(f)
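
Since the results above change on every run, here is a minimal sketch of the "one draw per group" idea from Option 2 that can be repeated exactly. It is not the answer's own code: it loops over the groups of d2 instead of using groupby.apply, and the names keys, pools, rng and the seed value 0 are assumptions introduced only for illustration. Groups of d2 with no match in d1 get 0, as in the debugging variant.

```python
import numpy as np
import pandas as pd

d1 = pd.DataFrame({
    'agerange': [2, 2, 4, 4, 3, 3, 4],
    'gender': [1, 0, 0, 0, 0, 0, 0],
    'income': [56700, 25600, 3000, 106000, 200, 43000, 10000000],
})
d2 = pd.DataFrame({'agerange': [3, 2, 4, 4], 'gender': [0, 0, 0, 0]})

keys = ['agerange', 'gender']
rng = np.random.default_rng(0)  # assumed fixed seed, only so reruns give the same draw

# Pool of incomes for each (agerange, gender) group found in d1
pools = {key: grp.to_numpy() for key, grp in d1.groupby(keys)['income']}

# One draw (with replacement) per group of d2, sized to that group
income = pd.Series(0, index=d2.index, dtype='int64')
for key, idx in d2.groupby(keys).groups.items():
    pool = pools.get(key, np.zeros(1, dtype='int64'))  # assumed fallback: 0 for unmatched groups
    income.loc[idx] = rng.choice(pool, size=len(idx))

result = d2.assign(income=income)
print(result)
```

Every income in result comes from the d1 group with the same agerange and gender; the row with agerange 2 and gender 0 always gets 25600 because that group holds a single value.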

Answer 2

How about creating a dictionary of incomes keyed by age range and then mapping a random choice onto it, i.e.

import numpy as np
import pandas as pd

# Based on unutbu's data
df1 = pd.DataFrame({'agerange': [2, 2, 4, 4, 3, 3, 4], 'gender': [1, 0, 0, 0, 0, 0, 0], 'income': [56700, 25600, 3000, 106000, 200, 43000, 10000000], 'index': [0, 1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'agerange': [3, 2, 4, 4], 'gender': [0, 0, 0, 0], 'index': [0, 1, 2, 3]})

# Map each age range to the tuple of incomes observed for it in df1
age_groups = df1.groupby('agerange')['income'].agg(lambda x: tuple(x)).to_dict()
# For every row of df2, pick one income at random from its age range
df2['income'] = df2['agerange'].map(lambda x: np.random.choice(age_groups[x]))

Output:

   agerange  gender  index  income
0         3       0      0   43000
1         2       0      1   25600
2         4       0      2  106000
3         4       0      3  106000

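If you also need to group by gender, you can use apply; and if you want to fill in 0 when a key is not found, you can use an if/else expression, i.e.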

df2 = pd.DataFrame({'agerange': [3, 2, 6, 4], 'gender': [0, 0, 0, 0], 'index': [0, 1, 2, 3]})
df1 = pd.DataFrame({'agerange': [2, 2, 4, 4, 3, 3, 4], 'gender': [1, 0, 0, 0, 0, 0, 0], 'income': [56700, 25600, 3000, 106000, 200, 43000, 10000000], 'index': [0, 1, 2, 3, 4, 5, 6]})


age_groups = df1.groupby(['agerange','gender'])['income'].agg(lambda x: tuple(x)).to_dict()
df2['income'] = df2.apply(lambda x: np.random.choice(age_groups[x['agerange'],x['gender']]) if (x['agerange'],x['gender']) in age_groups else 0,axis=1)

Output:

   agerange  gender  index  income
0         3       0      0   43000
1         2       0      1   25600
2         6       0      2       0
3         4       0      3  106000
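
The if/else lookup above can also be written with dict.get, which may be easier to read. This is only a sketch under assumptions: the helper pick_income, its default parameter and the name pools are hypothetical and do not appear in the answer, and the 'index' column is left out for brevity.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'agerange': [2, 2, 4, 4, 3, 3, 4],
    'gender': [1, 0, 0, 0, 0, 0, 0],
    'income': [56700, 25600, 3000, 106000, 200, 43000, 10000000],
})
df2 = pd.DataFrame({'agerange': [3, 2, 6, 4], 'gender': [0, 0, 0, 0]})

# (agerange, gender) -> tuple of incomes observed in df1
pools = {key: tuple(grp) for key, grp in df1.groupby(['agerange', 'gender'])['income']}

def pick_income(row, default=0):
    """Return a random income from the row's group, or `default` if df1 has no such group."""
    pool = pools.get((row['agerange'], row['gender']))
    return default if pool is None else np.random.choice(pool)

df2['income'] = df2.apply(pick_income, axis=1)
print(df2)
```

The row with agerange 6 has no counterpart in df1, so it receives the default of 0, matching the output shown above.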

Answer 3

d2['income'] = d2.apply(lambda x: d1.loc[(d1.agerange==x.agerange) &(d1.gender == x.gender),'income'].sample(n=1).max(),axis=1)

Output:

   index  agerange  gender  income
0      0         3       0     200
1      1         2       0   25600
2      2         4       0    3000
3      3         4       0  106000
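
The one-liner above is dense, so here is a sketch of the same idea unrolled into a named function. For each row of d2 it filters d1 down to the rows with the same agerange and gender, then samples one income; in the one-liner, .max() only turns the one-element sample into a scalar, and .iloc[0] plays that role here. The helper name sample_income and the NaN fallback for groups missing from d1 are assumptions added for illustration, not part of the answer.

```python
import numpy as np
import pandas as pd

d1 = pd.DataFrame({
    'agerange': [2, 2, 4, 4, 3, 3, 4],
    'gender': [1, 0, 0, 0, 0, 0, 0],
    'income': [56700, 25600, 3000, 106000, 200, 43000, 10000000],
})
d2 = pd.DataFrame({'agerange': [3, 2, 4, 4], 'gender': [0, 0, 0, 0]})

def sample_income(row):
    """Pick one income at random from the rows of d1 in the same group as `row`."""
    same_group = (d1['agerange'] == row['agerange']) & (d1['gender'] == row['gender'])
    candidates = d1.loc[same_group, 'income']
    if candidates.empty:
        return np.nan  # assumed fallback: leave unmatched groups as missing
    return candidates.sample(n=1).iloc[0]

d2['income'] = d2.apply(sample_income, axis=1)
print(d2)
```

Each row is sampled independently, so two rows of d2 in the same group can receive the same income.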

How to randomly assign values between dataframes

3 answers: