Generating random data from a multi-index grouped-by DataFrame object in Python

Asked: 2019-02-27 15:21:07

Tags: python pandas numpy pandas-groupby

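The table below holds summary statistics of expenses for each leader and expense type. I have it stored in Python as a multi-index DataFrame object. My goal is to generate random data for each leader and expense type, using the mean and standard deviation listed for each category. The "count" column gives how many random numbers I want to generate for each Leader/Expense_Type combination. I have come up with long, inefficient loop structures that do not seem to do the job correctly. How should I approach this?

Note: this is only a sample of the data. There are more leaders, with more expense types.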

<table border="1" class="dataframe">  <thead>    <tr>      <th></th>      <th></th>      <th colspan="3" halign="left">Expense_Amount</th>    </tr>    <tr>      <th></th>      <th></th>      <th>mean</th>      <th>std</th>      <th>count</th>    </tr>    <tr>      <th>Leader</th>      <th>Expense_Type</th>      <th></th>      <th></th>      <th></th>    </tr>  </thead>  <tbody>    <tr>      <th rowspan="7" valign="top">Leader1</th>      <th>Airfare</th>      <td>1979.684219</td>      <td>2731.629767</td>      <td>1358</td>    </tr>    <tr>      <th>Booking Fees</th>      <td>118.994538</td>      <td>270.007390</td>      <td>1179</td>    </tr>    <tr>      <th>Conference/Seminars</th>      <td>1553.830923</td>      <td>1319.295946</td>      <td>65</td>    </tr>    <tr>      <th>Hotel</th>      <td>1656.643658</td>      <td>2104.721093</td>      <td>1405</td>    </tr>    <tr>      <th>Meals</th>      <td>435.665122</td>      <td>676.705857</td>      <td>1476</td>    </tr>    <tr>      <th>Mileage</th>      <td>213.785046</td>      <td>284.908031</td>      <td>979</td>    </tr>    <tr>      <th>Taxi/Uber</th>      <td>308.530724</td>      <td>380.288964</td>      <td>1422</td>    </tr>    <tr>      <th rowspan="7" valign="top">Leader2</th>      <th>Airfare</th>      <td>1730.196911</td>      <td>2334.688155</td>      <td>628</td>    </tr>    <tr>      <th>Booking Fees</th>      <td>112.020556</td>      <td>573.407269</td>      <td>576</td>    </tr>    <tr>      <th>Conference/Seminars</th>      <td>1647.576500</td>      <td>1154.320584</td>      <td>80</td>    </tr>    <tr>      <th>Hotel</th>      <td>1693.080356</td>      <td>1953.552474</td>      <td>618</td>    </tr>    <tr>      <th>Meals</th>      <td>574.228548</td>      <td>844.997595</td>      <td>620</td>    </tr>    <tr>      <th>Mileage</th>      <td>215.898798</td>      <td>291.231331</td>      <td>466</td>    </tr>    <tr>      <th>Taxi/Uber</th>      <td>298.655852</td>      <td>340.926518</td>      
<td>569</td>    </tr>  </tbody></table>
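To try out the approaches discussed here, a frame shaped like the table above can be rebuilt by hand. The following is a minimal sketch that copies only four rows from the table; the name `agg_data` is taken from the second answer, and the two-level column layout is an assumption based on how the table is rendered.

```python
import pandas as pd

# Four rows copied from the table above: (mean, std, count)
rows = {
    ("Leader1", "Airfare"): (1979.684219, 2731.629767, 1358),
    ("Leader1", "Booking Fees"): (118.994538, 270.007390, 1179),
    ("Leader2", "Airfare"): (1730.196911, 2334.688155, 628),
    ("Leader2", "Booking Fees"): (112.020556, 573.407269, 576),
}

# Two-level row index: Leader, then Expense_Type
index = pd.MultiIndex.from_tuples(list(rows), names=["Leader", "Expense_Type"])

# Two-level column index, as rendered in the table (assumed layout)
columns = pd.MultiIndex.from_product([["Expense_Amount"], ["mean", "std", "count"]])

agg_data = pd.DataFrame(list(rows.values()), index=index, columns=columns)
print(agg_data)
```

A frame with this layout is what one would typically get from grouping raw expenses by Leader and Expense_Type and aggregating with mean, std and count, although the question does not show that step.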

2 Answers:

Answer 0 (score: 0)

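You can use df.apply(your_function, axis=1) together with a function like this:

def your_function(row):
    # With axis=1, each call receives one row of the frame as a Series
    mean = row['mean']
    std = row['std']
    result = mean  # Replace with your number generator
    return result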

For a more detailed explanation, see this answer: How to apply a function to two columns of Pandas dataframe
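As an illustration of the row-wise apply idea, here is a minimal sketch with the placeholder replaced by an actual generator. It assumes the expenses are normally distributed (the question does not say so), uses a hypothetical two-row frame named `stats` with flat column names, and the helper name `draw_normal` is made up for this example. If the columns are two-level, as in the question's table, selecting the top level first (for example `agg_data["Expense_Amount"]`) should give flat names like these.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row stand-in for the summary table, with flat columns
stats = pd.DataFrame(
    {"mean": [1979.68, 118.99], "std": [2731.63, 270.01], "count": [1358, 1179]},
    index=pd.MultiIndex.from_tuples(
        [("Leader1", "Airfare"), ("Leader1", "Booking Fees")],
        names=["Leader", "Expense_Type"],
    ),
)

rng = np.random.default_rng(0)  # seed chosen arbitrarily, for repeatability

def draw_normal(row):
    # Assumption: normal distribution with the row's mean and std.
    # The row arrives as a float Series, so count is cast back to int.
    return np.round(rng.normal(row["mean"], row["std"], int(row["count"])), 2)

# One array of draws per (Leader, Expense_Type) row
samples = stats.apply(draw_normal, axis=1)
print(samples.map(len))
```

The result is a Series indexed by (Leader, Expense_Type) whose values are arrays of different lengths, so it is better suited to lookups per group than to further column-wise arithmetic.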

Answer 1 (score: 0)

Here is my solution:

import numpy as np

# Dictionary to hold generated data, keyed by (leader, expense_type)
rand_expenses_dict = {}

# Loop over each unique leader
for leader in agg_data.index.get_level_values("Leader").unique():

    # Loop over each unique expense type
    for expense_type in agg_data.index.get_level_values("Expense_Type").unique():

        # Not all leaders have all expense types, so skip
        # combinations that do not exist in the index
        try:
            # The row holds mean, std and count, in that order
            row = agg_data.loc[(leader, expense_type)]
        except KeyError:
            continue

        # Generate random numbers, rounded to 2 decimals
        rand = np.round(
            np.random.normal(
                loc=row.iloc[0],
                scale=row.iloc[1],
                size=int(row.iloc[2]),
            ),
            2,
        )

        # Add random numbers to the data dictionary
        rand_expenses_dict[(leader, expense_type)] = rand
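
The nested loops can also be avoided altogether. The sketch below is one possible alternative, not part of the answer above: it repeats each row's mean and standard deviation "count" times and draws all values in a single call. The frame `stats` is a hypothetical two-row stand-in with flat column names, and the normal distribution is the same assumption as in the loop version.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row stand-in for the summary table, with flat columns
stats = pd.DataFrame(
    {"mean": [1979.68, 118.99], "std": [2731.63, 270.01], "count": [1358, 1179]},
    index=pd.MultiIndex.from_tuples(
        [("Leader1", "Airfare"), ("Leader1", "Booking Fees")],
        names=["Leader", "Expense_Type"],
    ),
)

counts = stats["count"].to_numpy().astype(int)
rng = np.random.default_rng(0)  # seed chosen arbitrarily, for repeatability

# Repeat each row's parameters "count" times, then draw everything at once
values = rng.normal(
    loc=np.repeat(stats["mean"].to_numpy(), counts),
    scale=np.repeat(stats["std"].to_numpy(), counts),
)

# Long-format result: one row per simulated expense, labelled by its group
simulated = pd.DataFrame(
    {"Expense_Amount": np.round(values, 2)},
    index=stats.index.repeat(counts),
)
print(simulated.groupby(level=["Leader", "Expense_Type"]).size())
```

Because only combinations present in the index are repeated, no exception handling is needed for leaders that lack an expense type. Note that with standard deviations larger than the means, as in the table, normal draws will include negative amounts, so a different distribution may be worth considering.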