Collecting summary statistics on a DataFrame built by randomly sampling other DataFrames

Time: 2016-10-16 00:44:08

Tags: python loops pandas dictionary

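My goal is to build a DataFrame by randomly sampling rows from other DataFrames, collect summary statistics on the new DataFrame, and then append those statistics to a list. Ideally, I could repeat this process many times (e.g. for a bootstrap).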

dfposlist = [OFdf, Firstdf, Seconddf, Thirddf, CFdf, RFdf, Cdf, SSdf]

OFdf.head()
    playerID    OPW         POS salary
87  bondsba01   62.061290   OF  8541667
785 ramirma02   35.785630   OF  13050000
966 walkela01   30.644305   OF  6050000
859 sheffga01   29.090699   OF  9916667
357 gilesbr02   28.160054   OF  7666666

All of the DataFrames in the list have the same column headers. What I am trying to do looks something like this:

teamdist = []
for df in dfposlist:
    # sample one player from this position's DataFrame
    frames = [df.sample(n=1)]
    team = pd.concat(frames)

    teamopw = team['OPW'].sum()
    teamsal = team['salary'].sum()
    teamplayers = team['playerID'].tolist()

    teamdic = {'Salary':teamsal, 'OPW':teamopw, 'Players':teamplayers}
    teamdist.append(teamdic)

The output I am looking for is something like this:

teamdist = [{'Salary':4900000, 'OPW':78.452, 'Players':[bondsba01, etc, etc]}]

But for some reason, sum operations such as teamopw = team['OPW'].sum() do not work the way I expect; they just return the individual elements of team['OPW']:

print(teamopw)
0.17118131814601256
38.10700006434629
1.5699939126695253
32.9068837019903
16.990760776263674
18.22428871113601
13.447706356730897

Any suggestions on how to make this work? Thanks!
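
One plausible reading of the printed values, assuming the aggregation lines ran inside the for loop: team is rebuilt from a single sampled row on every pass, so .sum() only ever has one value to add up. The sketch below reproduces that with two made-up frames (of_df, first_df and their contents are hypothetical stand-ins for the real position frames) and contrasts it with sampling from every frame first and concatenating once. The fixed random_state is only there to make the comparison repeatable.

```python
import pandas as pd

# Hypothetical stand-ins for two of the position frames (OFdf, Firstdf, ...).
of_df = pd.DataFrame({"playerID": ["a01", "b01"],
                      "OPW": [10.0, 20.0],
                      "salary": [100, 200]})
first_df = pd.DataFrame({"playerID": ["c01", "d01"],
                         "OPW": [1.0, 2.0],
                         "salary": [10, 20]})
frames_in = [of_df, first_df]

# Pattern from the question: team holds one row per pass,
# so each "sum" is just that row's OPW.
per_iteration = []
for df in frames_in:
    team = pd.concat([df.sample(n=1, random_state=0)])
    per_iteration.append(team["OPW"].sum())

# Sample one row from every frame, concatenate once, then aggregate.
team = pd.concat([df.sample(n=1, random_state=0) for df in frames_in])
team_total = team["OPW"].sum()

print(per_iteration)  # one value per position frame
print(team_total)     # a single team-level total
```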

Edit: a working solution is below. I'm not sure it is the most Pythonic approach, but it works.

teamdist = []
team = pd.concat([df.sample(n=1) for df in dfposlist])

teamopw = team[['OPW']].values.sum()
teamsal = team[['salary']].values.sum()
teamplayers = team['playerID'].tolist()

teamdic = {'Salary':teamsal, 'OPW':teamopw, 'Players':teamplayers}
teamdist.append(teamdic)
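
To repeat the draw many times, as in the bootstrap mentioned at the top, the working solution can be wrapped in a function and called in a loop. This is a minimal sketch, not a definitive implementation: sample_team, position_frames, and the toy player data are hypothetical names and values introduced for illustration, and the 1000 repetitions are an arbitrary choice.

```python
import numpy as np
import pandas as pd


def sample_team(position_frames, rng=None):
    """Draw one random row per frame and summarise the resulting team."""
    team = pd.concat([df.sample(n=1, random_state=rng)
                      for df in position_frames])
    return {
        "Salary": team["salary"].sum(),
        "OPW": team["OPW"].sum(),
        "Players": team["playerID"].tolist(),
    }


# Toy stand-ins for dfposlist; the real frames share these column names.
position_frames = [
    pd.DataFrame({"playerID": ["of1", "of2", "of3"],
                  "OPW": [62.1, 35.8, 30.6],
                  "salary": [8541667, 13050000, 6050000]}),
    pd.DataFrame({"playerID": ["c1", "c2"],
                  "OPW": [12.0, 8.5],
                  "salary": [900000, 450000]}),
]

# A shared RandomState makes the whole run reproducible while still
# giving a different draw on each repetition.
rng = np.random.RandomState(0)
teamdist = [sample_team(position_frames, rng) for _ in range(1000)]

print(len(teamdist))
print(teamdist[0])
```

Passing the same RandomState object to every call advances its state between draws; passing a fixed integer instead would return the same team every time.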

1 Answer:

Answer 0 (score: 2)

Here you go (using random data):

import pandas as pd
import numpy as np

dfposlist = dict(zip(range(10),
                     [pd.DataFrame(np.random.randn(10, 5),
                                   columns=list('abcde'))
                     for i in range(10)]))
for df in dfposlist.values():
    df['f'] = list('qrstuvwxyz')

teamdist = []
team = pd.concat([df.sample(n=1) for df in dfposlist.values()])
print(team.info())

teamdic = team[['a', 'c', 'e']].sum().to_dict()
teamdic['f'] = team['f'].tolist()
teamdist.append(teamdic)
print(teamdist)

# Output:
## team.info():
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 1 to 6
Data columns (total 6 columns):
a    10 non-null float64
b    10 non-null float64
c    10 non-null float64
d    10 non-null float64
e    10 non-null float64
f    10 non-null object
dtypes: float64(5), object(1)
memory usage: 560.0+ bytes
None

## teamdist:
[{'a': -3.5380097363724601,
  'c': 2.0951152809401776,
  'e': 3.1439230427971863,
  'f': ['r', 'w', 'z', 'v', 'x', 'q', 't', 'q', 'v', 'w']}]
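
Once the list holds many sampled teams, one way to get summary statistics across them is to load the list of dicts into a DataFrame and call describe(). The sketch below assumes the same random-data setup as above, with a reduced set of columns; the names frames, rows, and summary and the 200 repetitions are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(1)

# Ten random frames standing in for the position frames.
frames = [pd.DataFrame(rng.randn(10, 3), columns=list("ace"))
          for _ in range(10)]

rows = []
for _ in range(200):
    # One row from each frame, then column totals for this sampled team.
    team = pd.concat([df.sample(n=1, random_state=rng) for df in frames])
    rows.append(team[["a", "c", "e"]].sum().to_dict())

# One row per sampled team, one column per summed statistic.
summary = pd.DataFrame(rows)
print(summary.describe())
```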