I have the following data:
[4.1, 4.1, 4.1, 4.2, 4.3, 4.3, 4.4, 4.5, 4.6, 4.6, 4.8, 4.9, 5.1, 5.1, 5.2, 5.2, 5.3, 5.3, 5.3, 5.4, 5.4, 5.5, 5.6, 5.6, 5.6, 5.7, 5.8, 5.9, 6.2, 6.2, 6.2, 6.3, 6.4, 6.4, 6.5, 6.6, 6.7, 6.7, 6.8, 6.8]
From the data above I need to build a count/frequency table:
4.1 - 4.5: 8
4.6 - 5.0: 4
5.1 - 5.5: 10
5.6 - 6.0: 6
6.1 - 6.5: 7
6.6 - 7.0: 5
The closest I have been able to get is:
            counts  freqs
categories
[4.1, 4.6)       8  0.200
[4.6, 5.1)       4  0.100
[5.1, 5.6)      10  0.250
[5.6, 6.1)       6  0.150
[6.1, 6.6)       7  0.175
[6.6, 7.1)       5  0.125
using this code:
import pandas as pd

sr = [4.1, 4.1, 4.1, 4.2, 4.3, 4.3, 4.4, 4.5, 4.6, 4.6, 4.8, 4.9, 5.1, 5.1, 5.2, 5.2, 5.3, 5.3, 5.3, 5.4, 5.4, 5.5, 5.6, 5.6, 5.6, 5.7, 5.8, 5.9, 6.2, 6.2, 6.2, 6.3, 6.4, 6.4, 6.5, 6.6, 6.7, 6.7, 6.8, 6.8]
ncut = pd.cut(sr, [4.1, 4.6, 5.1, 5.6, 6.1, 6.6, 7.1], right=False)
srpd = pd.DataFrame(ncut.describe())
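I need to create a new column holding the midpoint (the "median" of the range) of each "categories" value. For example, "[4.1, 4.6)" covers the counts/frequencies of the data from 4.1 to 4.5 (4.6 is excluded), so I need (4.1 + 4.5) / 2, which equals 4.3.
Here are my questions:
1) How do I access the values under the "categories" index so that I can use them in the calculation above?
2) Is there a way to display the ranges like this: 4.1-4.5, 4.6-5.0, and so on?
3) For grouped data like this, is there an easier way to compute the mean, median, mode, etc., or do I have to write my own functions for these in Python?
Thanks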
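For questions 1 and 2, one possible approach is to read the interval edges from the `categories` attribute of the result of `pd.cut`, which is an `IntervalIndex` exposing `.left` and `.right`. The following is only a sketch: the `step` value of 0.1 is an assumption about the measurement precision of the data, and the column names `range`, `midpoint`, and `count` are made up for illustration.

```python
import numpy as np
import pandas as pd

sr = [4.1, 4.1, 4.1, 4.2, 4.3, 4.3, 4.4, 4.5, 4.6, 4.6, 4.8, 4.9, 5.1,
      5.1, 5.2, 5.2, 5.3, 5.3, 5.3, 5.4, 5.4, 5.5, 5.6, 5.6, 5.6, 5.7,
      5.8, 5.9, 6.2, 6.2, 6.2, 6.3, 6.4, 6.4, 6.5, 6.6, 6.7, 6.7, 6.8, 6.8]
bins = [4.1, 4.6, 5.1, 5.6, 6.1, 6.6, 7.1]
step = 0.1  # assumption: the data is recorded to one decimal place

ncut = pd.cut(sr, bins, right=False)  # Categorical with interval categories
cats = ncut.categories                # IntervalIndex: [4.1, 4.6), [4.6, 5.1), ...

lower = cats.left.to_numpy()
# Largest value each class can hold; rounded to avoid float noise
upper = np.round(cats.right.to_numpy() - step, 1)

table = pd.DataFrame({
    "range": [f"{lo}-{hi}" for lo, hi in zip(lower.tolist(), upper.tolist())],
    "midpoint": np.round((lower + upper) / 2, 2),
    # value_counts() orders by count, so restore the category order
    "count": pd.Series(ncut).value_counts().sort_index().to_numpy(),
})
print(table)
```

Building the labels from the interval edges keeps the displayed ranges and the midpoints tied to the same bin definition.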
Answer 0 (score: 2)
For your question about bins and labels, how about the following:
bins = [4.1, 4.6, 5.1, 5.6, 6.1, 6.6, 7.1]
labels = ['{}-{}'.format(x, y-.1) for x, y in zip(bins[:], bins[1:])]
Then put the values in a Series rather than a list
sr = pd.Series([4.1, 4.1, 4.1, 4.2, 4.3, 4.3, 4.4, 4.5, 4.6, 4.6, 4.8, 4.9, 5.1,
5.1, 5.2, 5.2, 5.3, 5.3, 5.3, 5.4, 5.4, 5.5, 5.6, 5.6, 5.6, 5.7,
5.8, 5.9, 6.2, 6.2, 6.2, 6.3, 6.4, 6.4, 6.5, 6.6, 6.7, 6.7, 6.8, 6.8])
ncut = pd.cut(sr, bins=bins, labels=labels, right=False)
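Define a lambda function to compute the relative frequency (the bin's count divided by the total number of observations)
freq = lambda x: len(x) / len(sr)
freq.__name__ = 'freq'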
Finally, use concat, groupby, and agg to get summary statistics for each bin
pd.concat([ncut, sr], axis=1).groupby(0).agg(['size', 'std', 'mean', freq])
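If only the counts and relative frequencies are needed, `value_counts` on the binned Series may be enough. This is a sketch under the assumption that "frequency" means the count divided by the total number of observations; the `summary` name and its column names are illustrative, and `round(..., 1)` is added to keep float noise out of the labels.

```python
import pandas as pd

sr = pd.Series([4.1, 4.1, 4.1, 4.2, 4.3, 4.3, 4.4, 4.5, 4.6, 4.6, 4.8, 4.9, 5.1,
                5.1, 5.2, 5.2, 5.3, 5.3, 5.3, 5.4, 5.4, 5.5, 5.6, 5.6, 5.6, 5.7,
                5.8, 5.9, 6.2, 6.2, 6.2, 6.3, 6.4, 6.4, 6.5, 6.6, 6.7, 6.7, 6.8, 6.8])
bins = [4.1, 4.6, 5.1, 5.6, 6.1, 6.6, 7.1]
labels = [f"{lo}-{round(hi - 0.1, 1)}" for lo, hi in zip(bins, bins[1:])]

binned = pd.cut(sr, bins=bins, labels=labels, right=False)

# sort_index() puts the bins back in category order instead of count order
counts = binned.value_counts().sort_index()
summary = pd.DataFrame({"count": counts, "freq": counts / len(sr)})
print(summary)
```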
Answer 1 (score: 1)
Let's try this:
l = [4.1, 4.1, 4.1, 4.2, 4.3, 4.3, 4.4, 4.5, 4.6, 4.6, 4.8, 4.9,
5.1, 5.1, 5.2, 5.2, 5.3, 5.3, 5.3, 5.4, 5.4, 5.5, 5.6, 5.6,
5.6, 5.7, 5.8, 5.9, 6.2, 6.2, 6.2, 6.3, 6.4, 6.4, 6.5, 6.6,
6.7, 6.7, 6.8, 6.8]
s = pd.Series(l)
bins = [4.1, 4.6, 5.1, 5.6, 6.1, 6.6, 7.1]
#Python 3.6+ f-string
labels = [f'{i}-{j-.1}' for i,j in zip(bins,bins[1:])]
(pd.concat([pd.cut(s, bins=bins, labels=labels, right=False),s],axis=1)
.groupby(0)[1]
.agg(['mean','median', pd.Series.mode, 'std'])
.rename_axis('categories')
.reset_index())
Output:
categories mean median mode std
0 4.1-4.5 4.250000 4.25 4.1 0.151186
1 4.6-5.0 4.725000 4.70 4.6 0.150000
2 5.1-5.5 5.280000 5.30 5.3 0.131656
3 5.6-6.0 5.700000 5.65 5.6 0.126491
4 6.1-6.5 6.314286 6.30 6.2 0.121499
5 6.6-7.0 6.720000 6.70 [6.7, 6.8] 0.083666
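A variation on the same idea, sketched here with named columns so the groupby does not rely on the positional labels 0 and 1 that `concat` produces. The column names `value` and `category` are assumptions chosen for illustration, and `observed=False` is passed explicitly so empty bins would still appear in the result.

```python
import pandas as pd

values = [4.1, 4.1, 4.1, 4.2, 4.3, 4.3, 4.4, 4.5, 4.6, 4.6, 4.8, 4.9,
          5.1, 5.1, 5.2, 5.2, 5.3, 5.3, 5.3, 5.4, 5.4, 5.5, 5.6, 5.6,
          5.6, 5.7, 5.8, 5.9, 6.2, 6.2, 6.2, 6.3, 6.4, 6.4, 6.5, 6.6,
          6.7, 6.7, 6.8, 6.8]
bins = [4.1, 4.6, 5.1, 5.6, 6.1, 6.6, 7.1]
labels = [f"{lo}-{round(hi - 0.1, 1)}" for lo, hi in zip(bins, bins[1:])]

df = pd.DataFrame({"value": values})
df["category"] = pd.cut(df["value"], bins=bins, labels=labels, right=False)

# Per-bin statistics computed from the raw values in each bin
stats = (
    df.groupby("category", observed=False)["value"]
      .agg(["count", "mean", "median", "std"])
      .reset_index()
)
print(stats)
```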
Answer 2 (score: 0)
I sort of figured out a way:
def buildFreqTable(data, width, numclass, pw):
    # Sort a copy so the caller's data is left unchanged
    data = sorted(data)
    minrange = []
    maxrange = []
    x_med = []
    count = []
    # The data is sorted, so the lowest value starts the first range
    f_data = data[0]
    for i in range(numclass):
        # minrange holds the minimum value for that row
        minrange.append(f_data)
        # maxrange holds the maximum value for that row
        maxrange.append(f_data + (width - pw))
        # Compute the midpoint of the range
        x_med.append((minrange[i] + maxrange[i]) / 2)
        # Initialize the count for this class to 0; it is incremented below
        count.append(0)
        f_data = f_data + width
    # Tally the frequencies
    for x in data:
        for i in range(numclass):
            if minrange[i] <= x <= maxrange[i]:
                count[i] += 1
    # Build the pandas DataFrame for easier manipulation
    freqtable = pd.DataFrame()
    freqtable['minrange'] = minrange
    freqtable['maxrange'] = maxrange
    freqtable['x'] = x_med
    freqtable['count'] = count
    return freqtable

print(buildFreqTable(sr, 0.5, 6, 0.1))
It outputs the following:
minrange maxrange x count
0 4.1 4.5 4.3 8
1 4.6 5.0 4.8 4
2 5.1 5.5 5.3 10
3 5.6 6.0 5.8 6
4 6.1 6.5 6.3 7
5 6.6 7.0 6.8 5
I would still like to know whether there is a simpler way to do this, or whether someone can refactor my code to be more "pro". Thanks
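On question 3, statistics for grouped data can be computed directly from a table with the columns shown above. The sketch below uses the usual textbook formulas, which are an assumption here and not something the thread confirms: the grouped mean weights each midpoint by its count, the modal class is the row with the highest count, and the grouped median interpolates linearly inside the median class, taking class boundaries halfway across the 0.1 gap between neighbouring ranges. The names `gap`, `class_width`, and `grouped_median` are illustrative.

```python
import pandas as pd

# Frequency table as printed above
freqtable = pd.DataFrame({
    "minrange": [4.1, 4.6, 5.1, 5.6, 6.1, 6.6],
    "maxrange": [4.5, 5.0, 5.5, 6.0, 6.5, 7.0],
    "x": [4.3, 4.8, 5.3, 5.8, 6.3, 6.8],
    "count": [8, 4, 10, 6, 7, 5],
})

n = freqtable["count"].sum()

# Grouped mean: midpoints weighted by class counts
grouped_mean = (freqtable["x"] * freqtable["count"]).sum() / n

# Modal class: the row with the largest count
modal_row = freqtable.loc[freqtable["count"].idxmax()]

# Grouped median by linear interpolation inside the median class
gap = 0.1  # assumption: distance between one maxrange and the next minrange
cum = freqtable["count"].cumsum()
median_idx = (cum >= n / 2).idxmax()  # first class reaching half the total
row = freqtable.loc[median_idx]
lower_boundary = row["minrange"] - gap / 2
class_width = row["maxrange"] - row["minrange"] + gap
cum_before = cum.loc[median_idx] - row["count"]
grouped_median = lower_boundary + (n / 2 - cum_before) / row["count"] * class_width

print(grouped_mean, grouped_median, modal_row["x"])
```

These values are estimates from the binned table, so they can differ slightly from statistics computed on the raw data.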