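How do I get the distribution of values in each column?
I am trying to get the percentage distribution of the values in each column. Suppose I have this data: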
Date Region Rep Item Units Unit Cost Total
1/6/18 East Jones Pencil 95 1.99 189.05
1/23/18 Central Kivell Binder 50 19.99 999.50
2/9/18 Central Jardine Pencil 36 4.99 179.64
2/26/18 Central Gill Pen 27 19.99 539.73
3/15/18 West Sorvino Pencil 56 2.99 167.44
4/1/18 East Jones Binder 60 4.99 299.40
4/18/18 Central Andrews Pencil 75 1.99 149.25
5/5/18 Central Jardine Pencil 90 4.99 449.10
5/22/18 West Thompson Pencil 32 1.99 63.68
6/8/18 East Jones Binder 60 8.99 539.40
6/25/18 Central Morgan Pencil 90 4.99 449.10
7/12/18 East Howard Binder 29 1.99 57.71
7/29/18 East Parent Binder 81 19.99 1,619.19
8/15/18 East Jones Pencil 35 4.99 174.65
9/1/18 Central Smith Desk 2 125.00 250.00
9/18/18 East Jones Pen Set 16 15.99 255.84
10/5/18 Central Morgan Binder 28 8.99 251.72
10/22/18 East Jones Pen 64 8.99 575.36
11/8/18 East Parent Pen 15 19.99 299.85
11/25/18 Central Kivell Pen Set 96 4.99 479.04
12/12/18 Central Smith Pencil 67 1.29 86.43
12/29/18 East Parent Pen Set 74 15.99 1,183.26
1/15/19 Central Gill Binder 46 8.99 413.54
2/1/19 Central Smith Binder 87 15.00 1,305.00
2/18/19 East Jones Binder 4 4.99 19.96
3/7/19 West Sorvino Binder 7 19.99 139.93
3/24/19 Central Jardine Pen Set 50 4.99 249.50
4/10/19 Central Andrews Pencil 66 1.99 131.34
4/27/19 East Howard Pen 96 4.99 479.04
5/14/19 Central Gill Pencil 53 1.29 68.37
5/31/19 Central Gill Binder 80 8.99 719.20
6/17/19 Central Kivell Desk 5 125.00 625.00
7/4/19 East Jones Pen Set 62 4.99 309.38
7/21/19 Central Morgan Pen Set 55 12.49 686.95
8/7/19 Central Kivell Pen Set 42 23.95 1,005.90
8/24/19 West Sorvino Desk 3 275.00 825.00
9/10/19 Central Gill Pencil 7 1.29 9.03
9/27/19 West Sorvino Pen 76 1.99 151.24
10/14/19 West Thompson Binder 57 19.99 1,139.43
10/31/19 Central Andrews Pencil 14 1.29 18.06
11/17/19 Central Jardine Binder 11 4.99 54.89
12/4/19 Central Jardine Binder 94 19.99 1,879.06
12/21/19 Central Andrews Binder 28 4.99 139.72
I would like to get a distribution like this:
Region: { "central" : 0.558,
"west" : 0.139,
"east" : 0.302
}
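This means the Central region accounts for 55.8% of the values in the Region column. How can I get this? Finally, I want to export everything to an Excel file.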
import pandas as pd


def get_ddl(df):
    # Build a CREATE TABLE statement describing the DataFrame's columns.
    ddl = pd.io.sql.get_schema(df.reset_index(), 'table1')
    return ddl


def get_columns(df):
    # Return the column names as a plain list.
    return list(df.columns)


def distrebution(df, column):
    # Count the rows in each group of `column` (raw counts, not fractions).
    index = df.groupby(column).count()
    return index


def create_dict(excel_path, sheet_name):
    # Map each column name to [min, max] of its values.
    result = {}
    xls = pd.ExcelFile(excel_path)
    df1 = pd.read_excel(xls, sheet_name)
    for col in get_columns(df1):
        result[col] = [df1[col].min(), df1[col].max()]
    return result


xls = pd.ExcelFile('/home/sqream/SampleData.xlsx')
df1 = pd.read_excel(xls, 'SalesOrders')
x = get_ddl(df1)
min_max = create_dict('/home/sqream/SampleData.xlsx', 'SalesOrders')
res = pd.DataFrame(min_max)
res.rename(index={0: 'min', 1: 'max', 2: 'distrebution_of_values'}, inplace=True)
print(res)
res.to_excel('/home/sqream/df1.xlsx')
rows_num = len(df1.index)
x = distrebution(df1, 'Region')
d = x.to_dict()
print(d)
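Answer 0 (score: 0)
Just use value_counts() with normalize=True, which returns each value's fraction of the rows instead of its raw count: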
df.Region.value_counts(normalize=True)
Central 0.558140
East 0.302326
West 0.139535
Name: Region, dtype: float64
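The question also asks for every column and for an Excel export, so here is a rough sketch of how the same call could be extended. The helper names `column_distributions` and `distributions_to_frame`, the tiny `sample` frame, and the output file name are all made up for illustration; they are not part of the original post. Note that `value_counts()` skips missing values by default, so the fractions are relative to the non-null rows.

```python
import pandas as pd


def column_distributions(df):
    """Return {column: {value: fraction}} for every column of df."""
    return {
        col: df[col].value_counts(normalize=True).round(3).to_dict()
        for col in df.columns
    }


def distributions_to_frame(df):
    """Stack the per-column distributions into one long table."""
    parts = []
    for col in df.columns:
        shares = df[col].value_counts(normalize=True)
        parts.append(pd.DataFrame({
            "column": col,  # scalar, repeated for every row of this part
            "value": shares.index.astype(str).tolist(),
            "share": shares.to_numpy(),
        }))
    return pd.concat(parts, ignore_index=True)


# Tiny stand-in for the SalesOrders sheet (hypothetical, not the real data).
sample = pd.DataFrame({
    "Region": ["East", "Central", "Central", "West"],
    "Item": ["Pencil", "Binder", "Pencil", "Pen"],
})

dist = column_distributions(sample)
print(dist["Region"])

long_table = distributions_to_frame(sample)
print(long_table)

# Writing to Excel needs a writer engine such as openpyxl to be installed;
# the file name below is a placeholder.
# long_table.to_excel("distribution.xlsx", index=False)
```

With the real data you would pass `df1` instead of `sample`. A long table (one row per column/value pair) is used for the export because the columns have different numbers of distinct values, which does not fit neatly next to the min/max rows of `res`.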