My pandas dataframe "train" is
Name Comb Sales
Joy A123 102
John A134 112
Aby A123 140
Amit A123 190
Andrew A134 210
Pren A123 109
Abry A134 230
Hulk A134 188
...
For each unique Comb, I want to find the quartiles (25% quantile steps) of the corresponding Sales and create the respective bins. For example, if you compute the quartile bin edges for the Sales where Comb = 'A123', you get (102.00, 107.25, 124.50, 152.50, 190.00); a quick check of these edges follows the expected output below. Now I want to bin all the Sales where Comb = 'A123' using these quantiles. The data I would get is
Name Comb Sales Bin Bin_Low Bin_High
Joy A123 102 1 102 107.25
John A134 112 1 112 169
Aby A123 140 3 124.50 152.50
Amit A123 190 4 152.50 190
Andrew A134 210 3 199 215
Pren A123 109 2 107.25 124.50
Abry A134 230 4 215 230
Hulk A134 188 2 169 199
I wrote the following code, but the final dataframe is not in the right format.
import numpy as np
import pandas as pd

quant = pd.DataFrame()
for i in train['Comb'].unique():
    # quartile bins for the Sales of the current Comb value
    a = pd.qcut(train[train['Comb'] == i]['Sales'], 4, duplicates='drop')
    df = pd.DataFrame(np.array(a))
    comp = pd.concat([train[train['Comb'] == i], df], axis=1)
    quant = pd.concat([quant, comp])
Any help would be greatly appreciated.
Answer 0 (score: 1)
You can use groupby on the dataframe, grouped by Comb, and apply qcut to the Sales of each group. Then assign the left edge of each resulting interval to the Bin_Low column and the right edge to Bin_High. Note that qcut's intervals are open on the left end, so those values will be slightly below your expected output, but essentially the same:
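A minimal sketch of this groupby + qcut approach, using the column names from the sample data (Comb, Sales, Bin, Bin_Low, Bin_High); the helper name bin_group is hypothetical:

import pandas as pd

def bin_group(g):
    # Quartile-bin this group's Sales; qcut yields Interval values (left, right]
    bins = pd.qcut(g['Sales'], 4, duplicates='drop')
    g = g.copy()
    g['Bin'] = bins.cat.codes + 1                    # 1-based bin number
    g['Bin_Low'] = bins.apply(lambda iv: iv.left)    # open left edge
    g['Bin_High'] = bins.apply(lambda iv: iv.right)  # closed right edge
    return g

result = train.groupby('Comb', group_keys=False).apply(bin_group)

Because the lowest interval is open on the left, Bin_Low for the smallest Sales in each group comes out just below the true minimum (for example 101.999 instead of 102), which is the small discrepancy mentioned above.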