Counting repeated values in one column based on another column

Time: 2019-02-14 00:24:10

Tags: python pandas

Using pandas, I am working with CSV data of the following form:

f,f,f,f,f,t,f,f,f,t,f,t,g,f,n,f,f,t,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,t,t,nowin
t,f,f,f,f,f,f,f,f,f,t,f,g,f,b,f,f,t,f,f,f,f,f,t,f,t,f,f,f,f,f,f,f,t,f,n,won
t,f,f,f,t,f,f,f,t,f,t,f,g,f,b,f,f,t,f,f,f,t,f,t,f,t,f,f,f,f,f,f,f,t,f,n,won
f,f,f,f,f,f,f,f,f,f,t,f,g,f,b,f,f,t,f,f,f,f,f,t,f,t,f,f,f,f,f,f,f,t,f,n,nowin
t,f,f,f,t,f,f,f,t,f,t,f,g,f,b,f,f,t,f,f,f,t,f,t,f,t,f,f,f,f,f,f,f,t,f,n,won
f,f,f,f,f,f,f,f,f,f,t,f,g,f,b,f,f,t,f,f,f,f,f,t,f,t,f,f,f,f,f,f,f,t,f,n,win

For this excerpt of the raw data, I am trying to return something like:

Column1_name -- t -- counts of nowin = 0

Column1_name -- t -- count of wins = 3

Column1_name -- f -- count of nowin = 2 

Column1_name -- f -- count of win = 1

Based on the idea in "get dataframe row count based on conditions", I was thinking of doing something like this:

print(df[df.target == 'won'].count())

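However, this always returns the same number of "won" rows, because it filters on the last column only and ignores whether the column in question is "f" or "t". Alternatively, I would like to use something from the pandas DataFrame toolkit that works like SQL's "group by", for example grouping on the first and last columns.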
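For illustration, here is a minimal sketch of that "group by" idea on a toy frame. The six rows are made up to mirror the first and last columns of the excerpt above (with the last label written as "won"); they are an assumption, not the real dataset. It also shows why the filter-then-`count()` approach gives the same number for every column: `count()` reports the non-null cells per column of the filtered frame.

```python
import pandas as pd

# Toy stand-in for the first and last columns of the data (assumed values).
df = pd.DataFrame({
    "bkblk":  ["f", "t", "t", "f", "t", "f"],
    "target": ["nowin", "won", "won", "nowin", "won", "won"],
})

# Filtering on target alone: count() gives non-null cells per column,
# so every column reports the same number of "won" rows.
won_only = df[df.target == "won"].count()
print(won_only)

# Grouping on both columns counts rows per (feature value, target) pair.
counts = df.groupby(["bkblk", "target"]).size()

# Pivot the target labels into columns; pairs that never occur become 0.
table = counts.unstack(fill_value=0)
print(table)
```

On the real data the same two lines would be repeated per feature column, or the feature name swapped in a loop over the column list.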

Should I keep pursuing this idea, or should I simply start using for loops?

In case it is needed, here is the rest of my code:

import pandas as pd


url = "https://archive.ics.uci.edu/ml/machine-learning-databases/chess/king-rook-vs-king-pawn/kr-vs-kp.data"

df = pd.read_csv(url,names=[
                       'bkblk','bknwy','bkon8','bkona','bkspr','bkxbq','bkxcr','bkxwp','blxwp','bxqsq','cntxt','dsopp','dwipd',
                        'hdchk','katri','mulch','qxmsq','r2ar8','reskd','reskr','rimmx','rkxwp','rxmsq','simpl','skach','skewr',
                        'skrxp','spcop','stlmt','thrsk','wkcti','wkna8','wknck','wkovl','wkpos','wtoeg','target'
                        ])


features = ['bkblk','bknwy','bkon8','bkona','bkspr','bkxbq','bkxcr','bkxwp','blxwp','bxqsq','cntxt','dsopp','dwipd',
        'hdchk','katri','mulch','qxmsq','r2ar8','reskd','reskr','rimmx','rkxwp','rxmsq','simpl','skach','skewr',
        'skrxp','spcop','stlmt','thrsk','wkcti','wkna8','wknck','wkovl','wkpos','wtoeg','target']


# number of lines 
#tot_of_records = np.size(my_data,0) 
#tot_of_records = np.unique(my_data[:,1])

#for item in my_data:
#    item[:,0]
num_of_won=0
num_of_nowin=0

for item in df.target:
    if item == 'won':
        num_of_won = num_of_won + 1
    else:
        num_of_nowin = num_of_nowin + 1

print(num_of_won)
print(num_of_nowin)        

print(df[df.target == 'won'].count())  

#print(df[:1])
#print(df.bkblk.to_string(index=False))
#print(df.target.unique())
#ini_entropy = (() + ())

1 answer:

Answer 0 (score: 1)

This might work:

outdf = df.apply(lambda x: pd.crosstab(index=df.target,columns=x).to_dict())

Basically, we go through each feature column and build a cross-tabulation of it against the target column.
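As a rough sketch of the same idea, the variant below keeps one crosstab per feature in a plain dict instead of calling `apply` with `to_dict()`, so each table stays a DataFrame that can be printed or indexed directly. The toy rows and the name `feature_cols` are assumptions made for the example, not part of the original code.

```python
import pandas as pd

# Toy stand-in for two feature columns plus the target (assumed values).
df = pd.DataFrame({
    "bkblk":  ["f", "t", "t", "f", "t", "f"],
    "bknwy":  ["f", "f", "f", "f", "f", "f"],
    "target": ["nowin", "won", "won", "nowin", "won", "won"],
})

# Every column except the target is treated as a feature.
feature_cols = [c for c in df.columns if c != "target"]

# One crosstab per feature: rows are target labels, columns are feature values.
tables = {
    col: pd.crosstab(index=df["target"], columns=df[col])
    for col in feature_cols
}

print(tables["bkblk"])
```

Each cell is the number of rows with that target label and that feature value, which is the count the question asks for; a combination that never occurs shows up as 0 as long as both labels appear somewhere in the table.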


Hope this helps! :)