我有一个如下数据帧
df
:
ID color finish duration
A1 black smooth 12
A2 white matte 8
A3 blue smooth 20
A4 green matte 10
B1 black smooth 12
B2 white matte 8
B3 blue smooth
B4 green 10
C1 black smooth
C2 white matte 8
C3 blue smooth
C4 green 10
我想根据特定条件生成此数据框的子集。例如,
color= black
,finish = smooth
,duration = 12
,我得到以下数据框。
ID color finish duration score
A1 black smooth 12 1
B1 black smooth 12 1
color= blue
,finish = smooth
,duration = 20
,我得到以下数据框。
ID color finish duration score
A3 blue smooth 20 1
B3 blue smooth 0.666667
C3 blue smooth 0.666667
分数计算为填充的列数/总列数。 我想在pandas数据帧中循环它。 以下代码适用于我的2列。
list2 = list(df['color'].unique())
list3 = list(df['finish'].unique())
df_final = pd.DataFrame()
for i in range(len(list2)):
for j in range(len(list3)):
print 'Current Attribute Value:',list2[i],list3[j]
gbl["df_"+list2[i]] = df[df.color == list2[i]]
gbl["df_" + list2[i] + list3[j]] =
gbl["df_"+list2[i]].loc[gbl["df_"+list2[i]].finish == list3[j]]
gbl["df_" + list2[i] + list3[j]]['dfattribval'] = list2[i] + list3[j]
df_final = df_final.append(gbl["df_" + list2[i] + list3[j]], ignore_index=True)
但是,我无法将其循环到列名称上。我想做的是,
lista = ['color','finish']
df_final = pd.DataFrame()
for a in range(len(lista)):
for i in range(len(list2)):
for j in range(len(list3)):
print 'Current Attribute Value:',lista[a],list2[i],lista[a+1],list3[j]
gbl["df_"+list2[i]] = df[df.lista[a] == list2[i]]
gbl["df_" + list2[i] + list3[j]] = gbl["df_"+list2[i]].loc[gbl["df_"+list2[i]].lista[a+1] == list3[j]]
gbl["df_" + list2[i] + list3[j]]['dfattribval'] = list2[i] + list3[j]
df_final = df_final.append(gbl["df_" + list2[i] + list3[j]], ignore_index=True)
我得到了明显的错误 -
AttributeError:' DataFrame'对象没有属性' lista'。
任何人都知道如何遍历列名和值。非常感谢!
答案 0 :(得分:1)
不太确定您的需求,但考虑使用列表解析来排列列表以避免嵌套循环并使用数据帧字典。可能会调整scorecalc()
apply函数以满足您的需求:
colorlist = list(df['color'].unique())
finishlist = list(df['finish'].unique())
durationlist = list(df['duration'].unique())
# ALL COMBINATIONS BETWEEN LISTS
allList = [(c,f, d) for c in colorlist for f in finishlist for d in durationlist]
def scorecalc(row):
row['score'] = row['duration'].count()
return(row)
dfList = []; dfDict = {}
for i in allList:
# SUBSET DFS
tempdf = df[(df['color'] == i[0]) & (df['finish']==i[1]) & (df['duration']==i[2])]
if len(tempdf) > 0: # FOR NON-EMPTY DFS
print('Current Attribute Value:', i[0], i[1], i[2])
tempdf = tempdf.groupby(['color','finish']).apply(scorecalc)
tempdf['score'] = tempdf['score'] / len(tempdf)
print(tempdf)
key = str(i[0]) + str(i[1]) + str(i[2])
dfDict[key] = tempdf # DICTIONARY OF DFS (USE pd.DataFrame(list(...)) FOR FINAL)
dfList.append(tempdf) # LIST OF DFS (USE pd.concat() FOR FINAL DF)
# Current Attribute Value: black smooth 12.0
# ID color finish duration score
#0 A1 black smooth 12.0 1.0
#4 B1 black smooth 12.0 1.0
#Current Attribute Value: white matte 8.0
# ID color finish duration score
#1 A2 white matte 8.0 1.0
#5 B2 white matte 8.0 1.0
#9 C2 white matte 8.0 1.0
#Current Attribute Value: blue smooth 20.0
# ID color finish duration score
#2 A3 blue smooth 20.0 1.0
#Current Attribute Value: green matte 10.0
# ID color finish duration score
#3 A4 green matte 10.0 1.0