I have the following two dataframes.
import pandas as pd
data = [[1, 'NEW'], [2, 'OLD'], [3, 'OLD'],[4, 'OLD']]
df1 = pd.DataFrame(data, columns = ['ID', 'Age'])
df2 = pd.DataFrame({'ID' : [[1,2,3], [2,3],[1,3,4], [2,3]]})
print(df1)
print(df2)
   ID  Age
0   1  NEW
1   2  OLD
2   3  OLD
3   4  OLD
          ID
0  [1, 2, 3]
1     [2, 3]
2  [1, 3, 4]
3     [2, 3]
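I am trying to compute, for each row of df2, the fraction of IDs in the list that are labelled "NEW" in df1, and add it to df2 as a new column. I am using the function below, and it works fine. However, it does not seem efficient for large dataframes. Is there a more efficient/pythonic way to do this?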
def id_list(x):
    ttl = 0
    for i in x:
        if df1.loc[df1.ID == int(i), 'Age'].iloc[0] == 'NEW':
            ttl = ttl + 1
    return ttl / len(x)
df2['percentage']=df2.ID.apply(id_list)
df2
          ID  percentage
0  [1, 2, 3]    0.333333
1     [2, 3]    0.000000
2  [1, 3, 4]    0.333333
3     [2, 3]    0.000000
Answer 0 (score: 1)
This can be done with explode and groupby:
df2['percentage'] = (df2.ID.explode() # flatten `ID` column
.map(df1.set_index('ID').Age) # map ID to `Age` label
.eq('NEW') # compare with the label of interest
.groupby(level=0).mean()
)
Output:
          ID  percentage
0  [1, 2, 3]    0.333333
1     [2, 3]    0.000000
2  [1, 3, 4]    0.333333
3     [2, 3]    0.000000
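To see what each step of the chain does, here is a self-contained sketch that splits it into intermediate variables. The names `exploded`, `labels`, and `is_new` are illustrative and not part of the answer above. It assumes pandas 0.25 or later (where `Series.explode` is available) and that the `ID` values in `df1` are unique, since the lookup Series is indexed by `ID`.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 'NEW'], [2, 'OLD'], [3, 'OLD'], [4, 'OLD']],
                   columns=['ID', 'Age'])
df2 = pd.DataFrame({'ID': [[1, 2, 3], [2, 3], [1, 3, 4], [2, 3]]})

# One row per list element; the original row index is repeated for each element.
exploded = df2['ID'].explode()

# Look up each ID's label through a Series indexed by ID
# (assumption: IDs in df1 are unique).
labels = exploded.map(df1.set_index('ID')['Age'])

# True where the label is 'NEW'; the mean of booleans is the fraction of True values.
is_new = labels.eq('NEW')

# Collapse back to one value per original row of df2.
df2['percentage'] = is_new.groupby(level=0).mean()
print(df2)
```

Because the lookup is built once and applied to the whole flattened column, this avoids the per-element boolean mask on `df1` that the question's function runs inside its loop.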
Answer 1 (score: 1)
Almost the same idea as Quang's: explode, then take the mean over index level 0.
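# Series.mean(level=0) was removed in pandas 2.0; groupby(level=0).mean() is the equivalent
df2.ID.explode().map(df1.set_index('ID').Age).eq('NEW').astype(int).groupby(level=0).mean()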
0    0.333333
1    0.000000
2    0.333333
3    0.000000
Name: ID, dtype: float64
df2['New Ave'] = df2.ID.explode().map(df1.set_index('ID').Age).eq('NEW').astype(int).groupby(level=0).mean()
Answer 2 (score: 0)
Try:
import numpy as np
new_=set(df1.loc[df1['Age'].eq('NEW'), 'ID'].tolist())
df2['percentage']=df2['ID'].map(set).agg(lambda x: len(np.bitwise_and(x, new_))/len(x))
Output:
          ID  percentage
0  [1, 2, 3]    0.333333
1     [2, 3]    0.000000
2  [1, 3, 4]    0.333333
3     [2, 3]    0.000000
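The same set-based idea can also be written without NumPy, using Python's built-in set intersection. This is a minimal sketch rather than the answer author's code; `new_ids` and `frac_new` are hypothetical names. Like the answer above, it converts each list to a set, so an ID that appears twice in a list is counted once, which differs from the question's function when lists contain duplicates. An empty list would raise `ZeroDivisionError`.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 'NEW'], [2, 'OLD'], [3, 'OLD'], [4, 'OLD']],
                   columns=['ID', 'Age'])
df2 = pd.DataFrame({'ID': [[1, 2, 3], [2, 3], [1, 3, 4], [2, 3]]})

# Build the set of 'NEW' IDs once, so each membership test is a hash lookup.
new_ids = set(df1.loc[df1['Age'].eq('NEW'), 'ID'])


def frac_new(ids):
    """Return the fraction of distinct IDs in `ids` that are labelled 'NEW'."""
    unique = set(ids)
    return len(unique & new_ids) / len(unique)


df2['percentage'] = df2['ID'].apply(frac_new)
print(df2)
```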