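I am trying to convert my old-school code into elegant/fast Pandas code, as in this previous question:
Rolling operation slow performance to create a new column
I have 4 different pieces of code whose performance I would like to improve with elegant and fast Pandas code.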
The original DataFrame (df) is similar to this one (although much bigger):
import numpy as np
import pandas as pd

# Two-level index: month labels and type labels
idx = [np.array(['Jan-18', 'Jan-18', 'Feb-18', 'Mar-18', 'Mar-18', 'Mar-18','Apr-18', 'Apr-18', 'May-18', 'Jun-18', 'Jun-18', 'Jun-18','Jul-18', 'Aug-18', 'Aug-18', 'Sep-18', 'Sep-18', 'Oct-18','Oct-18', 'Oct-18', 'Nov-18', 'Dec-18', 'Dec-18',]),np.array(['A', 'B', 'B', 'A', 'B', 'C', 'A', 'B', 'B', 'A', 'B', 'C','A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'])]
data = [{'x': 1}, {'x': 5}, {'x': 3}, {'x': 2}, {'x': 7}, {'x': 3},{'x': 1}, {'x': 6}, {'x': 3}, {'x': 5}, {'x': 2}, {'x': 3},{'x': 1}, {'x': 9}, {'x': 3}, {'x': 2}, {'x': 7}, {'x': 3}, {'x': 6}, {'x': 8}, {'x': 2}, {'x': 7}, {'x': 9}]
df = pd.DataFrame(data, index=idx, columns=['x'])
df.index.names = ['date', 'type']
# Parse the month labels into real datetimes, then restore the MultiIndex
df = df.reset_index()
df['date'] = pd.to_datetime(df['date'], format='%b-%y')
df = df.set_index(['date', 'type'])
It looks like this:
                 x
date       type
2018-01-01 A     1
           B     5
2018-02-01 B     3
2018-03-01 A     2
           B     7
           C     3
2018-04-01 A     1
           B     6
2018-05-01 B     3
2018-06-01 A     5
           B     2
           C     3
2018-07-01 A     1
2018-08-01 B     9
           C     3
2018-09-01 A     2
           B     7
2018-10-01 C     3
           A     6
           B     8
2018-11-01 A     2
2018-12-01 B     7
           C     9
My goal is to improve this slow for-loop code. My code is below:
df=df.reset_index()
df['y']=0
for j in df['date'].unique():
    list_1=list(df['type'][df['date']==j].index)
    df['y'][list_1]=np.mean(df['x'][df['date']==j])
The result is as follows:
date type x y
0 2018-01-01 A 1 3.000000
1 2018-01-01 B 5 3.000000
2 2018-02-01 B 3 3.000000
3 2018-03-01 A 2 4.000000
4 2018-03-01 B 7 4.000000
5 2018-03-01 C 3 4.000000
6 2018-04-01 A 1 3.500000
7 2018-04-01 B 6 3.500000
8 2018-05-01 B 3 3.000000
9 2018-06-01 A 5 3.333333
10 2018-06-01 B 2 3.333333
11 2018-06-01 C 3 3.333333
12 2018-07-01 A 1 1.000000
13 2018-08-01 B 9 6.000000
14 2018-08-01 C 3 6.000000
15 2018-09-01 A 2 4.500000
16 2018-09-01 B 7 4.500000
17 2018-10-01 C 3 5.666667
18 2018-10-01 A 6 5.666667
19 2018-10-01 B 8 5.666667
20 2018-11-01 A 2 2.000000
21 2018-12-01 B 7 8.000000
22 2018-12-01 C 9 8.000000
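I tried the following Pandas code, but it did not work (I still need to see more examples to understand how it works):
df['y'] = df.groupby('date')['x'].mean().reset_index(level=2, drop=True).swaplevel(0,1)
My goal is to measure the number of types per date.
My slow code is:
df=df.reset_index()
df['y']=0
for j in df['date'].unique():
    list_1=list(df['type'][df['date']==j].index)
    df['y'][list_1]=len(df['type'][df['date']==j])
The result is as follows: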
date type x y
0 2018-01-01 A 1 2
1 2018-01-01 B 5 2
2 2018-02-01 B 3 1
3 2018-03-01 A 2 3
4 2018-03-01 B 7 3
5 2018-03-01 C 3 3
6 2018-04-01 A 1 2
7 2018-04-01 B 6 2
8 2018-05-01 B 3 1
9 2018-06-01 A 5 3
10 2018-06-01 B 2 3
11 2018-06-01 C 3 3
12 2018-07-01 A 1 1
13 2018-08-01 B 9 2
14 2018-08-01 C 3 2
15 2018-09-01 A 2 2
16 2018-09-01 B 7 2
17 2018-10-01 C 3 3
18 2018-10-01 A 6 3
19 2018-10-01 B 8 3
20 2018-11-01 A 2 1
21 2018-12-01 B 7 2
22 2018-12-01 C 9 2
My goal is to measure the number of type A rows per date.
My slow code is as follows:
df=df.reset_index()
df['z']=0
df['y']=0
for index,row in df.iterrows():
    if row['type']=='A':
        df['z'][index]=1
    else:
        df['z'][index]=0
for j in df['date'].unique():
    list_1=list(df['type'][df['date']==j].index)
    df['y'][list_1]=sum(df['z'][df['date']==j])
del df['z']
It looks like this:
date type x y
0 2018-01-01 A 1 1
1 2018-01-01 B 5 1
2 2018-02-01 B 3 0
3 2018-03-01 A 2 1
4 2018-03-01 B 7 1
5 2018-03-01 C 3 1
6 2018-04-01 A 1 1
7 2018-04-01 B 6 1
8 2018-05-01 B 3 0
9 2018-06-01 A 5 1
10 2018-06-01 B 2 1
11 2018-06-01 C 3 1
12 2018-07-01 A 1 1
13 2018-08-01 B 9 0
14 2018-08-01 C 3 0
15 2018-09-01 A 2 1
16 2018-09-01 B 7 1
17 2018-10-01 C 3 1
18 2018-10-01 A 6 1
19 2018-10-01 B 8 1
20 2018-11-01 A 2 1
21 2018-12-01 B 7 0
22 2018-12-01 C 9 0
Please use the following DataFrame (df1):
idx = [np.array(['Jan-18', 'Jan-18', 'Feb-18', 'Mar-18', 'Mar-18', 'Mar-18','Apr-18', 'Apr-18', 'May-18', 'Jun-18', 'Jun-18', 'Jun-18','Jul-18', 'Aug-18', 'Aug-18', 'Sep-18', 'Sep-18', 'Oct-18','Oct-18', 'Oct-18', 'Nov-18', 'Dec-18', 'Dec-18',]),np.array(['A', 'B', 'B', 'A', 'B', 'C', 'A', 'B', 'B', 'A', 'B', 'C','A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'])]
data = [{'x': 10, 'z': 1}, {'x': 50, 'z': 0}, {'x': 30, 'z': 0}, {'x': 20, 'z': 0}, {'x': 70, 'z': 1}, {'x': 30, 'z': 1},{'x': 10, 'z': 1}, {'x': 60, 'z': 0}, {'x': 30, 'z': 0}, {'x': 50, 'z': 1}, {'x': 20, 'z': 0}, {'x': 30, 'z': 1},{'x': 10, 'z': 0}, {'x': 90, 'z': 1}, {'x': 30, 'z': 1}, {'x': 20, 'z': 1}, {'x': 70, 'z': 0}, {'x': 30, 'z': 0}, {'x': 60, 'z': 1}, {'x': 80, 'z': 1}, {'x': 20, 'z': 0}, {'x': 70, 'z': 0}, {'x': 90, 'z': 1}]
df1 = pd.DataFrame(data, index=idx, columns=['x','z'])
df1.index.names=['date','type']
df1=df1.reset_index()
df1['date'] = pd.to_datetime(df1['date'],format = '%b-%y')
df1=df1.set_index(['date','type'])
This DataFrame (df1) is as follows:
                  x  z
date       type
2018-01-01 A     10  1
           B     50  0
2018-02-01 B     30  0
2018-03-01 A     20  0
           B     70  1
           C     30  1
2018-04-01 A     10  1
           B     60  0
2018-05-01 B     30  0
2018-06-01 A     50  1
           B     20  0
           C     30  1
2018-07-01 A     10  0
2018-08-01 B     90  1
           C     30  1
2018-09-01 A     20  1
           B     70  0
2018-10-01 C     30  0
           A     60  1
           B     80  1
2018-11-01 A     20  0
2018-12-01 B     70  0
           C     90  1
My slow code is:
df1=df1.reset_index()
df1['h']=0
df1['k']=0
df1['y']=0
for index,row in df1.iterrows():
    if row['type']=='A':
        df1['h'][index]=1
    else:
        df1['h'][index]=0
for index,row in df1.iterrows():
    if row['z']==1 and row['h']==1:
        df1['k'][index]=1
    else:
        df1['k'][index]=0
for j in df1['date'].unique():
    list_1=list(df1['type'][df1['date']==j].index)
    df1['y'][list_1]=sum(df1['k'][df1['date']==j])
del df1['h']
del df1['k']
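Although the question seems long, I know the Pandas answer can be just a few lines of code. It would be very useful if you could let me know how much faster your code is than mine.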
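Answer 0 (score: 3)
You are looking for groupby + transform. The .transform is the key here, because it broadcasts the result back to every row of the original DataFrame that belongs to the group.
For the first case, we can simply transform on x.
For the unique types, bringing the level into a column is faster than working with the index, so assign it as a column and count the number of unique values within each group.
For the last two conditions, you can create a Boolean column that flags whether the row satisfies the condition, and sum it within each group.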
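To make the broadcasting behaviour of transform concrete, here is a minimal sketch on a toy frame. The names toy, per_group and per_row are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# Hypothetical toy data: two rows for date "d1", one row for date "d2"
toy = pd.DataFrame({
    "date": ["d1", "d1", "d2"],
    "x": [1, 5, 4],
})

# Aggregation: one value per group, indexed by the group key
per_group = toy.groupby("date")["x"].mean()

# Transform: the group value repeated for every row, aligned to toy's index
per_row = toy.groupby("date")["x"].transform("mean")

# Because per_row shares toy's index, it can be assigned directly as a column
toy["y"] = per_row

print(per_group)
print(toy)
```

The aggregated result has one entry per date, so assigning it to a column of the original frame would not line up row by row; the transformed result has one entry per original row, which is why it can be assigned without any index juggling.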
#1 Get the mean of `x` by date
df['x_avg'] = df.groupby('date').x.transform('mean')
#2 Get the # of unique types.
df['N'] = (df.assign(TYPE = df.index.get_level_values('type'))
             .groupby('date').TYPE.transform('nunique'))
#3 Get the number of Type == A within a group
df['num_A'] = (df.assign(eqA = (df.index.get_level_values('type') == 'A'))
                 .groupby('date').eqA.transform('sum').astype(int))
#4 Really just a slight extension of 3
df1['cond_4'] = (df1.assign(to_sum = ((df1.index.get_level_values('type') == 'A')
                                      & (df1.z == 1)).astype(int))
                    .groupby('date').to_sum.transform('sum'))
Output df:
                 x  num_A     x_avg  N
date       type
2018-01-01 A     1      1  3.000000  2
           B     5      1  3.000000  2
2018-02-01 B     3      0  3.000000  1
2018-03-01 A     2      1  4.000000  3
           B     7      1  4.000000  3
           C     3      1  4.000000  3
2018-04-01 A     1      1  3.500000  2
           B     6      1  3.500000  2
2018-05-01 B     3      0  3.000000  1
2018-06-01 A     5      1  3.333333  3
           B     2      1  3.333333  3
           C     3      1  3.333333  3
2018-07-01 A     1      1  1.000000  1
2018-08-01 B     9      0  6.000000  2
           C     3      0  6.000000  2
2018-09-01 A     2      1  4.500000  2
           B     7      1  4.500000  2
2018-10-01 C     3      1  5.666667  3
           A     6      1  5.666667  3
           B     8      1  5.666667  3
2018-11-01 A     2      1  2.000000  1
2018-12-01 B     7      0  8.000000  2
           C     9      0  8.000000  2
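If date and type are ordinary columns (as they are after the reset_index() calls in the question), the same four results can be computed without touching the index. The following is only a sketch under that assumption, using the first six rows of the question's df1. The helper loop_num_a and the column names in vec are hypothetical, and the last line prints whatever your machine measures rather than any promised speed-up.

```python
import timeit
import pandas as pd

# First six rows of the question's df1, with date/type as plain columns
df1 = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-01", "2018-01-01", "2018-02-01",
                            "2018-03-01", "2018-03-01", "2018-03-01"]),
    "type": ["A", "B", "B", "A", "B", "C"],
    "x": [10, 50, 30, 20, 70, 30],
    "z": [1, 0, 0, 0, 1, 1],
})

by_date = df1.groupby("date")
is_a = df1["type"].eq("A")  # Boolean mask: row is of type A

vec = pd.DataFrame({
    # Case 1: mean of x per date
    "x_avg": by_date["x"].transform("mean"),
    # Case 2: number of (non-null) type entries per date
    "n_rows": by_date["type"].transform("count"),
    # Case 3: number of type A rows per date
    "num_A": is_a.astype(int).groupby(df1["date"]).transform("sum"),
    # Case 4: number of rows per date with type A and z == 1
    "cond_4": (is_a & df1["z"].eq(1)).astype(int)
              .groupby(df1["date"]).transform("sum"),
})
print(vec)

def loop_num_a(frame):
    """Per-date loop in the spirit of the question's case 3 (hypothetical helper)."""
    out = pd.Series(0, index=frame.index)
    for d in frame["date"].unique():
        mask = frame["date"] == d
        out.loc[mask] = (frame.loc[mask, "type"] == "A").sum()
    return out

# Both approaches should agree before comparing their speed
assert loop_num_a(df1).tolist() == vec["num_A"].tolist()

t_loop = timeit.timeit(lambda: loop_num_a(df1), number=20)
t_vec = timeit.timeit(
    lambda: is_a.astype(int).groupby(df1["date"]).transform("sum"), number=20)
print(f"loop: {t_loop:.4f}s, transform: {t_vec:.4f}s")
```

The ratio between the two timings depends on the number of rows, the number of distinct dates and the pandas version, so on a six-row frame the difference may be negligible; running the same comparison on the full-size frame gives a more meaningful answer to the "how much faster" part of the question.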
Answer 1 (score: 1)
For the first case, you can try the following:
df['y'] = df.groupby('date')['x'].transform(np.mean)
For the last case: