I have a DataFrame containing information about employee salaries. It has a little over 900,000 rows.
Sample:
+----+-------------+---------------+----------+
| | table_num | name | salary |
|----+-------------+---------------+----------|
| 0 | 001234 | John Johnson | 1200 |
| 1 | 001234 | John Johnson | 1000 |
| 2 | 001235 | John Johnson | 1000 |
| 3 | 001235 | John Johnson | 1200 |
| 4 | 001235 | John Johnson | 1000 |
| 5 | 001235 | Steve Stevens | 1000 |
| 6 | 001236 | Steve Stevens | 1200 |
| 7 | 001236 | Steve Stevens | 1200 |
| 8 | 001236 | Steve Stevens | 1200 |
+----+-------------+---------------+----------+
dtypes:
table_num: string
name: string
salary: float
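I need to add a column indicating whether the salary increased or decreased.
I am using the shift() function to compare values between rows.
The main problem is filtering and iterating over every unique employee in the whole dataset.
My script takes about three and a half hours.
How can I do this faster?
My script: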
# keep only the unique combinations of 'table_num' and 'name',
# since the same 'table_num' can occur for different 'name' values
# and the same name sometimes appears with a different 'table_num'
names_df = df[['table_num', 'name']].drop_duplicates()
# then extract each table_num and name pair
for i in range(len(names_df)):  ### Bottleneck of whole script ###
    t = names_df.iloc[i, 0]
    n = names_df.iloc[i, 1]
    # use shift() and a lambda to check for a difference between two consecutive rows
    diff_sal = (df[(df['table_num']==t)
                   & ((df['name']==n))]['salary'] - df[(df['table_num']==t)
                   & ((df['name']==n))]['salary'].shift(1)).apply(lambda x: 1 if x>0 else (-1 if x<0 else 0))
    df.loc[diff_sal.index, 'inc'] = diff_sal.values
Sample input data:
import pandas as pd
df = pd.DataFrame({'table_num': ['001234','001234','001235','001235','001235','001235','001236','001236','001236'],
                   'name': ['John Johnson','John Johnson','John Johnson','John Johnson','John Johnson', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens'],
                   'salary':[1200.,1000.,1000.,1200.,1000.,1000.,1200.,1200.,1200.]})
Sample output:
+----+-------------+---------------+----------+-------+
| | table_num | name | salary | inc |
|----+-------------+---------------+----------+-------|
| 0 | 001234 | John Johnson | 1200 | 0 |
| 1 | 001234 | John Johnson | 1000 | -1 |
| 2 | 001235 | John Johnson | 1000 | 0 |
| 3 | 001235 | John Johnson | 1200 | 1 |
| 4 | 001235 | John Johnson | 1000 | -1 |
| 5 | 001235 | Steve Stevens | 1000 | 0 |
| 6 | 001236 | Steve Stevens | 1200 | 0 |
| 7 | 001236 | Steve Stevens | 1200 | 0 |
| 8 | 001236 | Steve Stevens | 1200 | 0 |
+----+-------------+---------------+----------+-------+
Answer 0 (score: 5)
df['inc'] = df.groupby(['table_num', 'name'])['salary'].diff().fillna(0.0)
df.loc[df['inc'] > 0.0, 'inc'] = 1.0
df.loc[df['inc'] < 0.0, 'inc'] = -1.0
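As a sanity check, here is a sketch that compares this grouped version with a cleaned-up copy of the loop from the question, using the sample data. The helper names inc_loop and inc_grouped are made up for this illustration and do not come from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'table_num': ['001234', '001234', '001235', '001235', '001235',
                  '001235', '001236', '001236', '001236'],
    'name': ['John Johnson'] * 5 + ['Steve Stevens'] * 4,
    'salary': [1200., 1000., 1000., 1200., 1000., 1000., 1200., 1200., 1200.],
})


def inc_loop(df):
    """Question's approach: filter the whole frame once per (table_num, name) pair."""
    out = pd.Series(0, index=df.index, dtype=int)
    pairs = df[['table_num', 'name']].drop_duplicates()
    for t, n in pairs.itertuples(index=False):
        sal = df.loc[(df['table_num'] == t) & (df['name'] == n), 'salary']
        delta = sal - sal.shift(1)
        out.loc[sal.index] = delta.apply(lambda x: 1 if x > 0 else (-1 if x < 0 else 0))
    return out


def inc_grouped(df):
    """This answer's approach: one grouped diff, then map the sign to -1/0/1."""
    d = df.groupby(['table_num', 'name'])['salary'].diff().fillna(0.0)
    d[d > 0.0] = 1.0
    d[d < 0.0] = -1.0
    return d.astype(int)


print(inc_loop(df).tolist())     # [0, -1, 0, 1, -1, 0, 0, 0, 0]
print(inc_grouped(df).tolist())  # [0, -1, 0, 1, -1, 0, 0, 0, 0]
```

The loop rescans all rows for every employee, while the grouped version splits the frame into groups once, which is where the speed-up should come from.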
Answer 1 (score: 2)
Use DataFrameGroupBy.diff with numpy.sign, then cast to integers:
import numpy as np
df['new'] = np.sign(df.groupby(['table_num', 'name'])['salary'].diff().fillna(0)).astype(int)
print(df)
table_num name salary new
0 1234 John Johnson 1200 0
1 1234 John Johnson 1000 -1
2 1235 John Johnson 1000 0
3 1235 John Johnson 1200 1
4 1235 John Johnson 1000 -1
5 1235 Steve Stevens 1000 0
6 1236 Steve Stevens 1200 0
7 1236 Steve Stevens 1200 0
8 1236 Steve Stevens 1200 0
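Since the question is about run time on roughly 900,000 rows, here is a rough way to time this approach on synthetic data of that size. Everything about the data is an assumption for illustration: make_frame is a hypothetical helper, and the 50,000 employees and the three salary levels are arbitrary choices, not taken from the real dataset.

```python
import time

import numpy as np
import pandas as pd


def make_frame(n_rows, n_employees, seed=0):
    """Build a synthetic salary table (hypothetical data, not the asker's)."""
    rng = np.random.default_rng(seed)
    # sorted ids so that each employee's rows are contiguous, as in the sample
    emp = pd.Series(np.sort(rng.integers(0, n_employees, size=n_rows)))
    return pd.DataFrame({
        'table_num': emp.map('{:06d}'.format),
        'name': emp.map('employee_{}'.format),
        'salary': rng.choice([1000., 1100., 1200.], size=n_rows),
    })


df = make_frame(n_rows=900_000, n_employees=50_000)

start = time.perf_counter()
df['new'] = np.sign(
    df.groupby(['table_num', 'name'])['salary'].diff().fillna(0)
).astype(int)
elapsed = time.perf_counter() - start

print(f'{len(df)} rows processed in {elapsed:.2f} s')
```

The measured time depends on the machine and the pandas version, so treat it only as an order-of-magnitude indication.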
Answer 2 (score: 1)
shift() is the way to go, but you should avoid loops wherever possible. Here we can leverage the power of groupby() and transform(). See the pandas docs.
You can do it like this:
# group on both keys: the same name can appear under several table_num values
df.assign(inc=lambda x: x.groupby(['table_num', 'name'])
                         .salary
                         .transform(lambda y: y - y.shift(1))
                         .apply(lambda x: 1 if x>0 else (-1 if x<0 else 0))
         )
Output:
table_num name salary inc
0 001234 John Johnson 1200.0 0
1 001234 John Johnson 1000.0 -1
2 001235 John Johnson 1000.0 0
3 001235 John Johnson 1200.0 1
4 001235 John Johnson 1000.0 -1
5 001235 Steve Stevens 1000.0 0
6 001236 Steve Stevens 1200.0 0
7 001236 Steve Stevens 1200.0 0
8 001236 Steve Stevens 1200.0 0
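The same shift() idea can also be written without any Python-level lambda, by calling shift() directly on the grouped column and taking the sign of the difference. This is a sketch on the sample data from the question, grouping on both table_num and name as the expected output there requires; the variable names keys and prev are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'table_num': ['001234', '001234', '001235', '001235', '001235',
                  '001235', '001236', '001236', '001236'],
    'name': ['John Johnson'] * 5 + ['Steve Stevens'] * 4,
    'salary': [1200., 1000., 1000., 1200., 1000., 1000., 1200., 1200., 1200.],
})

keys = ['table_num', 'name']

# previous salary within the same (table_num, name) group; NaN on each group's first row
prev = df.groupby(keys)['salary'].shift(1)

# sign of the change: 1 for a raise, -1 for a cut, 0 for no change or no previous row
df['inc'] = np.sign(df['salary'] - prev).fillna(0).astype(int)

print(df['inc'].tolist())  # [0, -1, 0, 1, -1, 0, 0, 0, 0]
```

Avoiding transform(lambda ...) and apply(lambda ...) keeps the work inside pandas and NumPy instead of calling a Python function per group and per row.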
Answer 3 (score: 0)
I think you can search for the term "pandas vectorization" to speed up DataFrame operations. For your problem, you can try the following:
import pandas as pd
df = pd.DataFrame({'table_num': ['001234','001234','001235','001235','001235','001235','001236','001236','001236'],
                   'name': ['John Johnson','John Johnson','John Johnson','John Johnson','John Johnson', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens'],
                   'salary':[1200.,1000.,1000.,1200.,1000.,1000.,1200.,1200.,1200.]})
# combined key; assumes name + table_num identifies exactly one employee
df['temp'] = df['name'] + df['table_num']
# stable sort, so rows keep their original order within each key
df.sort_values('temp', kind='mergesort', inplace=True)
df['diff'] = df.groupby('temp')['salary'].diff()
# x / |x| gives +1 or -1; 0/0 and the first row of each group are NaN, mapped to 0
df['diff'] = (df['diff'] / abs(df['diff'])).fillna(0)
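One caveat with a concatenated key: two different (name, table_num) pairs could in principle produce the same string. The rows below are hypothetical and chosen only to trigger such a collision. Real names are unlikely to end in digits, so this may never happen in your data, but grouping on both columns avoids the question altogether.

```python
import pandas as pd

# Hypothetical rows: 'Ann' + '1234' and 'Ann1' + '234' both give 'Ann1234'
df = pd.DataFrame({
    'table_num': ['1234', '1234', '234', '234'],
    'name': ['Ann', 'Ann', 'Ann1', 'Ann1'],
    'salary': [1000., 1200., 1500., 1400.],
})

temp = df['name'] + df['table_num']

print(temp.nunique())                             # 1
print(df.groupby(['table_num', 'name']).ngroups)  # 2

# with the concatenated key the two employees are treated as one
merged = df.groupby(temp)['salary'].diff()
# with both columns as keys each employee gets an own first row
separate = df.groupby(['table_num', 'name'])['salary'].diff()

print(merged.tolist())    # [nan, 200.0, 300.0, -100.0]
print(separate.tolist())  # [nan, 200.0, nan, -100.0]
```

Passing the list ['table_num', 'name'] to groupby() also makes the temp column and the sort unnecessary.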