Python Pandas: operating on rows

Time: 2014-05-30 08:56:45

Tags: python pandas dataframe

My dataframe looks like this:

Store,Dept,Date,Sales
1,1,2010-02-05,245
1,1,2010-02-12,449
1,1,2010-02-19,455
1,1,2010-02-26,154
1,1,2010-03-05,29
1,1,2010-03-12,239
1,1,2010-03-19,264

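Simply put, I need to add another column called 'id' that is the concatenation of Store, Dept and Date, e.g. "1_1_2010-02-05". I assumed I could do this with df['id'] = df['Store'] + '_' + df['Dept'] + '_' + df['Date'], but it turns out that does not work.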

Similarly, I need to add a new column holding the log of Sales. I tried df['logSales'] = math.log(df['Sales']), and that does not work either.

2 answers:

Answer 0: (score: 3)

Store and Dept are integer columns, so convert them to strings before concatenating with +:

In [25]: df['id'] = df['Store'].astype(str) +'_' +df['Dept'].astype(str) +'_'+df['Date']

In [26]: df
Out[26]: 
   Store  Dept        Date  Sales              id
0      1     1  2010-02-05    245  1_1_2010-02-05
1      1     1  2010-02-12    449  1_1_2010-02-12
2      1     1  2010-02-19    455  1_1_2010-02-19
3      1     1  2010-02-26    154  1_1_2010-02-26
4      1     1  2010-03-05     29  1_1_2010-03-05
5      1     1  2010-03-12    239  1_1_2010-03-12
6      1     1  2010-03-19    264  1_1_2010-03-19
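
To see why the conversion is needed, here is a minimal sketch that reproduces the failing attempt from the question and then the fix. It assumes the data is read with pd.read_csv without parse_dates, so Date is read as text; the names csv_text and concat_failed are only for illustration.

```python
import io

import pandas as pd

# Two rows of the sample data from the question are enough to show the effect.
csv_text = """Store,Dept,Date,Sales
1,1,2010-02-05,245
1,1,2010-02-12,449
"""
df = pd.read_csv(io.StringIO(csv_text))

# Store and Dept are parsed as integers, so adding a string to them fails.
try:
    df['Store'] + '_' + df['Dept'] + '_' + df['Date']
    concat_failed = False
except TypeError:
    concat_failed = True

# Converting the integer columns to strings first makes + mean concatenation.
df['id'] = df['Store'].astype(str) + '_' + df['Dept'].astype(str) + '_' + df['Date']
print(concat_failed)
print(df['id'].tolist())
```

If Date were parsed as a datetime column instead, it would presumably need its own conversion to string before concatenating.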

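For the log, you are better off using the numpy function np.log (with numpy imported as np). It is vectorized, whereas math.log can only handle a single scalar value: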

In [34]: df['logSales'] = np.log(df['Sales'])

In [35]: df
Out[35]: 
   Store  Dept        Date  Sales              id  logSales
0      1     1  2010-02-05    245  1_1_2010-02-05  5.501258
1      1     1  2010-02-12    449  1_1_2010-02-12  6.107023
2      1     1  2010-02-19    455  1_1_2010-02-19  6.120297
3      1     1  2010-02-26    154  1_1_2010-02-26  5.036953
4      1     1  2010-03-05     29  1_1_2010-03-05  3.367296
5      1     1  2010-03-12    239  1_1_2010-03-12  5.476464
6      1     1  2010-03-19    264  1_1_2010-03-19  5.575949
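
As a small sketch of the difference, the snippet below calls both functions on a whole column. The Series sales and the flag scalar_call_failed are illustrative names, not part of the original session.

```python
import math

import numpy as np
import pandas as pd

sales = pd.Series([245, 449, 455], name='Sales')

# math.log expects one number, so passing a multi-element Series raises TypeError.
try:
    math.log(sales)
    scalar_call_failed = False
except TypeError:
    scalar_call_failed = True

# np.log is applied element by element and returns a Series of the same length.
log_sales = np.log(sales)
print(scalar_call_failed)
print(log_sales)
```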

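To summarize the comments: for a dataframe of this size, there is little performance difference between using apply and using vectorized functions (which operate on whole columns), but the difference will show once your real dataframe gets larger.
Apart from that, I think the solution above is also syntactically simpler.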
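
If you want to check the performance point on your own machine, a rough timing sketch could look like the following. The frame size n_rows and the random Sales values are assumptions made up for the comparison, and the timings will vary by machine and library version.

```python
import math
import timeit

import numpy as np
import pandas as pd

n_rows = 100_000  # assumed size, chosen only to make the difference measurable
rng = np.random.default_rng(0)
big = pd.DataFrame({'Sales': rng.integers(1, 1000, size=n_rows)})

# Both approaches compute the same values.
vectorized = np.log(big['Sales'])
elementwise = big['Sales'].apply(math.log)

# Time five repetitions of each approach.
t_vec = timeit.timeit(lambda: np.log(big['Sales']), number=5)
t_apply = timeit.timeit(lambda: big['Sales'].apply(math.log), number=5)
print(f"np.log: {t_vec:.4f}s, apply(math.log): {t_apply:.4f}s")
```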

Answer 1: (score: 2)

In [153]:
import pandas as pd
import io

temp = """Store,Dept,Date,Sales
1,1,2010-02-05,245
1,1,2010-02-12,449
1,1,2010-02-19,455
1,1,2010-02-26,154
1,1,2010-03-05,29
1,1,2010-03-12,239
1,1,2010-03-19,264"""
df = pd.read_csv(io.StringIO(temp))
df
Out[153]:
   Store  Dept        Date  Sales
0      1     1  2010-02-05    245
1      1     1  2010-02-12    449
2      1     1  2010-02-19    455
3      1     1  2010-02-26    154
4      1     1  2010-03-05     29
5      1     1  2010-03-12    239
6      1     1  2010-03-19    264

[7 rows x 4 columns]
In [154]:
# apply a lambda function row-wise; Store and Dept must be converted to strings to build the new string
df['id'] = df.apply(lambda x: str(x['Store']) + '_' + str(x['Dept']) + '_' + x['Date'], axis=1)
df
Out[154]:
   Store  Dept        Date  Sales              id
0      1     1  2010-02-05    245  1_1_2010-02-05
1      1     1  2010-02-12    449  1_1_2010-02-12
2      1     1  2010-02-19    455  1_1_2010-02-19
3      1     1  2010-02-26    154  1_1_2010-02-26
4      1     1  2010-03-05     29  1_1_2010-03-05
5      1     1  2010-03-12    239  1_1_2010-03-12
6      1     1  2010-03-19    264  1_1_2010-03-19

[7 rows x 5 columns]
In [155]:

import math
# now apply log to each Sales value to create the new column
df['logSales'] = df['Sales'].apply(math.log)
df
Out[155]:
   Store  Dept        Date  Sales              id  logSales
0      1     1  2010-02-05    245  1_1_2010-02-05  5.501258
1      1     1  2010-02-12    449  1_1_2010-02-12  6.107023
2      1     1  2010-02-19    455  1_1_2010-02-19  6.120297
3      1     1  2010-02-26    154  1_1_2010-02-26  5.036953
4      1     1  2010-03-05     29  1_1_2010-03-05  3.367296
5      1     1  2010-03-12    239  1_1_2010-03-12  5.476464
6      1     1  2010-03-19    264  1_1_2010-03-19  5.575949

[7 rows x 6 columns]