Question

My dataframe looks like:

Store,Dept,Date,Sales
1,1,2010-02-05,245
1,1,2010-02-12,449
1,1,2010-02-19,455
1,1,2010-02-26,154
1,1,2010-03-05,29
1,1,2010-03-12,239
1,1,2010-03-19,264

Simply put, I need to add another column named '_id' that is the concatenation of Store, Dept and Date, like "1_1_2010-02-05". I assumed I could do it with df['id'] = df['Store'] + '_' + df['Dept'] + '_' + df['Date'], but it turned out not to work.

Similarly, I also need to add a new column with the log of Sales. I tried df['logSales'] = math.log(df['Sales']), and that does not work either.

Answer 1

You can convert the integer columns to strings first, before concatenating with +:

In [25]: df['id'] = df['Store'].astype(str) +'_' +df['Dept'].astype(str) +'_'+df['Date']

In [26]: df
Out[26]: 
   Store  Dept        Date  Sales              id
0      1     1  2010-02-05    245  1_1_2010-02-05
1      1     1  2010-02-12    449  1_1_2010-02-12
2      1     1  2010-02-19    455  1_1_2010-02-19
3      1     1  2010-02-26    154  1_1_2010-02-26
4      1     1  2010-03-05     29  1_1_2010-03-05
5      1     1  2010-03-12    239  1_1_2010-03-12
6      1     1  2010-03-19    264  1_1_2010-03-19
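
For anyone who wants to run this outside an interactive session, here is a minimal self-contained sketch of the same concatenation. The second column, `id_from_datetime`, is my own assumption about a likely variant: if Date had been parsed to datetime64 (it is plain text here because read_csv does not parse dates by default), it would need to be formatted back to text before using +.

```python
import io
import pandas as pd

csv_text = """Store,Dept,Date,Sales
1,1,2010-02-05,245
1,1,2010-02-12,449
1,1,2010-02-19,455
"""
df = pd.read_csv(io.StringIO(csv_text))

# Store and Dept are integers, so cast them to str before using +.
# Date is already text here, so it can be appended directly.
df["id"] = df["Store"].astype(str) + "_" + df["Dept"].astype(str) + "_" + df["Date"]

# Assumed variant: with a datetime64 Date column, format it to text first.
dates = pd.to_datetime(df["Date"])
df["id_from_datetime"] = (
    df["Store"].astype(str)
    + "_"
    + df["Dept"].astype(str)
    + "_"
    + dates.dt.strftime("%Y-%m-%d")
)

print(df[["id", "id_from_datetime"]])
```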

For the log, you are better off using the NumPy function. It is vectorized (math.log can only handle a single scalar value):

In [34]: df['logSales'] = np.log(df['Sales'])

In [35]: df
Out[35]: 
   Store  Dept        Date  Sales              id  logSales
0      1     1  2010-02-05    245  1_1_2010-02-05  5.501258
1      1     1  2010-02-12    449  1_1_2010-02-12  6.107023
2      1     1  2010-02-19    455  1_1_2010-02-19  6.120297
3      1     1  2010-02-26    154  1_1_2010-02-26  5.036953
4      1     1  2010-03-05     29  1_1_2010-03-05  3.367296
5      1     1  2010-03-12    239  1_1_2010-03-12  5.476464
6      1     1  2010-03-19    264  1_1_2010-03-19  5.575949
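
To make the scalar versus vectorized point concrete, here is a small sketch using a stand-alone Series built from the first three Sales values (the variable names are mine, not from the question). It shows math.log rejecting a multi-row Series while np.log works element-wise.

```python
import math
import numpy as np
import pandas as pd

sales = pd.Series([245, 449, 455], name="Sales")

# math.log expects a single number, so a multi-row Series raises TypeError.
try:
    math.log(sales)
    scalar_call_failed = False
except TypeError:
    scalar_call_failed = True

# np.log is applied element-wise and returns a Series with the same index.
log_sales = np.log(sales)

print(scalar_call_failed)
print(log_sales)
```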

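To summarize the comments: for a dataframe of this size, the performance difference between using apply and using vectorized functions (which operate on whole columns) is small, but it will matter once your real dataframe gets larger.
Apart from that, I think the syntax of the solution above is also simpler.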
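
If you want to check the performance claim on your own machine, a rough timing sketch could look like the following. The column size (100,000 rows) and the repeat count are arbitrary assumptions, and the actual numbers will vary by hardware and pandas version, so no expected timings are given.

```python
import math
import timeit
import numpy as np
import pandas as pd

# Hypothetical larger Sales column; the size is an arbitrary choice.
sales = pd.Series(np.arange(1, 100_001), name="Sales")

# Time the element-by-element apply against the vectorized NumPy call.
apply_seconds = timeit.timeit(lambda: sales.apply(math.log), number=5)
vectorized_seconds = timeit.timeit(lambda: np.log(sales), number=5)

print(f"apply(math.log): {apply_seconds:.4f}s")
print(f"np.log:          {vectorized_seconds:.4f}s")

# Both approaches should produce the same values.
same_values = np.allclose(sales.apply(math.log), np.log(sales))
print(same_values)
```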

Answer 2

In [153]:
import pandas as pd
import io

temp = """Store,Dept,Date,Sales
1,1,2010-02-05,245
1,1,2010-02-12,449
1,1,2010-02-19,455
1,1,2010-02-26,154
1,1,2010-03-05,29
1,1,2010-03-12,239
1,1,2010-03-19,264"""
df = pd.read_csv(io.StringIO(temp))
df
Out[153]:
   Store  Dept        Date  Sales
0      1     1  2010-02-05    245
1      1     1  2010-02-12    449
2      1     1  2010-02-19    455
3      1     1  2010-02-26    154
4      1     1  2010-03-05     29
5      1     1  2010-03-12    239
6      1     1  2010-03-19    264

[7 rows x 4 columns]
In [154]:
# Apply a lambda function row-wise; Store and Dept must be converted to strings in order to build the new string
df['id'] = df.apply(lambda x: str(x['Store']) + ' ' + str(x['Dept']) + '_' + x['Date'], axis=1)
df
Out[154]:
   Store  Dept        Date  Sales              id
0      1     1  2010-02-05    245  1 1_2010-02-05
1      1     1  2010-02-12    449  1 1_2010-02-12
2      1     1  2010-02-19    455  1 1_2010-02-19
3      1     1  2010-02-26    154  1 1_2010-02-26
4      1     1  2010-03-05     29  1 1_2010-03-05
5      1     1  2010-03-12    239  1 1_2010-03-12
6      1     1  2010-03-19    264  1 1_2010-03-19

[7 rows x 5 columns]
In [155]:

import math
# now apply log to sales to create the new column
df['logSales'] = df['Sales'].apply(math.log)
df
Out[155]:
   Store  Dept        Date  Sales              id  logSales
0      1     1  2010-02-05    245  1 1_2010-02-05  5.501258
1      1     1  2010-02-12    449  1 1_2010-02-12  6.107023
2      1     1  2010-02-19    455  1 1_2010-02-19  6.120297
3      1     1  2010-02-26    154  1 1_2010-02-26  5.036953
4      1     1  2010-03-05     29  1 1_2010-03-05  3.367296
5      1     1  2010-03-12    239  1 1_2010-03-12  5.476464
6      1     1  2010-03-19    264  1 1_2010-03-19  5.575949

[7 rows x 6 columns]
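
Note that the id built in this answer has a space between Store and Dept. To get the "1_1_2010-02-05" format the question asks for with the same row-wise apply approach, a sketch along these lines should work (the two-row frame is just a cut-down stand-in for the data above):

```python
import pandas as pd

df = pd.DataFrame(
    {"Store": [1, 1], "Dept": [1, 1], "Date": ["2010-02-05", "2010-02-12"]}
)

# Row-wise apply, joining all three parts with "_" instead of a space.
# The f-string converts the integer Store and Dept values to text.
df["id"] = df.apply(
    lambda row: f"{row['Store']}_{row['Dept']}_{row['Date']}", axis=1
)

print(df)
```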

Python Pandas: operate on rows
