I have purchase data and want to label each record with a new column that describes the time of day of the purchase. To do this, I use the hour from each purchase's timestamp column.
The labels should work like this:
hour 4 - 7 => 'morning'
hour 8 - 11 => 'before midday'
...
I have already extracted the hour from the timestamp. I now have a DataFrame with 50 million records that looks like this:
   user_id           timestamp  hour
0       11 2015-08-21 06:42:44     6
1       11 2015-08-20 13:38:58    13
2       11 2015-08-20 13:37:47    13
3       11 2015-08-21 06:59:05     6
4       11 2015-08-20 13:15:21    13
My current approach is to call .iterrows() six times, each with a different condition:
for index, row in basket_times[(basket_times['hour'] >= 4) & (basket_times['hour'] < 8)].iterrows():
    basket_times['periode'] = 'morning'
and so on.
However, just one of the six loops over the 50 million records already takes an hour. Is there a better way?
Answer 0 (score: 1)
You can try loc with a boolean mask. I changed df for testing:
print(basket_times)
   user_id           timestamp  hour
0       11 2015-08-21 06:42:44     6
1       11 2015-08-20 13:38:58    13
2       11 2015-08-20 09:37:47     9
3       11 2015-08-21 06:59:05     6
4       11 2015-08-20 13:15:21    13
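# create boolean masks; hour 11 belongs to 'before midday' (8 - 11 in the question)
morning = (basket_times['hour'] >= 4) & (basket_times['hour'] < 8)
beforemidday = (basket_times['hour'] >= 8) & (basket_times['hour'] < 12)
# the question does not give the remaining ranges, so this upper bound is only illustrative
aftermidday = (basket_times['hour'] >= 12) & (basket_times['hour'] < 15)
print(morning)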
0     True
1    False
2    False
3     True
4    False
Name: hour, dtype: bool
print(beforemidday)
0    False
1    False
2     True
3    False
4    False
Name: hour, dtype: bool
print(aftermidday)
0    False
1     True
2    False
3    False
4     True
Name: hour, dtype: bool
basket_times.loc[morning, 'periode'] = 'morning'
basket_times.loc[beforemidday, 'periode'] = 'before midday'
basket_times.loc[aftermidday, 'periode'] = 'after midday'
print(basket_times)
   user_id           timestamp  hour        periode
0       11 2015-08-21 06:42:44     6        morning
1       11 2015-08-20 13:38:58    13   after midday
2       11 2015-08-20 09:37:47     9  before midday
3       11 2015-08-21 06:59:05     6        morning
4       11 2015-08-20 13:15:21    13   after midday
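As a possible variation on the separate loc assignments, the same conditions can be collected into a single numpy.select call, which picks the label of the first matching condition for each row. This is only a minimal sketch: the helper name label_periods and the 'other' default are assumptions made for illustration, and only the two periods named in the question are covered.

```python
import numpy as np
import pandas as pd


def label_periods(hours):
    """Return an array of period labels for a Series of integer hours."""
    conditions = [
        (hours >= 4) & (hours < 8),    # hour 4 - 7
        (hours >= 8) & (hours < 12),   # hour 8 - 11
    ]
    labels = ['morning', 'before midday']
    # Rows matching no condition get the placeholder label 'other' (an assumption).
    return np.select(conditions, labels, default='other')


basket_times = pd.DataFrame({'hour': [6, 13, 9, 6, 13]})
basket_times['periode'] = label_periods(basket_times['hour'])
print(basket_times)
```

Like the mask version, this works on whole columns at once, so no Python-level loop over the rows is needed.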
Timings with len(df) = 500k (a uses loc with boolean masks, b uses map; both are defined in the test code below):
In [87]: %timeit a(df)
10 loops, best of 3: 34 ms per loop
In [88]: %timeit b(df1)
1 loops, best of 3: 490 ms per loop
Test code:
import pandas as pd
import io

temp = u"""user_id;timestamp;hour
11;2015-08-21 06:42:44;6
11;2015-08-20 10:38:58;10
11;2015-08-20 09:37:47;9
11;2015-08-21 06:59:05;6
11;2015-08-20 10:15:21;10"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", index_col=None, parse_dates=[1])
# repeat the 5 sample rows 100000 times to get 500k rows
df = pd.concat([df]*100000).reset_index(drop=True)
print(df.shape)
#(500000, 3)
df1 = df.copy()

def a(basket_times):
    # vectorized: boolean masks plus loc assignment
    morning = (basket_times['hour'] >= 4) & (basket_times['hour'] < 8)
    # < 12 so that hour 11 is included, matching 8 <= hour <= 11 in b
    beforemidday = (basket_times['hour'] >= 8) & (basket_times['hour'] < 12)
    basket_times.loc[morning, 'periode'] = 'morning'
    basket_times.loc[beforemidday, 'periode'] = 'before midday'
    return basket_times

def b(basket_times):
    # element-wise: a Python function applied to every hour with map
    def get_periode(hour):
        if 4 <= hour <= 7:
            return 'morning'
        elif 8 <= hour <= 11:
            return 'before midday'
    basket_times['periode'] = basket_times['hour'].map(get_periode)
    return basket_times

print(a(df))
print(b(df1))
Answer 1 (score: 1)
You can define a function that maps each hour to the desired string, and then use map.
def get_periode(hour):
    if 4 <= hour <= 7:
        return 'morning'
    elif 8 <= hour <= 11:
        return 'before midday'

basket_times['periode'] = basket_times['hour'].map(get_periode)
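Since the mapping is from contiguous hour ranges to labels, binning with pandas.cut may be another option to consider. The sketch below is an illustration under assumptions, not the method from the answer above: the bin edges cover only the two periods named in the question, and the sample hours are made up for the demonstration.

```python
import pandas as pd

# Left-closed, right-open bins: [4, 8) -> 'morning', [8, 12) -> 'before midday'.
# Hours outside these bins become NaN; further edges and labels would be
# needed for the periods the question leaves out.
bins = [4, 8, 12]
labels = ['morning', 'before midday']

basket_times = pd.DataFrame({'hour': [6, 13, 9, 6, 13]})
basket_times['periode'] = pd.cut(
    basket_times['hour'], bins=bins, labels=labels, right=False
)
print(basket_times)
```

The resulting column has a categorical dtype, which can also reduce memory use when only a handful of distinct labels are spread over many rows.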