Iterating through a Pandas DataFrame, using conditions, and adding a column

Date: 2016-03-04 19:53:21

Tags: python pandas

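I have purchase data and want to label it with a new column that gives the time of day of each purchase. For this I use the hour from each purchase's timestamp column.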

The labels should work like this:

 hour 4 - 7 => 'morning'
 hour 8 - 11 => 'before midday'
 ...

I have already extracted the hour from the timestamp. I now have a DataFrame with 50 million records that looks like this:

    user_id  timestamp              hour
0   11       2015-08-21 06:42:44    6
1   11       2015-08-20 13:38:58    13
2   11       2015-08-20 13:37:47    13
3   11       2015-08-21 06:59:05    6
4   11       2015-08-20 13:15:21    13

My current approach is to call .iterrows() six times, each with a different condition:

for index, row in basket_times[(basket_times['hour']  >= 4) & (basket_times['hour'] < 8)].iterrows():
    basket_times['periode'] = 'morning'

and so on.

However, just one of the six loops over the 50 million records already takes an hour. Is there a better way to do this?

2 Answers:

Answer 0: (score: 1)

You can try loc with boolean masks. I changed the df for testing:

print(basket_times)
   user_id           timestamp  hour
0       11 2015-08-21 06:42:44     6
1       11 2015-08-20 13:38:58    13
2       11 2015-08-20 09:37:47     9
3       11 2015-08-21 06:59:05     6
4       11 2015-08-20 13:15:21    13

# Create boolean masks (each range is left-closed, right-open)
morning = (basket_times['hour'] >= 4) & (basket_times['hour'] < 8)
# 'before midday' covers hours 8-11, as specified in the question
beforemidday = (basket_times['hour'] >= 8) & (basket_times['hour'] < 12)
# The question does not give this range; hours 12-14 are an illustrative choice
aftermidday = (basket_times['hour'] >= 12) & (basket_times['hour'] < 15)
print(morning)
0     True
1    False
2    False
3     True
4    False
Name: hour, dtype: bool

print(beforemidday)
0    False
1    False
2     True
3    False
4    False
Name: hour, dtype: bool
print(aftermidday)
0    False
1     True
2    False
3    False
4     True
Name: hour, dtype: bool
# Assign the label only to the rows selected by each mask
basket_times.loc[morning, 'periode'] = 'morning'
basket_times.loc[beforemidday, 'periode'] = 'before midday'
basket_times.loc[aftermidday, 'periode'] = 'after midday'
print(basket_times)
   user_id           timestamp  hour        periode
0       11 2015-08-21 06:42:44     6        morning
1       11 2015-08-20 13:38:58    13   after midday
2       11 2015-08-20 09:37:47     9  before midday
3       11 2015-08-21 06:59:05     6        morning
4       11 2015-08-20 13:15:21    13   after midday

Timings (len(df) = 500k):

In [87]: %timeit a(df)
10 loops, best of 3: 34 ms per loop

In [88]: %timeit b(df1)
1 loops, best of 3: 490 ms per loop

Code for testing:

import pandas as pd
import io

temp=u"""user_id;timestamp;hour
11;2015-08-21 06:42:44;6
11;2015-08-20 10:38:58;10
11;2015-08-20 09:37:47;9
11;2015-08-21 06:59:05;6
11;2015-08-20 10:15:21;10"""
# After testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", index_col=None, parse_dates=[1])
df = pd.concat([df]*100000).reset_index(drop=True)
print(df.shape)
# (500000, 3)
df1 = df.copy()

def a(basket_times):
    # Vectorized: build boolean masks, then assign labels with .loc
    morning = (basket_times['hour'] >= 4) & (basket_times['hour'] < 8)
    # Upper bound 12 so that hour 11 is included, matching b() below
    beforemidday = (basket_times['hour'] >= 8) & (basket_times['hour'] < 12)
    basket_times.loc[morning, 'periode'] = 'morning'
    basket_times.loc[beforemidday, 'periode'] = 'before midday'
    return basket_times

def b(basket_times):
    def get_periode(hour):
        if 4 <= hour <= 7:
            return 'morning'
        elif 8 <= hour <= 11:
            return 'before midday'

    basket_times['periode'] = basket_times['hour'].map(get_periode)
    return basket_times

print(a(df))
print(b(df1))
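
As a possible alternative to writing one mask per label, the same hour ranges can be expressed as bin edges and passed to pd.cut. This is only a sketch and not part of the answer above: the small DataFrame is made up for illustration, and only the two ranges named in the question are included, so hours outside them come back as NaN.

```python
import pandas as pd

# Small stand-in for the 50-million-row frame from the question (illustrative data).
basket_times = pd.DataFrame({
    "user_id": [11, 11, 11, 11, 11],
    "hour": [6, 13, 9, 6, 2],
})

# With right=False the bins are left-closed, right-open: [4, 8) and [8, 12).
# Further edges and labels for the remaining periods would be appended here.
bins = [4, 8, 12]
labels = ["morning", "before midday"]

# pd.cut assigns every row to its bin in one vectorized call;
# hours that fall outside all bins (13 and 2 here) become NaN.
basket_times["periode"] = pd.cut(
    basket_times["hour"], bins=bins, labels=labels, right=False
)
print(basket_times)
```

The resulting column has a categorical dtype, which also tends to use less memory than plain strings on a frame of this size.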

Answer 1: (score: 1)

You can define a function that maps each hour range to the desired string and then use map:

def get_periode(hour):
    if 4 <= hour <= 7:
        return 'morning'
    elif 8 <= hour <= 11:
        return 'before midday'

basket_times['periode'] = basket_times['hour'].map(get_periode)
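
A variation on the same idea, sketched here under assumptions: since there are only 24 possible hours, the function can be replaced by a dict lookup table, which Series.map also accepts. The name hour_to_periode and the sample data are hypothetical, and only the two ranges given in the question are filled in. I would expect this to be faster than calling a Python function per row, but that should be measured on the real data.

```python
import pandas as pd

# Illustrative data; the real frame has 50 million rows.
basket_times = pd.DataFrame({"hour": [6, 13, 9, 6, 10]})

# Hypothetical lookup table with one entry per hour of each known range.
hour_to_periode = {}
for h in range(4, 8):      # hours 4-7
    hour_to_periode[h] = "morning"
for h in range(8, 12):     # hours 8-11
    hour_to_periode[h] = "before midday"

# Series.map accepts a dict; hours missing from the dict become NaN.
basket_times["periode"] = basket_times["hour"].map(hour_to_periode)
print(basket_times)
```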