Looping through rows in a pandas DataFrame

Time: 2014-07-01 22:31:22

Tags: python pandas dataframe

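I have two data frames: one has only company names and dates, and the other has only timestamps. The timestamp frame looks like this: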

    creationdate
0   2012-05-01 18:20:27.167000
1   2012-05-01 19:16:08.070000
2   2012-05-01 19:20:07.880000
3   2012-05-01 19:33:02.200000
4   2012-05-01 19:35:09.173000
5   2012-05-01 20:18:55.610000
6   2012-05-01 20:26:27.577000
7   2012-05-01 20:32:34.343000
8   2012-05-01 20:39:31.257000
9   2012-05-01 21:04:50.357000
10  2012-05-01 21:54:18.983000
11  2012-05-02 02:23:53.250000
12  2012-05-02 02:40:27.643000
13  2012-05-02 08:44:28.260000

and the frame with company names and dates looks like this:

   sitename        date
0    Google  2012-05-01
1    Google  2012-05-02
2    Google  2012-05-03
3    Google  2012-05-04
4    Google  2012-05-05
5    Google  2012-05-06
6    Google  2012-05-07
7    Google  2012-05-08
8    Google  2012-05-09
9    Google  2012-05-10

How can I efficiently iterate over the second data frame and, for each date in it, extract the matching timestamps from the first data frame?

1 answer:

Answer 0: (score: 2)

Merging the two data frames with an inner join on the date should work, so no explicit row loop is needed:

In [96]: df1['date'] = pd.DatetimeIndex(df1.creationdate).date

In [97]: df2['date'] = pd.DatetimeIndex(df2.date).date

In [98]: df = df1.merge(df2, on='date', how='inner')

In [99]: df
Out[99]: 
                 creationdate        date sitename
0  2012-05-01 18:20:27.167000  2012-05-01   Google
1  2012-05-01 19:16:08.070000  2012-05-01   Google
2  2012-05-01 19:20:07.880000  2012-05-01   Google
3  2012-05-01 19:33:02.200000  2012-05-01   Google
4  2012-05-01 19:35:09.173000  2012-05-01   Google
5  2012-05-01 20:18:55.610000  2012-05-01   Google
6  2012-05-01 20:26:27.577000  2012-05-01   Google
7  2012-05-01 20:32:34.343000  2012-05-01   Google
8  2012-05-01 20:39:31.257000  2012-05-01   Google
9  2012-05-01 21:04:50.357000  2012-05-01   Google
10 2012-05-01 21:54:18.983000  2012-05-01   Google
11 2012-05-02 02:23:53.250000  2012-05-02   Google
12 2012-05-02 02:40:27.643000  2012-05-02   Google
13 2012-05-02 08:44:28.260000  2012-05-02   Google
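
For readers who want to reproduce the merge outside an IPython session, here is a minimal self-contained sketch. The sample frames are abbreviated stand-ins for the ones in the question, and the `.dt.date` accessor is used in place of `pd.DatetimeIndex(...).date`; both are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

# Abbreviated stand-ins for the two frames in the question.
df1 = pd.DataFrame({
    "creationdate": pd.to_datetime([
        "2012-05-01 18:20:27.167",
        "2012-05-01 19:16:08.070",
        "2012-05-02 02:23:53.250",
    ])
})
df2 = pd.DataFrame({
    "sitename": ["Google", "Google", "Google"],
    "date": ["2012-05-01", "2012-05-02", "2012-05-03"],
})

# Reduce both key columns to plain datetime.date objects so they compare equal.
df1["date"] = df1["creationdate"].dt.date
df2["date"] = pd.to_datetime(df2["date"]).dt.date

# Inner join: only dates present in both frames survive (2012-05-03 is dropped).
merged = df1.merge(df2, on="date", how="inner")
print(merged)
```

Because the join is an inner one, dates in the second frame that have no timestamps simply do not appear in the result.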

Then you can run analyses on df, for example:

In [100]: df['time_diff'] = df.creationdate.diff()

In [101]: df.time_diff
Out[101]: 
0                NaT
1    00:55:40.903000
2    00:03:59.810000
3    00:12:54.320000
4    00:02:06.973000
5    00:43:46.437000
6    00:07:31.967000
7    00:06:06.766000
8    00:06:56.914000
9    00:25:19.100000
10   00:49:28.626000
11   04:29:34.267000
12   00:16:34.393000
13   06:04:00.617000
Name: time_diff, dtype: timedelta64[ns]
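
Note that the plain `diff()` above also spans the day boundary (row 11 is measured against the last timestamp of the previous day). If differences within each date are wanted instead, one possible variation is to group by the date first. This is a sketch under that assumption; the column name `time_diff_in_day` is hypothetical and the sample data is abbreviated.

```python
import pandas as pd

df = pd.DataFrame({
    "creationdate": pd.to_datetime([
        "2012-05-01 18:20:27.167",
        "2012-05-01 19:16:08.070",
        "2012-05-02 02:23:53.250",
        "2012-05-02 02:40:27.643",
    ])
})
df["date"] = df["creationdate"].dt.date

# diff() inside each date group: the first row of every day has no
# predecessor within that day, so it becomes NaT.
df["time_diff_in_day"] = df.groupby("date")["creationdate"].diff()
print(df)
```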

Of course, your creationdate column needs to be datetime64[ns], not strings. Otherwise, convert it first with pd.DatetimeIndex(df.creationdate).
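
As a minimal sketch of that conversion, assuming the timestamps arrive as plain strings (for example after reading a CSV), `pd.to_datetime` can be used as an alternative to `pd.DatetimeIndex`. The sample values are taken from the first two rows of the question.

```python
import pandas as pd

# creationdate held as plain strings rather than datetimes.
df = pd.DataFrame({
    "creationdate": ["2012-05-01 18:20:27.167000", "2012-05-01 19:16:08.070000"]
})

# Parse the strings into a datetime column; diff() then yields timedeltas.
df["creationdate"] = pd.to_datetime(df["creationdate"])
print(df["creationdate"].dtype)

df["time_diff"] = df["creationdate"].diff()
print(df)
```

Checking the dtype before calling `diff()` is a cheap way to catch a column that is still stored as text.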