两个日期时间文件之间的熊猫系列

时间:2017-05-03 02:27:47

标签: python-2.7 pandas data-science

我的问题是关于使用熊猫时间序列。

我有一个文件(Spots)有一个pandas时间序列,一个月的数据范围为7.5秒。 示例:

2016-11-01 00:00:00,0
2016-11-01 00:00:07.500000,1
2016-11-01 00:00:15,2
2016-11-01 00:00:22.500000,3
2016-11-01 00:00:30,4

另一个文件(目标)只有时间信息。

示例:

2016-11-01 00:00:05
2016-11-01 00:00:07
2016-11-01 00:00:23
2016-11-01 00:00:25

我想检查目标日期时间属于哪个地点: 以上示例输出:

2016-11-01 00:00:00,0 '\t' count of targets in this spot = 2
2016-11-01 00:00:07.500000,1 '\t' count of targets in this spot = 0
2016-11-01 00:00:15,2 '\t' count of targets in this spot = 0
2016-11-01 00:00:22.500000,3 '\t' count of targets in this spot = 0
2016-11-01 00:00:30,4 '\t' count of targets in this spot = 2

提前非常感谢你。有点让我知道,如果这是明确的,否则我可以尝试解释更多。

4 个答案:

答案 0 :(得分:1)

这是我的建议。首先,将另一列添加到目标框架。这样可以在将来合并后识别目标:

target['T'] = 1

连接目标和点并按时间排序:

both = pd.concat([spots,target]).sort_values(0)
#                        0    1    T
#0 2016-11-01 00:00:00.000  0.0  NaN
#0 2016-11-01 00:00:05.000  NaN  1.0
#1 2016-11-01 00:00:07.000  NaN  1.0
#1 2016-11-01 00:00:07.500  1.0  NaN
#2 2016-11-01 00:00:15.000  2.0  NaN
#3 2016-11-01 00:00:22.500  3.0  NaN
#2 2016-11-01 00:00:23.000  NaN  1.0
#3 2016-11-01 00:00:25.000  NaN  1.0
#4 2016-11-01 00:00:30.000  4.0  NaN

预先填写地点ID:

both[1] = both[1].fillna(method='ffill').astype(int)
#                        0  1    T
#0 2016-11-01 00:00:00.000  0  NaN
#0 2016-11-01 00:00:05.000  0  1.0
#1 2016-11-01 00:00:07.000  0  1.0
#1 2016-11-01 00:00:07.500  1  NaN
#2 2016-11-01 00:00:15.000  2  NaN
#3 2016-11-01 00:00:22.500  3  NaN
#2 2016-11-01 00:00:23.000  3  1.0
#3 2016-11-01 00:00:25.000  3  1.0
#4 2016-11-01 00:00:30.000  4  NaN

选择原始目标行和列:

both[both['T']==1][[0,1]]
#                    0  1
#0 2016-11-01 00:00:05  0
#1 2016-11-01 00:00:07  0
#2 2016-11-01 00:00:23  3
#3 2016-11-01 00:00:25  3

如果您想计算点数目标,请使用groupby()

both.groupby(1).count()['T']
#1
#0    2
#1    0
#2    0
#3    2
#4    0

答案 1 :(得分:1)

使用np.searchsortedpd.value_counts以及其他一些内容的组合。

idx = Spots.index.to_series()
i = idx.values
t = Target.Date.values
m = pd.value_counts(i[i.searchsorted(t) - 1]).to_dict()
Spots.assign(TargetCount=idx.map(lambda x: m.get(x, 0)))

                         Value  TargetCount
Date                                       
2016-11-01 00:00:00.000      0            2
2016-11-01 00:00:07.500      1            0
2016-11-01 00:00:15.000      2            0
2016-11-01 00:00:22.500      3            2
2016-11-01 00:00:30.000      4            0

工作原理

  • idxSpots转化为pd.Series的索引,因为我想稍后使用pd.Series.map
  • i是我将使用
  • 执行numpy操作的基础searchsorted数组
  • ti相同... searchsorted
  • 的一部分
  • searchsorted将遍历右侧数组中的每个元素,并找到相对于右侧数组应插入该元素的位置。此信息可用于查找元素所属的“bin”。然后我减去一个以与适当的指数对齐
  • 然后我执行pd.value_counts来计算
  • 使用map构建新列。

设置

from io import StringIO
import pandas as pd

tx1 = """2016-11-01 00:00:00,0
2016-11-01 00:00:07.500000,1
2016-11-01 00:00:15,2
2016-11-01 00:00:22.500000,3
2016-11-01 00:00:30,4"""

tx2 = """2016-11-01 00:00:05
2016-11-01 00:00:07
2016-11-01 00:00:23
2016-11-01 00:00:25"""

Spots = pd.read_csv(StringIO(tx1), parse_dates=[0], index_col=0, names=['Date', 'Value'])

Target = pd.read_csv(StringIO(tx2), parse_dates=[0], names=['Date'])

答案 2 :(得分:0)

让我们使用merge_orderedfillnagroupby

输入:

df_spots

                     Date  Value
0 2016-11-01 00:00:00.000      0
1 2016-11-01 00:00:07.500      1
2 2016-11-01 00:00:15.000      2
3 2016-11-01 00:00:22.500      3
4 2016-11-01 00:00:30.000      4

df_target

                 Date
0 2016-11-01 00:00:05
1 2016-11-01 00:00:07
2 2016-11-01 00:00:23
3 2016-11-01 00:00:25

代码:

merged_df = pd.merge_ordered(df_spots, df_target, on = 'Date')
df_out = (merged_df.groupby(by=merged_df['Value']
               .fillna(method='ffill'), as_index=False)
               .agg({'Date':'first',
                     'Value':{'first':'first','count':lambda x:len(x)-1}}))

输出:

df_out

                     Date Value      
                    first first count
0 2016-11-01 00:00:00.000   0.0   2.0
1 2016-11-01 00:00:07.500   1.0   0.0
2 2016-11-01 00:00:15.000   2.0   0.0
3 2016-11-01 00:00:22.500   3.0   2.0
4 2016-11-01 00:00:30.000   4.0   0.0

答案 3 :(得分:0)

使用pandas merge_asof (注意,所有时间值必须排序 - 可能必须先排序):

设置~~~~~~~~

import pandas as pd    

# make date_range with 1 sec interval (fake targets)
rng = pd.date_range('2016-11-01', periods=100, freq='S')

# resample to make 7.5 sec intervals (fake spot bins)
ts = pd.Series(np.arange(100), index=rng)
ts_vals = ts.resample('7500L').asfreq().index

df_spots = pd.DataFrame({'spot': np.arange(len(ts_vals)), 'bin': ts_vals})
df_spots.head()
                      bin  spot
0 2016-11-01 00:00:00.000  0   
1 2016-11-01 00:00:07.500  1   
2 2016-11-01 00:00:15.000  2   
3 2016-11-01 00:00:22.500  3   
4 2016-11-01 00:00:30.000  4 

df_targets = pd.DataFrame(rng, columns=['tgt'])
df_targets.head()

                  tgt
0 2016-11-01 00:00:00
1 2016-11-01 00:00:01
2 2016-11-01 00:00:02
3 2016-11-01 00:00:03
4 2016-11-01 00:00:04

解决方案~~~~~~~

# this will produce spot membership for targets
df = pd.merge_asof(df_targets, df_spots, left_on='tgt', right_on='bin')
df.head()
                  tgt                     bin  spot
0 2016-11-01 00:00:00 2016-11-01 00:00:00.000  0   
1 2016-11-01 00:00:01 2016-11-01 00:00:00.000  0   
2 2016-11-01 00:00:02 2016-11-01 00:00:00.000  0   
3 2016-11-01 00:00:03 2016-11-01 00:00:00.000  0   
4 2016-11-01 00:00:04 2016-11-01 00:00:00.000  0   
5 2016-11-01 00:00:05 2016-11-01 00:00:00.000  0   
6 2016-11-01 00:00:06 2016-11-01 00:00:00.000  0   
7 2016-11-01 00:00:07 2016-11-01 00:00:00.000  0   
8 2016-11-01 00:00:08 2016-11-01 00:00:07.500  1   
9 2016-11-01 00:00:09 2016-11-01 00:00:07.500  1   

# for spot counts...
df_counts = pd.DataFrame(df.groupby('bin')['spot'].count())
df_counts.head()
                         spot
bin                          
2016-11-01 00:00:00.000  8   
2016-11-01 00:00:07.500  7   
2016-11-01 00:00:15.000  8   
2016-11-01 00:00:22.500  7   
2016-11-01 00:00:30.000  8