我的问题是关于使用熊猫时间序列。
我有一个文件(Spots)有一个pandas时间序列,一个月的数据范围为7.5秒。 示例:
2016-11-01 00:00:00,0
2016-11-01 00:00:07.500000,1
2016-11-01 00:00:15,2
2016-11-01 00:00:22.500000,3
2016-11-01 00:00:30,4
另一个文件(目标)只有时间信息。
示例:
2016-11-01 00:00:05
2016-11-01 00:00:07
2016-11-01 00:00:23
2016-11-01 00:00:25
我想检查目标日期时间属于哪个地点: 以上示例输出:
2016-11-01 00:00:00,0 '\t' count of targets in this spot = 2
2016-11-01 00:00:07.500000,1 '\t' count of targets in this spot = 0
2016-11-01 00:00:15,2 '\t' count of targets in this spot = 0
2016-11-01 00:00:22.500000,3 '\t' count of targets in this spot = 0
2016-11-01 00:00:30,4 '\t' count of targets in this spot = 2
提前非常感谢你。有点让我知道,如果这是明确的,否则我可以尝试解释更多。
答案 0 :(得分:1)
这是我的建议。首先,将另一列添加到目标框架。这样可以在将来合并后识别目标:
target['T'] = 1
连接目标和点并按时间排序:
both = pd.concat([spots,target]).sort_values(0)
# 0 1 T
#0 2016-11-01 00:00:00.000 0.0 NaN
#0 2016-11-01 00:00:05.000 NaN 1.0
#1 2016-11-01 00:00:07.000 NaN 1.0
#1 2016-11-01 00:00:07.500 1.0 NaN
#2 2016-11-01 00:00:15.000 2.0 NaN
#3 2016-11-01 00:00:22.500 3.0 NaN
#2 2016-11-01 00:00:23.000 NaN 1.0
#3 2016-11-01 00:00:25.000 NaN 1.0
#4 2016-11-01 00:00:30.000 4.0 NaN
预先填写地点ID:
both[1] = both[1].fillna(method='ffill').astype(int)
# 0 1 T
#0 2016-11-01 00:00:00.000 0 NaN
#0 2016-11-01 00:00:05.000 0 1.0
#1 2016-11-01 00:00:07.000 0 1.0
#1 2016-11-01 00:00:07.500 1 NaN
#2 2016-11-01 00:00:15.000 2 NaN
#3 2016-11-01 00:00:22.500 3 NaN
#2 2016-11-01 00:00:23.000 3 1.0
#3 2016-11-01 00:00:25.000 3 1.0
#4 2016-11-01 00:00:30.000 4 NaN
选择原始目标行和列:
both[both['T']==1][[0,1]]
# 0 1
#0 2016-11-01 00:00:05 0
#1 2016-11-01 00:00:07 0
#2 2016-11-01 00:00:23 3
#3 2016-11-01 00:00:25 3
如果您想计算点数目标,请使用groupby()
:
both.groupby(1).count()['T']
#1
#0 2
#1 0
#2 0
#3 2
#4 0
答案 1 :(得分:1)
使用np.searchsorted
和pd.value_counts
以及其他一些内容的组合。
idx = Spots.index.to_series()
i = idx.values
t = Target.Date.values
m = pd.value_counts(i[i.searchsorted(t) - 1]).to_dict()
Spots.assign(TargetCount=idx.map(lambda x: m.get(x, 0)))
Value TargetCount
Date
2016-11-01 00:00:00.000 0 2
2016-11-01 00:00:07.500 1 0
2016-11-01 00:00:15.000 2 0
2016-11-01 00:00:22.500 3 2
2016-11-01 00:00:30.000 4 0
工作原理
idx
是Spots
转化为pd.Series
的索引,因为我想稍后使用pd.Series.map
。i
是我将使用numpy
操作的基础searchsorted
数组
t
与i
相同... searchsorted
searchsorted
将遍历右侧数组中的每个元素,并找到相对于右侧数组应插入该元素的位置。此信息可用于查找元素所属的“bin”。然后我减去一个以与适当的指数对齐pd.value_counts
来计算map
构建新列。设置
from io import StringIO
import pandas as pd
tx1 = """2016-11-01 00:00:00,0
2016-11-01 00:00:07.500000,1
2016-11-01 00:00:15,2
2016-11-01 00:00:22.500000,3
2016-11-01 00:00:30,4"""
tx2 = """2016-11-01 00:00:05
2016-11-01 00:00:07
2016-11-01 00:00:23
2016-11-01 00:00:25"""
Spots = pd.read_csv(StringIO(tx1), parse_dates=[0], index_col=0, names=['Date', 'Value'])
Target = pd.read_csv(StringIO(tx2), parse_dates=[0], names=['Date'])
答案 2 :(得分:0)
让我们使用merge_ordered
,fillna
和groupby
:
输入:
df_spots
Date Value
0 2016-11-01 00:00:00.000 0
1 2016-11-01 00:00:07.500 1
2 2016-11-01 00:00:15.000 2
3 2016-11-01 00:00:22.500 3
4 2016-11-01 00:00:30.000 4
df_target
Date
0 2016-11-01 00:00:05
1 2016-11-01 00:00:07
2 2016-11-01 00:00:23
3 2016-11-01 00:00:25
代码:
merged_df = pd.merge_ordered(df_spots, df_target, on = 'Date')
df_out = (merged_df.groupby(by=merged_df['Value']
.fillna(method='ffill'), as_index=False)
.agg({'Date':'first',
'Value':{'first':'first','count':lambda x:len(x)-1}}))
输出:
df_out
Date Value
first first count
0 2016-11-01 00:00:00.000 0.0 2.0
1 2016-11-01 00:00:07.500 1.0 0.0
2 2016-11-01 00:00:15.000 2.0 0.0
3 2016-11-01 00:00:22.500 3.0 2.0
4 2016-11-01 00:00:30.000 4.0 0.0
答案 3 :(得分:0)
使用pandas merge_asof (注意,所有时间值必须排序 - 可能必须先排序):
设置~~~~~~~~
import pandas as pd
# make date_range with 1 sec interval (fake targets)
rng = pd.date_range('2016-11-01', periods=100, freq='S')
# resample to make 7.5 sec intervals (fake spot bins)
ts = pd.Series(np.arange(100), index=rng)
ts_vals = ts.resample('7500L').asfreq().index
df_spots = pd.DataFrame({'spot': np.arange(len(ts_vals)), 'bin': ts_vals})
df_spots.head()
bin spot
0 2016-11-01 00:00:00.000 0
1 2016-11-01 00:00:07.500 1
2 2016-11-01 00:00:15.000 2
3 2016-11-01 00:00:22.500 3
4 2016-11-01 00:00:30.000 4
df_targets = pd.DataFrame(rng, columns=['tgt'])
df_targets.head()
tgt
0 2016-11-01 00:00:00
1 2016-11-01 00:00:01
2 2016-11-01 00:00:02
3 2016-11-01 00:00:03
4 2016-11-01 00:00:04
解决方案~~~~~~~
# this will produce spot membership for targets
df = pd.merge_asof(df_targets, df_spots, left_on='tgt', right_on='bin')
df.head()
tgt bin spot
0 2016-11-01 00:00:00 2016-11-01 00:00:00.000 0
1 2016-11-01 00:00:01 2016-11-01 00:00:00.000 0
2 2016-11-01 00:00:02 2016-11-01 00:00:00.000 0
3 2016-11-01 00:00:03 2016-11-01 00:00:00.000 0
4 2016-11-01 00:00:04 2016-11-01 00:00:00.000 0
5 2016-11-01 00:00:05 2016-11-01 00:00:00.000 0
6 2016-11-01 00:00:06 2016-11-01 00:00:00.000 0
7 2016-11-01 00:00:07 2016-11-01 00:00:00.000 0
8 2016-11-01 00:00:08 2016-11-01 00:00:07.500 1
9 2016-11-01 00:00:09 2016-11-01 00:00:07.500 1
# for spot counts...
df_counts = pd.DataFrame(df.groupby('bin')['spot'].count())
df_counts.head()
spot
bin
2016-11-01 00:00:00.000 8
2016-11-01 00:00:07.500 7
2016-11-01 00:00:15.000 8
2016-11-01 00:00:22.500 7
2016-11-01 00:00:30.000 8