I have a text file with timestamps and labels:
0.000000 14.463912 tone
14.476425 16.891247 noise
16.891247 21.232923 not_music
21.232923 23.172289 not_music
23.172289 29.128018 not_music
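Say I specify a step size of 1 second. I want this list exploded into time ranges of 1-second duration that still carry the nearest label. How can I break the time ranges into smaller steps while keeping the labels accurate?
For example, if my step is 1 second, the first line would become ~14 lines, like: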
0.0 1.0 tone
1.0 2.0 tone
.
.
.
13.0 14.0 tone
[14.0, 14.46] and [14.47, 15.0]  # fall in a grey zone, don't know what to do
15.0 16.0 noise
So far, I have managed to read in the text file and store it in a list like this:
my_segments = []
with open('./data/annotate.txt', 'r') as f:
    for line in f:
        # each line is: start<TAB>end<TAB>label
        start, end, label = line.split("\t")
        start = float(start)
        end = float(end)
        label = label.strip()
        my_segments.append((start, end, label))
# print(my_segments)
for i in range(len(my_segments)):
    print(my_segments[i])
I looked at @Jared's answer at https://stackoverflow.com/a/18265979/4932791, which details how to use numpy to create a range between two numbers with a given step size. Like this:
>>> numpy.arange(11, 17, 0.5)
array([ 11. , 11.5, 12. , 12.5, 13. , 13.5, 14. , 14.5, 15. ,
15.5, 16. , 16.5])
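I can't figure out how to do something similar over a series of ranges.
The pseudocode/algorithm I have managed to come up with is:
I think that to handle the edge cases I should reduce the step size to 0.25 seconds or something similar, and set a condition: if the current step overlaps a segment by at least 40 or 50%, I assign that segment's label to it.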
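The overlap rule described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `label_by_overlap` and the `min_fraction` parameter are hypothetical, and a window whose best label covers less than `min_fraction` of the step is labelled `None` rather than guessed.

```python
import math

def label_by_overlap(segments, step, min_fraction=0.5):
    """Cut the annotated span into windows of `step` seconds and give each
    window the label that overlaps it most (hypothetical helper).

    Windows whose best label covers less than `min_fraction` of the step
    get None, which marks the "grey zone".
    """
    t0 = segments[0][0]
    t_end = segments[-1][1]
    n_windows = int(math.ceil((t_end - t0) / step))
    result = []
    for k in range(n_windows):
        w_start = t0 + k * step
        w_end = w_start + step
        # total overlap per label, so adjacent segments with the same label add up
        totals = {}
        for start, end, label in segments:
            overlap = min(end, w_end) - max(start, w_start)
            if overlap > 0:
                totals[label] = totals.get(label, 0.0) + overlap
        best = max(totals, key=totals.get) if totals else None
        if best is not None and totals[best] < min_fraction * step:
            best = None
        result.append((w_start, w_end, best))
    return result

my_segments = [
    (0.000000, 14.463912, "tone"),
    (14.476425, 16.891247, "noise"),
    (16.891247, 21.232923, "not_music"),
    (21.232923, 23.172289, "not_music"),
    (23.172289, 29.128018, "not_music"),
]
windows = label_by_overlap(my_segments, step=1.0)
```

With a 1-second step, the window from 14.0 to 15.0 overlaps "noise" slightly more than "tone", so it is labelled "noise". The last window, 29.0 to 30.0, is covered for only about 0.13 seconds and therefore stays `None`.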
Update: my non-working solution:
sliding_window = 0
# st, en = [0.0, 1.0]
jumbo = []
for i in range(len(hold_segments)):
    if sliding_window > hold_segments[i][0] and sliding_window+1 < hold_segments[i][1]:
        jumbo.append((sliding_window, sliding_window+1, hold_segments[i][2]))
        sliding_window = sliding_window+1
        print(hold_segments[i][2])
Answer 0 (score: 4)
Assuming you have loaded your data into a dataframe named df, for example:
               value        tag
index
0.000000   14.463912   ringtone
14.476425  16.891247      noise
16.891247  21.232923  not_music
21.232923  23.172289    music_B
23.172289  29.128018    music_A
reindex it on the midpoint of each 1-second step, forward-filling the labels:
import math
df = df.reindex(
    [i + 0.5 for i in range(int(math.floor(df.index.min())), int(math.ceil(df.value.max())))],
    method='pad'
)
Then recover the ranges using the following:
(df.index, df.value) = (df.index - 0.5, df.index + 0.5)
which gives:
       value        tag
index
0.0      1.0   ringtone
1.0      2.0   ringtone
2.0      3.0   ringtone
3.0      4.0   ringtone
4.0      5.0   ringtone
5.0      6.0   ringtone
6.0      7.0   ringtone
7.0      8.0   ringtone
8.0      9.0   ringtone
9.0     10.0   ringtone
10.0    11.0   ringtone
11.0    12.0   ringtone
12.0    13.0   ringtone
13.0    14.0   ringtone
14.0    15.0      noise
15.0    16.0      noise
16.0    17.0      noise
17.0    18.0  not_music
18.0    19.0  not_music
19.0    20.0  not_music
20.0    21.0  not_music
21.0    22.0    music_B
22.0    23.0    music_B
23.0    24.0    music_A
24.0    25.0    music_A
25.0    26.0    music_A
26.0    27.0    music_A
27.0    28.0    music_A
28.0    29.0    music_A
29.0    30.0    music_A
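For anyone who wants to try the reindexing approach end to end, here is a minimal self-contained sketch. It assumes the dataframe is built by hand from the sample rows; the names `segments`, `midpoints` and `expanded` are illustrative and not part of the answer.

```python
import math
import pandas as pd

# Sample rows from the answer: (start, end, label)
segments = [
    (0.000000, 14.463912, "ringtone"),
    (14.476425, 16.891247, "noise"),
    (16.891247, 21.232923, "not_music"),
    (21.232923, 23.172289, "music_B"),
    (23.172289, 29.128018, "music_A"),
]
# Start times become the index, end times the "value" column
df = pd.DataFrame(segments, columns=["index", "value", "tag"]).set_index("index")

# Midpoints of the 1-second steps covering the whole annotated span
first = int(math.floor(df.index.min()))
last = int(math.ceil(df["value"].max()))
midpoints = [i + 0.5 for i in range(first, last)]

# method='pad' gives each midpoint the row of the last start time at or before it
expanded = df.reindex(midpoints, method="pad")

# Turn each midpoint back into a [start, end) range of one second
mid = expanded.index.values
expanded["value"] = mid + 0.5
expanded.index = mid - 0.5
expanded.index.name = "index"
```

Note that `method='pad'` requires the original index to be sorted in increasing order, which holds here because the segments are listed chronologically.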
Answer 1 (score: 3)
I hope the comments make it clear what the code does. It also works for non-integer step sizes.
from __future__ import division
import numpy as np

my_segments = [
    (0, 14.46, "ringtone"),
    (14.46, 16.89, "noise"),
    (16.89, 21.23, "not_music"),
]

def expand(segments, stepsize):
    result = []
    levels = [x[0] for x in segments] + [segments[-1][1]]  # 0, 14.46, 16.89, 21.23
    i = 0  # tracks the index in segments that we need at the current step
    for step in np.arange(0, levels[-1], stepsize):
        # first check if the index needs to be updated
        # update when the next level will be reached at the next 'stepsize / 2'
        # (this effectively rounds to the nearest level)
        if i < len(levels) - 2 and (step + stepsize / 2) > levels[i+1]:
            i += 1
        # now append the values
        result.append((step, step + stepsize, segments[i][2]))
    return result

stepsize = 0.02
print(len(expand(my_segments, stepsize)))
print(my_segments[-1][1] / stepsize)
>>> 1062 # steps are rounded up
>>> 1061.5
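The same midpoint rule can also be written without an explicit index counter. The sketch below is an alternative, not the code from this answer: `expand_vectorized` is a hypothetical name, and it uses `np.searchsorted` to look up, for every step at once, which segment contains the step's midpoint. It starts at the first segment's start time instead of 0.

```python
import numpy as np

def expand_vectorized(segments, stepsize):
    """Label each step with the segment containing the step's midpoint
    (hypothetical helper; assumes segments are sorted by start time)."""
    starts = np.array([s[0] for s in segments], dtype=float)
    labels = [s[2] for s in segments]
    end = segments[-1][1]
    steps = np.arange(starts[0], end, stepsize)
    mids = steps + stepsize / 2.0
    # For each midpoint, count the later segment starts that lie strictly below it;
    # that count is the index of the segment the midpoint falls in.
    idx = np.searchsorted(starts[1:], mids, side="left")
    return [(float(a), float(a + stepsize), labels[i]) for a, i in zip(steps, idx)]

my_segments = [
    (0, 14.46, "ringtone"),
    (14.46, 16.89, "noise"),
    (16.89, 21.23, "not_music"),
]
rows = expand_vectorized(my_segments, 1.0)
```

Because the lookup is done per midpoint, a step that is longer than a whole segment can skip over that segment, whereas a counter that advances by at most one per step would lag behind.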