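I have been googling for a while and have not found a suitable solution. I have a time series with a few million rows that has a rather odd structure: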
VisitorID Time VisitDuration
1 01.01.2014 00:01 80 seconds
2 01.01.2014 00:03 37 seconds
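I want to know how many people are on the website at a given moment. To do that, I have to transform the data into something much larger: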
Time VisitorsPresent
01.01.2014 00:01 1
01.01.2014 00:02 1
01.01.2014 00:03 2
...
But doing it this way seems very inefficient. My code is:
dates = {}
for index, row in data.iterrows():
    for i in range(int(row["duration"])):
        # one dictionary entry per second of the visit
        key = index + pd.DateOffset(seconds=i)
        # default to 0 so the first visitor at a given second counts as 1
        dates[key] = dates.get(key, 0) + 1
I can then convert that to a Series and resample it:
result = pd.Series(dates)
# recent pandas versions dropped the how= argument; call the aggregation as a method
result.resample("5min").mean().plot()
Could you point me in the right direction?
EDIT ---
Hi HYRY, here is the head():
   uid        join_time_UTC  duration
0    1  2014-03-07 16:58:01      2953
1    2  2014-03-07 17:13:14      1954
2    3  2014-03-07 17:47:38       223
Answer 0 (score: 7)
First, create some dummy data:
import numpy as np
import pandas as pd
start = pd.Timestamp("2014-11-01")
end = pd.Timestamp("2014-11-02")
N = 100000
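# random start times as int64 nanoseconds since the epoch
t = np.random.randint(start.value, end.value, N, dtype=np.int64)
# truncate to whole seconds
t -= t % 1000000000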
start = pd.to_datetime(np.array(t, dtype="datetime64[ns]"))
duration = pd.to_timedelta(np.random.randint(100, 1000, N), unit="s")
df = pd.DataFrame({"start":start, "duration":duration})
df["end"] = df.start + df.duration
print(df.head(5))
Here is what the data looks like:
duration start end
0 00:13:45 2014-11-01 08:10:45 2014-11-01 08:24:30
1 00:04:07 2014-11-01 23:15:49 2014-11-01 23:19:56
2 00:09:26 2014-11-01 14:04:10 2014-11-01 14:13:36
3 00:10:20 2014-11-01 19:40:45 2014-11-01 19:51:05
4 00:02:48 2014-11-01 02:25:47 2014-11-01 02:28:35
Then do the value counts:
enter_count = df.start.value_counts()
exit_count = df.end.value_counts()
df2 = pd.concat([enter_count, exit_count], axis=1, keys=["enter", "exit"])
df2.fillna(0, inplace=True)
print(df2.head(5))
Here are the counts:
enter exit
2014-11-01 00:00:00 1 0
2014-11-01 00:00:02 2 0
2014-11-01 00:00:04 4 0
2014-11-01 00:00:06 2 0
2014-11-01 00:00:07 2 0
Finally, resample and plot:
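# net change in visitors at each timestamp
df2["diff"] = df2["enter"] - df2["exit"]
# sum the net change per 5-minute bin, then accumulate to get visitors present
counts = df2["diff"].resample("5min").sum().fillna(0).cumsum()
counts.plot()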
The output is: [plot of counts over time]
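As a small check of the same enter/exit idea, here is a minimal sketch that applies it to the three rows from the question's head() output. The column names (uid, join_time_UTC, duration) come from that output; the variable names events, net, present and per_minute are only illustrative and are not part of the answer above. It assumes duration is given in seconds.

```python
import pandas as pd

# The three rows shown in the question's head() output.
data = pd.DataFrame({
    "uid": [1, 2, 3],
    "join_time_UTC": pd.to_datetime([
        "2014-03-07 16:58:01",
        "2014-03-07 17:13:14",
        "2014-03-07 17:47:38",
    ]),
    "duration": [2953, 1954, 223],  # assumed to be seconds
})

start = data["join_time_UTC"]
end = start + pd.to_timedelta(data["duration"], unit="s")

# +1 at every arrival, -1 at every departure.
events = pd.concat([
    pd.Series(1, index=pd.DatetimeIndex(start)),
    pd.Series(-1, index=pd.DatetimeIndex(end)),
])

# Net change per distinct timestamp, in chronological order.
net = events.groupby(level=0).sum().sort_index()

# Running total: visitors present from each event time onwards.
present = net.cumsum()

# Put the result on a regular 1-minute grid: keep the last count seen
# in each minute and carry it forward through minutes with no events.
per_minute = present.resample("1min").last().ffill()

print(present)
```

Because only arrivals and departures are stored, the intermediate data has at most two entries per visit, rather than one entry per second of every visit as in the loop from the question.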