A pandas DataFrame has a resample method, as shown below, and I would like to achieve the equivalent by querying BigQuery.

Example in pandas

I currently have data like the following. Assume the same data is stored in BigQuery.
In [2]: df.head()
Out[2]:
Open High Low Close Volume
Gmt time
2016-01-03 22:00:00 1.08730 1.08730 1.08702 1.08714 8.62
2016-01-03 22:01:00 1.08718 1.08718 1.08713 1.08713 3.75
2016-01-03 22:02:00 1.08714 1.08721 1.08714 1.08720 4.60
2016-01-03 22:03:00 1.08717 1.08721 1.08714 1.08721 7.57
2016-01-03 22:04:00 1.08718 1.08718 1.08711 1.08711 5.52
I then resample the data to a 5-minute frequency with the DataFrame.
In [3]: ohlcv = {
   ...:     'Open': 'first',
   ...:     'High': 'max',
   ...:     'Low': 'min',
   ...:     'Close': 'last',
   ...:     'Volume': 'sum'
   ...: }
   ...: df = df.resample('5T').apply(ohlcv)  # 5-minute frequency
   ...: df = df[['Open', 'High', 'Low', 'Close', 'Volume']]  # reorder columns
   ...: df.head()
Out[3]:
Open High Low Close Volume
Gmt time
2016-01-03 22:00:00 1.08730 1.08730 1.08702 1.08711 30.06
2016-01-03 22:05:00 1.08711 1.08727 1.08709 1.08709 190.63
2016-01-03 22:10:00 1.08708 1.08709 1.08662 1.08666 168.79
2016-01-03 22:15:00 1.08666 1.08674 1.08666 1.08667 223.83
2016-01-03 22:20:00 1.08667 1.08713 1.08666 1.08667 170.17
This can be done after fetching the 1-minute data from BigQuery, but is there a way to do the resampling directly with a BigQuery query?
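For reference, the client-side workaround mentioned above (pull the 1-minute rows out of BigQuery, then resample in pandas) might look roughly like this. It is only a sketch: it assumes the pandas-gbq package is installed and authenticated, and the project id, table name and ts column are hypothetical placeholders.

import pandas as pd

# Sketch of the client-side workaround; `my-project` and
# `my_dataset.prices` are placeholder names.
sql = """
SELECT ts, Open, High, Low, Close, Volume
FROM `my_dataset.prices`
ORDER BY ts
"""
df = pd.read_gbq(sql, project_id='my-project', dialect='standard')

df['ts'] = pd.to_datetime(df['ts'])
df = df.set_index('ts')

ohlcv = {'Open': 'first', 'High': 'max', 'Low': 'min',
         'Close': 'last', 'Volume': 'sum'}
df5 = df.resample('5T').apply(ohlcv)[['Open', 'High', 'Low', 'Close', 'Volume']]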
A more detailed description of the pandas DataFrame resample follows.
Open High Low Close Volume
Gmt time
# 1 minute frequency data stored in bigquery
2016-01-03 22:00:00 1.08730 1.08730 1.08702 1.08714 8.62
2016-01-03 22:01:00 1.08718 1.08718 1.08713 1.08713 3.75
2016-01-03 22:02:00 1.08714 1.08721 1.08714 1.08720 4.60
2016-01-03 22:03:00 1.08717 1.08721 1.08714 1.08721 7.57
2016-01-03 22:04:00 1.08718 1.08718 1.08711 1.08711 5.52
2016-01-03 22:05:00 1.08711 1.08714 1.08711 1.08711 27.47
2016-01-03 22:06:00 1.08717 1.08720 1.08711 1.08711 21.58
2016-01-03 22:07:00 1.08713 1.08718 1.08712 1.08715 28.12
2016-01-03 22:08:00 1.08714 1.08723 1.08712 1.08718 49.74
2016-01-03 22:09:00 1.08722 1.08727 1.08709 1.08709 63.72
# expected query result
# above will be resampled into below..
2016-01-03 22:00:00 1.08730 1.08730 1.08702 1.08711 30.06
2016-01-03 22:05:00 1.08711 1.08727 1.08709 1.08709 190.63
# method to resample 'first' 'max' 'min' 'last' 'sum'
The first 5 rows of the 1-minute data (22:00 through 22:04) are resampled into a single row (22:00), and the next 5 rows (22:05 through 22:09) go into (22:05). The resampling methods are 'first', 'max', 'min', 'last' and 'sum' respectively: 'first' takes the first value of the group (here, a group of 5 rows), 'max' the maximum, 'min' the minimum, 'last' the last value, and 'sum' the sum over the group. For details, see the pandas documentation.
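As a quick sanity check, the expected 22:00 row can be reproduced locally from the five 1-minute rows above with plain pandas (the values are copied from the example data):

import pandas as pd

# The five 1-minute rows for 22:00-22:04 from the example above.
group = pd.DataFrame({
    'Open':   [1.08730, 1.08718, 1.08714, 1.08717, 1.08718],
    'High':   [1.08730, 1.08718, 1.08721, 1.08721, 1.08718],
    'Low':    [1.08702, 1.08713, 1.08714, 1.08714, 1.08711],
    'Close':  [1.08714, 1.08713, 1.08720, 1.08721, 1.08711],
    'Volume': [8.62, 3.75, 4.60, 7.57, 5.52],
})

# 'first' / 'max' / 'min' / 'last' / 'sum' per column gives the single
# expected 22:00 row: 1.08730, 1.08730, 1.08702, 1.08711, 30.06
print(group['Open'].iloc[0], group['High'].max(), group['Low'].min(),
      group['Close'].iloc[-1], round(group['Volume'].sum(), 2))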
Answer (score: 4)
Try the following:
#standardSQL
SELECT * EXCEPT(step)
FROM (
SELECT *, TIMESTAMP_DIFF(TIMESTAMP(ts),
TIMESTAMP(MIN(ts) OVER(ORDER BY ts)), MINUTE) AS step
FROM yourTable
)
WHERE MOD(step, 5) = 0
-- ORDER BY ts
You can control the bucket size by changing the 5 in MOD(step, 5), and the time unit via MINUTE in TIMESTAMP_DIFF. Note that this first query only keeps the rows that fall on the 5-minute boundaries; the aggregated version that mirrors pandas' resample follows below.
You can play with this using the following dummy data:

WITH yourTable AS (
SELECT '2016-01-03 22:00:00' AS ts, 1.08730 AS Open, 1.08730 AS High, 1.08702 AS Low, 1.08714 AS Close, 8.62 AS Volume UNION ALL
SELECT '2016-01-03 22:01:00', 1.08718, 1.08718, 1.08713, 1.08713, 3.75 UNION ALL
SELECT '2016-01-03 22:02:00', 1.08714, 1.08721, 1.08714, 1.08720, 4.60 UNION ALL
SELECT '2016-01-03 22:03:00', 1.08717, 1.08721, 1.08714, 1.08721, 7.57 UNION ALL
SELECT '2016-01-03 22:04:00', 1.08718, 1.08718, 1.08711, 1.08711, 5.52 UNION ALL
SELECT '2016-01-03 22:05:00', 1.08718, 1.08718, 1.08713, 1.08713, 3.75 UNION ALL
SELECT '2016-01-03 22:06:00', 1.08714, 1.08721, 1.08714, 1.08720, 4.60 UNION ALL
SELECT '2016-01-03 22:07:00', 1.08717, 1.08721, 1.08714, 1.08721, 7.57 UNION ALL
SELECT '2016-01-03 22:08:00', 1.08718, 1.08718, 1.08711, 1.08711, 5.52 UNION ALL
SELECT '2016-01-03 22:09:00', 1.08718, 1.08718, 1.08713, 1.08713, 3.75 UNION ALL
SELECT '2016-01-03 22:10:00', 1.08714, 1.08721, 1.08714, 1.08720, 4.60 UNION ALL
SELECT '2016-01-03 22:11:00', 1.08717, 1.08721, 1.08714, 1.08721, 7.57 UNION ALL
SELECT '2016-01-03 22:12:00', 1.08718, 1.08718, 1.08711, 1.08711, 5.52
)
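To try this end to end from Python, the WITH block above can simply be prepended to any of the queries and submitted through the BigQuery client. A rough sketch, assuming the google-cloud-bigquery package is installed and default credentials are configured (only the first two dummy rows are repeated here for brevity):

from google.cloud import bigquery

# Sketch only: assumes google-cloud-bigquery is installed and default
# credentials point at a project that can run queries.
client = bigquery.Client()

dummy_data = """
WITH yourTable AS (
  SELECT '2016-01-03 22:00:00' AS ts, 1.08730 AS Open, 1.08730 AS High,
         1.08702 AS Low, 1.08714 AS Close, 8.62 AS Volume UNION ALL
  SELECT '2016-01-03 22:01:00', 1.08718, 1.08718, 1.08713, 1.08713, 3.75
  -- add the remaining dummy rows from the WITH block above as needed
)
"""

query = """
SELECT * EXCEPT(step)
FROM (
  SELECT *, TIMESTAMP_DIFF(TIMESTAMP(ts),
    TIMESTAMP(MIN(ts) OVER(ORDER BY ts)), MINUTE) AS step
  FROM yourTable
)
WHERE MOD(step, 5) = 0
ORDER BY ts
"""

# Prepend the WITH block so the query runs without any real table.
df = client.query(dummy_data + query).to_dataframe()
print(df)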
The following version implements pandas' resample (per the logic in the updated question):
#standardSQL
SELECT
  MIN(ts) AS ts,
  ARRAY_AGG(Open ORDER BY ts)[OFFSET (0)] AS Open,         -- 'first'
  MAX(High) AS High,                                       -- 'max'
  MIN(Low) AS Low,                                         -- 'min'
  ARRAY_AGG(Close ORDER BY ts DESC)[OFFSET (0)] AS Close,  -- 'last'
  SUM(Volume) AS Volume                                     -- 'sum'
FROM (
  SELECT *, DIV(TIMESTAMP_DIFF(TIMESTAMP(ts),
    TIMESTAMP(MIN(ts) OVER(ORDER BY ts)), MINUTE), 5) AS grp
  FROM yourTable
)
GROUP BY grp
-- ORDER BY ts
Or an even more simplified version, with just a single GROUP BY and no window function. It assumes your data is later than '2000-01-01 00:00:00'; otherwise adjust the epoch accordingly:
#standardSQL
SELECT
MIN(ts) AS ts,
ARRAY_AGG(Open ORDER BY ts)[OFFSET (0)] AS Open,
MAX(High) AS High,
MIN(Low) AS Low,
ARRAY_AGG(Close ORDER BY ts DESC)[OFFSET (0)] AS Close,
SUM(Volume) AS Volume
FROM yourTable
GROUP BY DIV(TIMESTAMP_DIFF(TIMESTAMP(ts),
TIMESTAMP('2000-01-01 00:00:00'), MINUTE), 5)
-- ORDER BY ts
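A note on this last variant: because the group key is counted from a fixed epoch ('2000-01-01 00:00:00') rather than from the first row, the buckets line up with clock-aligned 5-minute bins, which for a 5-minute frequency is also how df.resample('5T') bins by default. A small local sketch of the same bucketing logic (pure Python, no BigQuery needed):

from datetime import datetime

# Bucketing used by the last query: minutes since a fixed epoch,
# integer-divided by 5, gives the group key for each row.
EPOCH = datetime(2000, 1, 1)

def bucket(ts_str):
    ts = datetime.strptime(ts_str, '%Y-%m-%d %H:%M:%S')
    minutes = int((ts - EPOCH).total_seconds() // 60)
    return minutes // 5  # same as DIV(TIMESTAMP_DIFF(...), 5)

# Rows 22:00-22:04 share one key, 22:05-22:09 the next, matching the
# 5-minute bins produced by df.resample('5T').
for t in ['2016-01-03 22:00:00', '2016-01-03 22:04:00',
          '2016-01-03 22:05:00', '2016-01-03 22:09:00']:
    print(t, bucket(t))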