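I want to find out how much time each Id spent at its starting location.
For example, in the dataset below, the starting Geohash for Id 286 is "abcdef", and that Geohash appears in 3 rows for Id 286. The total time spent by Id 286 is therefore (2017-02-13 12:33:02.063 UTC - 2017-02-13 12:24:36 UTC) and (2017-02-13 12:34:29 UTC - 2017-02-13 12:33:08 UTC).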
   Id   DateTime                     Latitude   Longitude   Geohash
0  286  2017-02-13 12:24:36 UTC      40.769230  -73.01205   abcdef
1  286  2017-02-13 12:33:02.063 UTC  40.769230  -73.01202   abcdef
2  286  2017-02-13 12:33:05.063 UTC  40.769230  -73.01202   cvzvvv
3  286  2017-02-13 12:33:08 UTC      40.769280  -73.01212   abcdef
4  286  2017-02-13 12:34:29 UTC      40.769306  -73.01207   hsffds
5  368  2017-02-13 00:23:07.063 UTC  33.392820  -111.8262   weruio
6  141  2017-02-13 00:00:41 UTC      33.287117  -111.84150  oqruqq
I'm not sure whether pandas DataFrames have any built-in functionality for this.
Any help would be greatly appreciated!
Answer 0 (score: 1)
The following is for BigQuery Standard SQL:
#standardSQL
SELECT
  Id, Geohash, MIN(DateTime) AS StartDateTime, SUM(TimeSpent) AS TimeSpent
FROM (
  SELECT
    Id, Geohash, DateTime,
    TIMESTAMP_DIFF(LEAD(DateTime) OVER(PARTITION BY Id ORDER BY DateTime), DateTime, SECOND) AS TimeSpent,
    FIRST_VALUE(Geohash) OVER(PARTITION BY Id ORDER BY DateTime) AS FirstGeohash
  FROM yourTable
)
WHERE Geohash = FirstGeohash
GROUP BY Id, Geohash
You can test it with the dummy data from your example:
#standardSQL
WITH yourTable AS (
  SELECT 286 AS Id, TIMESTAMP '2017-02-13 12:24:36 UTC' AS DateTime, 40.769230 AS Latitude, -73.01205 AS Longitude, 'abcdef' AS Geohash UNION ALL
  SELECT 286, TIMESTAMP '2017-02-13 12:33:02.063 UTC', 40.769230, -73.01202, 'abcdef' UNION ALL
  SELECT 286, TIMESTAMP '2017-02-13 12:33:05.063 UTC', 40.769230, -73.01202, 'cvzvvv' UNION ALL
  SELECT 286, TIMESTAMP '2017-02-13 12:33:08 UTC', 40.769280, -73.01212, 'abcdef' UNION ALL
  SELECT 286, TIMESTAMP '2017-02-13 12:34:29 UTC', 40.769306, -73.01207, 'hsffds' UNION ALL
  SELECT 368, TIMESTAMP '2017-02-13 00:23:07.063 UTC', 33.392820, -111.8262, 'weruio' UNION ALL
  SELECT 141, TIMESTAMP '2017-02-13 00:00:41 UTC', 33.287117, -111.84150, 'oqruqq'
)
SELECT
  Id, Geohash, MIN(DateTime) AS StartDateTime, SUM(TimeSpent) AS TimeSpent
FROM (
  SELECT
    Id, Geohash, DateTime,
    TIMESTAMP_DIFF(LEAD(DateTime) OVER(PARTITION BY Id ORDER BY DateTime), DateTime, SECOND) AS TimeSpent,
    FIRST_VALUE(Geohash) OVER(PARTITION BY Id ORDER BY DateTime) AS FirstGeohash
  FROM yourTable
)
WHERE Geohash = FirstGeohash
GROUP BY Id, Geohash
The result is:
Id   Geohash  StartDateTime            TimeSpent
286  abcdef   2017-02-13 12:24:36 UTC  590
368  weruio   2017-02-13 00:23:07 UTC  null
141  oqruqq   2017-02-13 00:00:41 UTC  null
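Please note: the 590 above is the time in seconds summed over all three "abcdef" rows for Id 286, not just the two intervals stated in your question. I assume that was simply an oversight on your side.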
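To make the 590 easier to check by hand, here is a minimal Python sketch of the same arithmetic for Id 286, using only the standard library. It assumes each interval is reduced to whole seconds before summing, which is consistent with the 590 shown above. The names `rows`, `fmt` and `intervals` are illustrative, and the `.000` padding on some timestamps is added here only so that one format string parses every row.

```python
from datetime import datetime

# (timestamp, geohash) rows for Id 286, copied from the question, in time order.
rows = [
    ("2017-02-13 12:24:36.000", "abcdef"),
    ("2017-02-13 12:33:02.063", "abcdef"),
    ("2017-02-13 12:33:05.063", "cvzvvv"),
    ("2017-02-13 12:33:08.000", "abcdef"),
    ("2017-02-13 12:34:29.000", "hsffds"),
]

fmt = "%Y-%m-%d %H:%M:%S.%f"
times = [datetime.strptime(stamp, fmt) for stamp, _ in rows]
first_geohash = rows[0][1]  # geohash of the earliest row

# For every row at the starting geohash, take the whole seconds until the next row.
# The last row has no successor, so it contributes nothing.
intervals = [
    int((times[i + 1] - times[i]).total_seconds())
    for i in range(len(rows) - 1)
    if rows[i][1] == first_geohash
]

print(intervals, sum(intervals))  # prints: [506, 3, 81] 590
```

The middle value, 3 seconds, is the interval the question left out: the time between the second "abcdef" row and the move to "cvzvvv".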
Answer 1 (score: 0)
If I understand correctly, you need something like this:
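def timedelta(df):
    # Sort chronologically, then subtract first from last so the span is non-negative.
    df = df.sort_values(by='DateTime')
    return df.iloc[-1]['DateTime'] - df.iloc[0]['DateTime']

# Assumes df['DateTime'] is already a datetime column (e.g. converted with pd.to_datetime).
df.groupby(['Id', 'Geohash']).apply(timedelta)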
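Note that grouping by both `Id` and `Geohash` measures the span between the first and last visit to every geohash, which also counts time spent elsewhere in between. If the goal is only the time spent at each Id's starting geohash, a possible pandas sketch is shown below. It is one way to approach it under stated assumptions, not a definitive solution: the column names `TimeSpent` and `FirstGeohash` are illustrative, the sample rows are copied from the question, and timestamps are padded with `.000` only so that they all parse with a single inferred format.

```python
import pandas as pd

# Sample rows copied from the question (Latitude/Longitude omitted, they are not needed).
df = pd.DataFrame({
    "Id": [286, 286, 286, 286, 286, 368, 141],
    "DateTime": pd.to_datetime([
        "2017-02-13 12:24:36.000",
        "2017-02-13 12:33:02.063",
        "2017-02-13 12:33:05.063",
        "2017-02-13 12:33:08.000",
        "2017-02-13 12:34:29.000",
        "2017-02-13 00:23:07.063",
        "2017-02-13 00:00:41.000",
    ], utc=True),
    "Geohash": ["abcdef", "abcdef", "cvzvvv", "abcdef", "hsffds", "weruio", "oqruqq"],
})

df = df.sort_values(["Id", "DateTime"])
by_id = df.groupby("Id")

# Seconds until the same Id's next row; NaN for each Id's last row.
df["TimeSpent"] = (by_id["DateTime"].shift(-1) - df["DateTime"]).dt.total_seconds()

# Geohash of each Id's earliest row, broadcast to all of that Id's rows.
df["FirstGeohash"] = by_id["Geohash"].transform("first")

# Keep only rows at the starting geohash and sum their durations per Id.
# min_count=1 leaves NaN (instead of 0) for Ids that have no following row.
at_start = df[df["Geohash"] == df["FirstGeohash"]]
result = at_start.groupby("Id")["TimeSpent"].sum(min_count=1)

print(result)
```

For the sample data this gives about 590.063 seconds for Id 286, because fractional seconds are kept, and NaN for Ids 141 and 368, which have only a single row each.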