Question

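I want to know how long each Id spent at its starting location.

For example, in the dataset below, the starting Geohash for Id 286 is "abcdef". For Id 286, Geohash "abcdef" appears in 3 rows. So the total time spent by Id 286 is the sum of (2017-02-13 12:33:02.063 UTC - 2017-02-13 12:24:36 UTC) and (2017-02-13 12:34:29 UTC - 2017-02-13 12:33:08 UTC).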

        Id         DateTime                      Latitude     Longitude   Geohash
      0 286        2017-02-13 12:24:36 UTC       40.769230  -73.01205     abcdef
      1 286        2017-02-13 12:33:02.063 UTC   40.769230  -73.01202     abcdef
      2 286        2017-02-13 12:33:05.063 UTC   40.769230  -73.01202     cvzvvv
      3 286        2017-02-13 12:33:08 UTC       40.769280  -73.01212     abcdef
      4 286        2017-02-13 12:34:29 UTC       40.769306  -73.01207     hsffds
      5 368        2017-02-13 00:23:07.063 UTC   33.392820  -111.8262     weruio
      6 141        2017-02-13 00:00:41 UTC       33.287117  -111.84150    oqruqq

I'm not sure whether a pandas DataFrame has any functionality that can do this.

Any help would be greatly appreciated!

Answer 1

The following is for BigQuery Standard SQL:

#standardSQL
SELECT 
  Id, Geohash, MIN(DateTime) AS StartDateTime, SUM(TimeSpent) AS TimeSpent
FROM (
  SELECT 
    Id, Geohash, DateTime, 
    TIMESTAMP_DIFF(LEAD(DateTime) OVER(PARTITION BY Id ORDER BY DateTime), DateTime, SECOND) AS TimeSpent,
    FIRST_VALUE(Geohash) OVER(PARTITION BY Id ORDER BY DateTime) AS FirstGeohash
  FROM yourTable
)
WHERE Geohash = FirstGeohash
GROUP BY Id, Geohash

You can test it with the dummy data from your example:

#standardSQL
WITH yourTable AS (
  SELECT 286 AS Id, TIMESTAMP '2017-02-13 12:24:36 UTC' AS DateTime, 40.769230 AS Latitude, -73.01205 AS Longitude, 'abcdef' AS Geohash UNION ALL
  SELECT 286, TIMESTAMP '2017-02-13 12:33:02.063 UTC', 40.769230, -73.01202, 'abcdef' UNION ALL
  SELECT 286, TIMESTAMP '2017-02-13 12:33:05.063 UTC', 40.769230, -73.01202, 'cvzvvv' UNION ALL
  SELECT 286, TIMESTAMP '2017-02-13 12:33:08 UTC', 40.769280, -73.01212, 'abcdef' UNION ALL
  SELECT 286, TIMESTAMP '2017-02-13 12:34:29 UTC', 40.769306, -73.01207, 'hsffds' UNION ALL
  SELECT 368, TIMESTAMP '2017-02-13 00:23:07.063 UTC', 33.392820, -111.8262, 'weruio' UNION ALL
  SELECT 141, TIMESTAMP '2017-02-13 00:00:41 UTC', 33.287117, -111.84150, 'oqruqq'
)
SELECT 
  Id, Geohash, MIN(DateTime) AS StartDateTime, SUM(TimeSpent) AS TimeSpent
FROM (
  SELECT 
    Id, Geohash, DateTime, 
    TIMESTAMP_DIFF(LEAD(DateTime) OVER(PARTITION BY Id ORDER BY DateTime), DateTime, SECOND) AS TimeSpent,
    FIRST_VALUE(Geohash) OVER(PARTITION BY Id ORDER BY DateTime) AS FirstGeohash
  FROM yourTable
)
WHERE Geohash = FirstGeohash
GROUP BY Id, Geohash

The result is as follows:

Id  Geohash     StartDateTime           TimeSpent    
286  abcdef     2017-02-13 12:24:36 UTC       590    
368  weruio     2017-02-13 00:23:07 UTC      null    
141  oqruqq     2017-02-13 00:00:41 UTC      null

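Please note: the 590 above is the total time in seconds over all three "abcdef" rows (506 + 3 + 81), not just the two intervals stated in your question. I think that was simply a mistake on your side.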
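Since the question mentions pandas, the same logic can be sketched there as well. This is only a rough equivalent under assumptions, not a tested port: the DataFrame is rebuilt by hand from the question's sample, DateTime is parsed into timezone-aware timestamps, and the column names TimeSpent and FirstGeohash simply mirror the aliases in the query. Here groupby(...).shift(-1) stands in for LEAD and transform("first") for FIRST_VALUE.

```python
import pandas as pd

# Sample rows copied from the question (timestamps written with uniform precision).
df = pd.DataFrame({
    "Id": [286, 286, 286, 286, 286, 368, 141],
    "DateTime": pd.to_datetime([
        "2017-02-13 12:24:36.000", "2017-02-13 12:33:02.063",
        "2017-02-13 12:33:05.063", "2017-02-13 12:33:08.000",
        "2017-02-13 12:34:29.000", "2017-02-13 00:23:07.063",
        "2017-02-13 00:00:41.000",
    ], utc=True),
    "Geohash": ["abcdef", "abcdef", "cvzvvv", "abcdef",
                "hsffds", "weruio", "oqruqq"],
})

df = df.sort_values(["Id", "DateTime"])

# Counterpart of LEAD(DateTime) OVER (PARTITION BY Id ORDER BY DateTime):
# the timestamp of the next row for the same Id (NaT on the last row).
next_time = df.groupby("Id")["DateTime"].shift(-1)
df["TimeSpent"] = (next_time - df["DateTime"]).dt.total_seconds()

# Counterpart of FIRST_VALUE(Geohash) OVER (PARTITION BY Id ORDER BY DateTime).
df["FirstGeohash"] = df.groupby("Id")["Geohash"].transform("first")

# Keep only rows at the starting Geohash and add up the seconds.
# min_count=1 leaves NaN for Ids that have a single row, similar to SQL null.
at_start = df[df["Geohash"] == df["FirstGeohash"]]
result = (
    at_start.groupby(["Id", "Geohash"])["TimeSpent"]
    .sum(min_count=1)
    .reset_index()
)
print(result)
```

Unlike TIMESTAMP_DIFF(..., SECOND), which returns whole seconds, this sketch keeps fractional seconds, so Id 286 comes out at roughly 590.063 rather than 590.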

Answer 2

If I understand correctly, you need something like this:

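def time_span(group):
    # Order the rows chronologically, then subtract the earliest from the latest
    # (the other way round would give a negative duration).
    group = group.sort_values(by='DateTime')
    return group.iloc[-1]['DateTime'] - group.iloc[0]['DateTime']

# Assumes df['DateTime'] already holds datetimes, not strings.
df.groupby(['Id', 'Geohash']).apply(time_span)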
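To see what a first-to-last difference per (Id, Geohash) gives on the question's data, here is a small check restricted to the rows for Id 286. It is a sketch under assumptions: the DataFrame is typed in by hand, and the span is computed as group max minus group min, which should match sorting and subtracting the first row from the last.

```python
import pandas as pd

# Rows for Id 286 copied from the question.
df = pd.DataFrame({
    "Id": [286, 286, 286, 286, 286],
    "DateTime": pd.to_datetime([
        "2017-02-13 12:24:36.000", "2017-02-13 12:33:02.063",
        "2017-02-13 12:33:05.063", "2017-02-13 12:33:08.000",
        "2017-02-13 12:34:29.000",
    ], utc=True),
    "Geohash": ["abcdef", "abcdef", "cvzvvv", "abcdef", "hsffds"],
})

# Latest minus earliest timestamp within each (Id, Geohash) group.
grouped = df.groupby(["Id", "Geohash"])["DateTime"]
spans = grouped.max() - grouped.min()
print(spans)
```

For (286, "abcdef") this gives 8 minutes 32 seconds (12:24:36 to 12:33:08). That span includes the 3 seconds spent in "cvzvvv" and leaves out the time after 12:33:08, so it may not match the per-visit total the question asks for.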

Find the time spent by an Id at each location
