Find the time each Id spent at each location

Time: 2017-03-16 03:35:30

Tags: pandas google-bigquery

I want to find out how long each Id spent at its starting location.

For example, in the dataset below, the starting Geohash for Id 286 is "abcdef". For Id 286, the Geohash "abcdef" appears in 3 rows. So the total time spent by Id 286 is the sum of (2017-02-13 12:33:02.063 UTC - 2017-02-13 12:24:36 UTC) and (2017-02-13 12:34:29 UTC - 2017-02-13 12:33:08 UTC).

        Id         DateTime                      Latitude     Longitude   Geohash
      0 286        2017-02-13 12:24:36 UTC       40.769230  -73.01205     abcdef
      1 286        2017-02-13 12:33:02.063 UTC   40.769230  -73.01202     abcdef
      2 286        2017-02-13 12:33:05.063 UTC   40.769230  -73.01202     cvzvvv
      3 286        2017-02-13 12:33:08 UTC       40.769280  -73.01212     abcdef
      4 286        2017-02-13 12:34:29 UTC       40.769306  -73.01207     hsffds
      5 368        2017-02-13 00:23:07.063 UTC   33.392820  -111.8262     weruio
      6 141        2017-02-13 00:00:41 UTC       33.287117  -111.84150    oqruqq

I am not sure whether pandas DataFrames offer any functionality for this.

Any help would be greatly appreciated!

2 Answers:

Answer 0: (score: 1)

The following is for BigQuery Standard SQL:

  
#standardSQL
SELECT 
  Id, Geohash, MIN(DateTime) AS StartDateTime, SUM(TimeSpent) AS TimeSpent
FROM (
  SELECT 
    Id, Geohash, DateTime, 
    TIMESTAMP_DIFF(LEAD(DateTime) OVER(PARTITION BY Id ORDER BY DateTime), DateTime, SECOND) AS TimeSpent,
    FIRST_VALUE(Geohash) OVER(PARTITION BY Id ORDER BY DateTime) AS FirstGeohash
  FROM yourTable
)
WHERE Geohash = FirstGeohash
GROUP BY Id, Geohash  

You can test it with the dummy data from your example:

#standardSQL
WITH yourTable AS (
  SELECT 286 AS Id, TIMESTAMP '2017-02-13 12:24:36 UTC' AS DateTime, 40.769230 AS Latitude, -73.01205 AS Longitude, 'abcdef' AS Geohash UNION ALL
  SELECT 286, TIMESTAMP '2017-02-13 12:33:02.063 UTC', 40.769230, -73.01202, 'abcdef' UNION ALL
  SELECT 286, TIMESTAMP '2017-02-13 12:33:05.063 UTC', 40.769230, -73.01202, 'cvzvvv' UNION ALL
  SELECT 286, TIMESTAMP '2017-02-13 12:33:08 UTC', 40.769280, -73.01212, 'abcdef' UNION ALL
  SELECT 286, TIMESTAMP '2017-02-13 12:34:29 UTC', 40.769306, -73.01207, 'hsffds' UNION ALL
  SELECT 368, TIMESTAMP '2017-02-13 00:23:07.063 UTC', 33.392820, -111.8262, 'weruio' UNION ALL
  SELECT 141, TIMESTAMP '2017-02-13 00:00:41 UTC', 33.287117, -111.84150, 'oqruqq'
)
SELECT 
  Id, Geohash, MIN(DateTime) AS StartDateTime, SUM(TimeSpent) AS TimeSpent
FROM (
  SELECT 
    Id, Geohash, DateTime, 
    TIMESTAMP_DIFF(LEAD(DateTime) OVER(PARTITION BY Id ORDER BY DateTime), DateTime, SECOND) AS TimeSpent,
    FIRST_VALUE(Geohash) OVER(PARTITION BY Id ORDER BY DateTime) AS FirstGeohash
  FROM yourTable
)
WHERE Geohash = FirstGeohash
GROUP BY Id, Geohash  

The result is as follows:

Id  Geohash     StartDateTime           TimeSpent    
286  abcdef     2017-02-13 12:24:36 UTC       590    
368  weruio     2017-02-13 00:23:07 UTC      null    
141  oqruqq     2017-02-13 00:00:41 UTC      null    

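Note: the 590 above is the sum of the time (in seconds) over three entries, not just the two stated in your question. I assume that was simply a mistake on your side.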
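To make the 590 easier to check, here is a small Python sketch of the arithmetic, not part of the query itself. It assumes that each "abcdef" row of Id 286 is paired with the next row for the same Id, as `LEAD` does, and that each difference is counted in whole seconds, matching the `SECOND` granularity passed to `TIMESTAMP_DIFF`. The `intervals` list is typed in by hand from the sample data.

```python
import pandas as pd

# (row time, next row time) for each 'abcdef' row of Id 286, all UTC.
intervals = [
    ("2017-02-13 12:24:36", "2017-02-13 12:33:02.063"),
    ("2017-02-13 12:33:02.063", "2017-02-13 12:33:05.063"),
    ("2017-02-13 12:33:08", "2017-02-13 12:34:29"),
]

# Whole seconds per interval (floor division by a one-second Timedelta).
seconds = [
    (pd.Timestamp(end) - pd.Timestamp(start)) // pd.Timedelta(seconds=1)
    for start, end in intervals
]

print(seconds)       # [506, 3, 81]
print(sum(seconds))  # 590
```

The middle interval (3 seconds, from the second "abcdef" row to the "cvzvvv" row) is the one the question leaves out.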

Answer 1: (score: 0)

If I understand correctly, you need something like this:

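import pandas as pd  # df is assumed to hold the question's data, with DateTime parsed as datetimes

def timedelta(df):
    # Span from the first to the last row of one (Id, Geohash) group.
    df = df.sort_values(by='DateTime')
    return df.iloc[-1]['DateTime'] - df.iloc[0]['DateTime']

df.groupby(['Id', 'Geohash']).apply(timedelta)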
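Note that the snippet above measures the span between the first and last row of each (Id, Geohash) pair, which is not quite the same as the time spent at the starting Geohash. A rough pandas sketch closer to the question might look like the following. It is only one possible reading: it assumes the time spent on a row runs until the same Id's next row, and it rebuilds the sample data by hand. The column names `TimeSpent` and `FirstGeohash` are introduced here for illustration.

```python
import pandas as pd

# Sample rows from the question; timestamps are UTC, written without the suffix.
df = pd.DataFrame({
    "Id": [286, 286, 286, 286, 286, 368, 141],
    "DateTime": [pd.Timestamp(s) for s in [
        "2017-02-13 12:24:36",
        "2017-02-13 12:33:02.063",
        "2017-02-13 12:33:05.063",
        "2017-02-13 12:33:08",
        "2017-02-13 12:34:29",
        "2017-02-13 00:23:07.063",
        "2017-02-13 00:00:41",
    ]],
    "Geohash": ["abcdef", "abcdef", "cvzvvv", "abcdef", "hsffds", "weruio", "oqruqq"],
})

df = df.sort_values(["Id", "DateTime"])

# Seconds until the same Id's next row; NaN on each Id's last row.
next_time = df.groupby("Id")["DateTime"].shift(-1)
df["TimeSpent"] = (next_time - df["DateTime"]).dt.total_seconds()

# Geohash of each Id's earliest row, repeated on every row of that Id.
df["FirstGeohash"] = df.groupby("Id")["Geohash"].transform("first")

# Keep only rows at the starting Geohash and add up their durations.
at_start = df[df["Geohash"] == df["FirstGeohash"]]
result = at_start.groupby(["Id", "Geohash"])["TimeSpent"].sum(min_count=1)
print(result)
```

For Id 286 this gives roughly 590.063 seconds, since fractional seconds are kept here. Ids with a single row have no next row to measure against, so `min_count=1` leaves them as NaN instead of 0.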