Question

Trips

id,timestamp
1008,2003-11-03 15:00:31
1008,2003-11-03 15:02:38
1008,2003-11-03 15:03:04
1008,2003-11-03 15:18:00
1009,2003-11-03 22:00:00
1009,2003-11-03 22:02:53
1009,2003-11-03 22:03:44
1009,2003-11-14 10:00:00
1009,2003-11-14 10:02:02
1009,2003-11-14 10:03:10

Prompts

id,timestamp,mode
1008,2003-11-03 15:18:49,car
1009,2003-11-03 22:04:20,metro
1009,2003-11-14 10:04:20,bike

Reading the CSV files:

import pandas as pd
coordinates = pd.read_csv('coordinates.csv')
mode = pd.read_csv('prompts.csv')

I have to assign each mode to the last row of the trip it belongs to.

Desired result:

id, timestamp, mode
1008, 2003-11-03 15:00:31, null
1008, 2003-11-03 15:02:38, null
1008, 2003-11-03 15:03:04, null
1008, 2003-11-03 15:18:00, car
1009, 2003-11-03 22:00:00, null
1009, 2003-11-03 22:02:53, null
1009, 2003-11-03 22:03:44, metro
1009, 2003-11-14 10:00:00, null
1009, 2003-11-14 10:02:02, null
1009, 2003-11-14 10:03:10, bike

Note

The trips dataset is large (4 GB); the modes dataset is smaller (500 MB).

Answer 1

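Based on your updated example, you can identify a trip by finding the first prompt timestamp that is greater than or equal to the trip timestamp. All rows that share the same prompt timestamp belong to the same trip. You then set the mode on the row with the greatest trip timestamp in each group.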
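The grouping rule can be sketched in plain Python on the sample rows before scaling it up. This is only an illustration of the logic, not the Spark solution; the function name `assign_modes` and the in-memory lists are assumptions made for the example.

```python
from bisect import bisect_left
from collections import defaultdict

def assign_modes(trips, prompts):
    """trips: list of (id, timestamp); prompts: list of (id, timestamp, mode).
    Timestamps are 'YYYY-MM-DD HH:MM:SS' strings, which sort chronologically."""
    # Prompt (timestamp, mode) pairs per id, in ascending timestamp order.
    by_id = defaultdict(list)
    for pid, ts, mode in sorted(prompts):
        by_id[pid].append((ts, mode))

    # The first prompt at or after a trip row identifies the trip.
    # Remember the latest trip row seen for each (id, prompt timestamp).
    last_row = {}
    for tid, ts in trips:
        candidates = by_id.get(tid, [])
        i = bisect_left(candidates, (ts,))
        if i < len(candidates):
            prompt_ts, mode = candidates[i]
            key = (tid, prompt_ts)
            if key not in last_row or ts > last_row[key][0]:
                last_row[key] = (ts, mode)

    # Only the last row of each trip gets a mode; everything else is None.
    tagged = {(tid, trip_ts): mode
              for (tid, _), (trip_ts, mode) in last_row.items()}
    return [(tid, ts, tagged.get((tid, ts))) for tid, ts in trips]

trips = [
    (1008, '2003-11-03 15:03:04'),
    (1008, '2003-11-03 15:18:00'),
    (1009, '2003-11-03 22:02:53'),
    (1009, '2003-11-03 22:03:44'),
    (1009, '2003-11-14 10:03:10'),
    (1009, '2003-11-15 10:00:00'),  # later than every prompt: stays unmatched
]
prompts = [
    (1008, '2003-11-03 15:18:49', 'car'),
    (1009, '2003-11-03 22:04:20', 'metro'),
    (1009, '2003-11-14 10:04:20', 'bike'),
]
result = assign_modes(trips, prompts)
```

Rows that come after the last prompt for their id find no candidate and keep `None`, which is the case the unmatched step has to handle separately in Spark.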

One way to do this is with two pyspark.sql.Window objects.

Suppose you start with the following two PySpark DataFrames, trips and prompts:

trips.show(truncate=False)
#+----+-------------------+
#|id  |timestamp          |
#+----+-------------------+
#|1008|2003-11-03 15:00:31|
#|1008|2003-11-03 15:02:38|
#|1008|2003-11-03 15:03:04|
#|1008|2003-11-03 15:18:00|
#|1009|2003-11-03 22:00:00|
#|1009|2003-11-03 22:02:53|
#|1009|2003-11-03 22:03:44|
#|1009|2003-11-14 10:00:00|
#|1009|2003-11-14 10:02:02|
#|1009|2003-11-14 10:03:10|
#|1009|2003-11-15 10:00:00|
#+----+-------------------+

prompts.show(truncate=False)
#+----+-------------------+-----+
#|id  |timestamp          |mode |
#+----+-------------------+-----+
#|1008|2003-11-03 15:18:49|car  |
#|1009|2003-11-03 22:04:20|metro|
#|1009|2003-11-14 10:04:20|bike |
#+----+-------------------+-----+

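Join the two tables on the id column, with the condition that the prompt timestamp is greater than or equal to the trip timestamp. For some trip timestamps this produces several prompt timestamps. We can eliminate the extras by selecting the minimum prompt timestamp for each ('id', 'trips.timestamp') group. I call this temporary column indicator and compute it with the window w1.

Next, apply a window over ('id', 'indicator') to find the maximum trip timestamp in each group. Set mode on that row; every other row is set to pyspark.sql.functions.lit(None).

Finally, compute all entries in trips whose trip timestamp is greater than the maximum prompt timestamp. These are the trips that did not match any prompt. Union the matched and unmatched results.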

import pyspark.sql.functions as f
from pyspark.sql import Window

w1 = Window.partitionBy('id', 'trips.timestamp')
w2 = Window.partitionBy('id', 'indicator')

matched = trips.alias('trips').join(prompts.alias('prompts'), on='id', how='left')\
    .where('prompts.timestamp >= trips.timestamp' )\
    .select(
        'id',
        'trips.timestamp',
        'mode',
        f.when(
            f.col('prompts.timestamp') == f.min('prompts.timestamp').over(w1),
            f.col('prompts.timestamp'),
        ).otherwise(f.lit(None)).alias('indicator')
    )\
    .where(~f.isnull('indicator'))\
    .select(
        'id',
        f.col('trips.timestamp').alias('timestamp'),
        f.when(
            f.col('trips.timestamp') == f.max(f.col('trips.timestamp')).over(w2),
            f.col('mode')
        ).otherwise(f.lit(None)).alias('mode')
    )

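# Trip rows later than every prompt for their id never match. The IS NULL check
# also keeps rows whose id has no prompt at all, which the join would otherwise drop.
unmatched = trips.alias('t').join(prompts.alias('p'), on='id', how='left')\
    .withColumn('max_prompt_time', f.max('p.timestamp').over(Window.partitionBy('id')))\
    .where('max_prompt_time IS NULL OR t.timestamp > max_prompt_time')\
    .select('id', 't.timestamp', f.lit(None).alias('mode'))\
    .distinct()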

Output:

matched.union(unmatched).sort('id', 'timestamp').show()

+----+-------------------+-----+
|  id|          timestamp| mode|
+----+-------------------+-----+
|1008|2003-11-03 15:00:31| null|
|1008|2003-11-03 15:02:38| null|
|1008|2003-11-03 15:03:04| null|
|1008|2003-11-03 15:18:00|  car|
|1009|2003-11-03 22:00:00| null|
|1009|2003-11-03 22:02:53| null|
|1009|2003-11-03 22:03:44|metro|
|1009|2003-11-14 10:00:00| null|
|1009|2003-11-14 10:02:02| null|
|1009|2003-11-14 10:03:10| bike|
|1009|2003-11-15 10:00:00| null|
+----+-------------------+-----+
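Since the question loads its data with pandas, the same matching can also be sketched there with pd.merge_asof. This is a different technique from the window approach above, not a translation of it, and it assumes the frames fit in memory. The sample frames and the helper column name prompt_time are made up for the illustration.

```python
import pandas as pd

trips = pd.DataFrame({
    'id': [1008, 1008, 1009, 1009, 1009, 1009, 1009],
    'timestamp': pd.to_datetime([
        '2003-11-03 15:03:04', '2003-11-03 15:18:00',
        '2003-11-03 22:02:53', '2003-11-03 22:03:44',
        '2003-11-14 10:02:02', '2003-11-14 10:03:10',
        '2003-11-15 10:00:00',
    ]),
})
prompts = pd.DataFrame({
    'id': [1008, 1009, 1009],
    'timestamp': pd.to_datetime([
        '2003-11-03 15:18:49', '2003-11-03 22:04:20', '2003-11-14 10:04:20',
    ]),
    'mode': ['car', 'metro', 'bike'],
})

# merge_asof requires both frames to be sorted by their time key.
left = trips.sort_values('timestamp')
right = prompts.sort_values('timestamp').rename(columns={'timestamp': 'prompt_time'})

# direction='forward' picks, per id, the first prompt at or after each trip row.
merged = pd.merge_asof(left, right, left_on='timestamp', right_on='prompt_time',
                       by='id', direction='forward')

# Keep the mode only on the last row of each (id, prompt_time) group.
merged = merged.sort_values(['id', 'timestamp'])
is_last = ~merged.duplicated(['id', 'prompt_time'], keep='last')
merged['mode'] = merged['mode'].where(is_last)

result = merged.drop(columns='prompt_time').reset_index(drop=True)
```

Rows with no later prompt get a missing prompt_time and a missing mode from the merge itself, so no separate unmatched step is needed in this variant.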

Answer 2

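This is a naive solution. It assumes that your coordinates DataFrame is already sorted by timestamp, that the ids are unique, and that your dataset fits in memory. If it does not fit in memory, I suggest using dask and partitioning your DataFrame by id.

Imports: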

import pandas as pd
import numpy as np

First we join the two DataFrames. This fills the entire mode column for each id. We join on the index because that speeds up the operation; see also "Improve Pandas Merge performance".

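mode = mode.set_index('id')
coordinates = coordinates.set_index('id')
# Both frames have a timestamp column, so join only the mode column to avoid an overlap error.
merged = coordinates.join(mode[['mode']], how='left')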

We need the index to hold unique values for the groupby operation to work properly.

merged = merged.reset_index()

Then we apply a function that sets the mode column to NaN on every row except the last one for each id.

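def clean_mode_col(df):
    # Keep the mode only on the last row of the group; blank out the rest.
    cleaned_mode_col = df['mode'].copy()
    cleaned_mode_col.iloc[:-1] = np.nan
    df = df.copy()  # avoid mutating the group passed in by groupby
    df['mode'] = cleaned_mode_col
    return df
# group_keys=False keeps the original flat index on the result.
merged = merged.groupby('id', group_keys=False).apply(clean_mode_col)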
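A vectorized alternative to the groupby/apply step can be sketched with DataFrame.duplicated, under the same assumptions (rows sorted by timestamp, one trip per id). The frame named sample is invented for the illustration and stands in for the merged frame.

```python
import numpy as np
import pandas as pd

# Stand-in for the joined frame: every row carries the mode of its id.
sample = pd.DataFrame({
    'id': [1008, 1008, 1008, 1009, 1009],
    'timestamp': ['2003-11-03 15:02:38', '2003-11-03 15:03:04',
                  '2003-11-03 15:18:00', '2003-11-03 22:02:53',
                  '2003-11-03 22:03:44'],
    'mode': ['car', 'car', 'car', 'metro', 'metro'],
})

# True for every row that is followed by another row with the same id,
# i.e. every row except the last one of each id.
not_last = sample.duplicated('id', keep='last')
sample.loc[not_last, 'mode'] = np.nan
```

This avoids calling a Python function once per group, which tends to matter when there are many ids.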

As mentioned above, you can use dask to parallelize the merge as follows:

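import dask.dataframe as dd
# Assumes coordinates and mode are freshly loaded, with id still a regular column.
# npartitions=8 is an arbitrary starting value; tune it to your data and cores.
dd_coordinates = dd.from_pandas(coordinates, npartitions=8).set_index('id')
dd_mode = dd.from_pandas(mode, npartitions=8).set_index('id')
# how='left' mirrors the pandas join above and keeps trips that have no prompt.
merged = dd.merge(dd_coordinates, dd_mode, how='left', left_index=True, right_index=True)
merged = merged.compute()  # returns a pandas DataFrame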

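The set_index operation is slow, but it makes the merge much faster.

I have not tested this code. Please provide copy-pasteable code that contains your DataFrames, so that I do not have to copy and paste all the files from your description (hint: use pd.DataFrame.to_dict to export a DataFrame as a dictionary, then paste the result into your code).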
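For reference, a minimal sketch of that hint, using the prompts sample from the question. The variable names are made up, and orient='list' is just one of the orientations to_dict supports.

```python
import pandas as pd

prompts_df = pd.DataFrame({
    'id': [1008, 1009, 1009],
    'timestamp': ['2003-11-03 15:18:49', '2003-11-03 22:04:20',
                  '2003-11-14 10:04:20'],
    'mode': ['car', 'metro', 'bike'],
})

# Export to a plain dict that can be pasted into a question as a literal.
as_dict = prompts_df.to_dict(orient='list')
print(as_dict)

# Anyone can rebuild the same frame from the pasted dict.
rebuilt = pd.DataFrame(as_dict)
```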

How can I attach prompts to the end of trips?

2 answers: