Matching column values in a dataframe

Date: 2019-05-07 20:24:35

Tags: apache-spark pyspark

I have a dataframe that looks like this:

Market         Price  date      outtime  intime  ttype
ATLJFKJFKATL   150    20190403  0215     0600    2
ATLJFK         77     20190403  0215     null    1
JFKATL         88     20190403  0600     null    1
JFKATL         77     20190403  0400     null    1

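I want to take every round-trip row (ttype: round trip = 2, one way = 1), match it with its corresponding one-way rows, and add two columns, each holding the price of the matching one-way. How can I do this?

Resulting dataframe: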

Market         Price  date      outtime  intime  outbound  inbound
ATLJFKJFKATL   150    20190403  0215     0600    77        88

It could also look like this:

Market         Price  date      outtime  intime  inOutList
ATLJFKJFKATL   150    20190403  0215     0600    [77,88]

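Either form works. Sometimes there is no matching one-way, in which case the value would be null or empty.

1 answer:

Answer 0 (score: 0)

You need to join the round trips twice. Your join keys are Market, date and time. The round-trip Market must be split into 6-character codes to match the one-way markets:

First, let's split the dataframe into one-way and round-trip rows: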

import pyspark.sql.functions as psf
single, roundtrip = [df.filter(psf.col('ttype') == i).drop('ttype') for i in [1, 2]]

To extract the outbound and inbound markets of the round trips, we simply use substring:

# substring positions are 1-based: characters 1-6 are the outbound leg, 7-12 the inbound leg
roundtrip = roundtrip \
    .withColumn('outMarket', psf.substring('Market', 1, 6)) \
    .withColumn('inMarket', psf.substring('Market', 7, 6))
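
Spark's substring takes a 1-based start position and a length, which is easy to confuse with Python's 0-based slicing. As a quick sanity check of the split outside Spark, here is a minimal sketch. The helper name split_round_trip is hypothetical, and it assumes every leg is exactly 6 characters (two 3-letter airport codes), as in the sample data.

```python
def split_round_trip(market, leg_len=6):
    """Split a round-trip market code into (outbound, inbound) legs.

    Hypothetical helper for illustration only; assumes each leg is
    exactly `leg_len` characters long.
    """
    if len(market) != 2 * leg_len:
        raise ValueError(
            f"expected {2 * leg_len} characters, got {len(market)}"
        )
    # Same pieces as substring(Market, 1, 6) and substring(Market, 7, 6) in Spark
    return market[:leg_len], market[leg_len:]


out_market, in_market = split_round_trip("ATLJFKJFKATL")
print(out_market, in_market)
```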

We can now join twice (outbound and inbound):

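# Prepare the one-way rows: drop the unused intime and rename the columns
# so that they can be prefixed with 'out' / 'in' below
single = single \
    .drop('intime') \
    .withColumnRenamed('outtime', 'time') \
    .withColumnRenamed('Price', 'bound')
single.persist()  # reused by both joins

for bound in ['out', 'in']:
    # Prefix every column except date (outMarket, outtime, outbound, ...),
    # then left join so round trips without a matching one-way keep nulls
    roundtrip = roundtrip \
        .join(
            single.select([psf.col(c).alias(bound + c) for c in single.columns if c != 'date'] + ['date']),
            on=[bound + c for c in ['Market', 'time']] + ['date'], how='left')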

roundtrip.show()

+--------+------+--------+---------+-------+------------+-----+--------+-------+
|inMarket|intime|    date|outMarket|outtime|      Market|Price|outbound|inbound|
+--------+------+--------+---------+-------+------------+-----+--------+-------+
|  JFKATL|  0600|20190403|   ATLJFK|   0215|ATLJFKJFKATL|  150|      77|     88|
+--------+------+--------+---------+-------+------------+-----+--------+-------+
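
The question also asks for an inOutList variant and for a null when no one-way matches. To make that behaviour concrete, the matching logic of the two left joins can be sketched in plain Python. This is only an illustration of the logic, not Spark code: the function name match_round_trips and the dict-based rows are assumptions made for the example.

```python
def match_round_trips(rows):
    """Attach one-way prices to each round trip, mimicking two left joins."""
    # Index one-way prices by the join key (Market, date, time).
    # If several one-ways share a key, the last one wins in this sketch,
    # whereas a real join would produce one output row per match.
    one_way = {
        (r["Market"], r["date"], r["outtime"]): r["Price"]
        for r in rows
        if r["ttype"] == 1
    }
    result = []
    for r in rows:
        if r["ttype"] != 2:
            continue
        out_market, in_market = r["Market"][:6], r["Market"][6:]
        # dict.get returns None when nothing matches, like a left join's null
        outbound = one_way.get((out_market, r["date"], r["outtime"]))
        inbound = one_way.get((in_market, r["date"], r["intime"]))
        matched = {k: v for k, v in r.items() if k != "ttype"}
        matched["inOutList"] = [outbound, inbound]
        result.append(matched)
    return result


rows = [
    {"Market": "ATLJFKJFKATL", "Price": 150, "date": "20190403",
     "outtime": "0215", "intime": "0600", "ttype": 2},
    {"Market": "ATLJFK", "Price": 77, "date": "20190403",
     "outtime": "0215", "intime": None, "ttype": 1},
    {"Market": "JFKATL", "Price": 88, "date": "20190403",
     "outtime": "0600", "intime": None, "ttype": 1},
    {"Market": "JFKATL", "Price": 77, "date": "20190403",
     "outtime": "0400", "intime": None, "ttype": 1},
]
print(match_round_trips(rows))
```

In Spark itself, the list form could probably be derived from the joined result with something like psf.array('outbound', 'inbound'); a missing match would then show up as a null element inside the array.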