I have a dataframe that looks like this:
Market Price date outtime intime ttype
ATLJFKJFKATL 150 20190403 0215 0600 2
ATLJFK 77 20190403 0215 null 1
JFKATL 88 20190403 0600 null 1
JFKATL 77 20190403 0400 null 1
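I want to take every round-trip row (ttype: round trip = 2, one way = 1), match it with the corresponding one-way rows, and add two columns holding the prices of those one-way rows. How can I do this?
Resulting dataframe: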
Market Price date outtime intime outbound inbound
ATLJFKJFKATL 150 20190403 0215 0600 77 88
It could also look like this:
Market Price date outtime intime inOutList
ATLJFKJFKATL 150 20190403 0215 0600 [77,88]
Either form would work. Sometimes there is no matching one-way row, in which case the value should be null or empty.
Answer 0 (score: 0)
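You need to join the round-trip rows against the one-way rows twice, once for the outbound leg and once for the inbound leg. The join keys are Market, date and time. Each round-trip market has to be split into two 6-character codes so that it can match the one-way markets.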
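To make the matching rule concrete, here is a minimal plain-Python sketch that uses the sample rows from the question. The helper name `match_round_trip` and the dict-based lookup are assumptions made for illustration and are not part of PySpark; a leg with no matching one-way fare comes back as `None`, which plays the role of the null the question asks for.

```python
# One-way fares from the question, keyed by (market, date, departure time).
one_way = {
    ("ATLJFK", "20190403", "0215"): 77,
    ("JFKATL", "20190403", "0600"): 88,
    ("JFKATL", "20190403", "0400"): 77,
}


def match_round_trip(market, date, outtime, intime, fares):
    """Return (outbound, inbound) one-way prices for one round trip.

    Hypothetical helper: splits the 12-character market into two
    6-character legs and looks each leg up; a missing leg gives None.
    """
    out_market, in_market = market[:6], market[6:12]
    outbound = fares.get((out_market, date, outtime))
    inbound = fares.get((in_market, date, intime))
    return outbound, inbound


# The round trip from the question, and a variant whose return time has no match.
result = match_round_trip("ATLJFKJFKATL", "20190403", "0215", "0600", one_way)
missing = match_round_trip("ATLJFKJFKATL", "20190403", "0215", "0700", one_way)
```

The two lookups correspond to the two joins: the outbound leg is matched on the round trip's outtime and the inbound leg on its intime.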
First, let's split the dataframe into one-way and round-trip rows:
import pyspark.sql.functions as psf
single, roundtrip = [df.filter(psf.col('ttype') == i).drop('ttype') for i in [1, 2]]
To extract the outbound and inbound markets of each round trip, we can simply use substring:
# substring positions are 1-based: characters 1-6 are the outbound leg, 7-12 the inbound leg
roundtrip = roundtrip \
    .withColumn('outMarket', psf.substring('Market', 1, 6)) \
    .withColumn('inMarket', psf.substring('Market', 7, 6))
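Spark's `substring` counts character positions from 1 rather than from 0, which is why the inbound leg starts at position 7. The plain-Python sketch below mirrors that convention for positive positions only; `sql_substring` is a hypothetical helper written for this illustration and is not Spark's implementation.

```python
def sql_substring(text, pos, length):
    """Hypothetical stand-in for a 1-based, fixed-length substring.

    Only positive positions are handled in this sketch.
    """
    if pos < 1:
        raise ValueError("pos must be >= 1 in this sketch")
    start = pos - 1  # convert the 1-based position to a 0-based index
    return text[start:start + length]


# Split the round-trip market from the question into its two legs.
out_market = sql_substring("ATLJFKJFKATL", 1, 6)
in_market = sql_substring("ATLJFKJFKATL", 7, 6)
```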
We can now join twice (outbound and inbound):
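# Keep only the join keys and the price, with neutral names that get a prefix per leg
single = single \
    .drop('intime') \
    .withColumnRenamed('outtime', 'time') \
    .withColumnRenamed('Price', 'bound')
single.persist()  # reused by both joins below
for bound in ['out', 'in']:
    # Prefix every column except date; the left join keeps round trips with no match (null price)
    roundtrip = roundtrip \
        .join(
            single.select([psf.col(c).alias(bound + c) for c in single.columns if c != 'date'] + ['date']),
            on=[bound + c for c in ['Market', 'time']] + ['date'], how='left')
roundtrip.show()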
+--------+------+--------+---------+-------+------------+-----+--------+-------+
|inMarket|intime|    date|outMarket|outtime|      Market|Price|outbound|inbound|
+--------+------+--------+---------+-------+------------+-----+--------+-------+
|  JFKATL|  0600|20190403|   ATLJFK|   0215|ATLJFKJFKATL|  150|      77|     88|
+--------+------+--------+---------+-------+------------+-----+--------+-------+