I have a DataFrame like this:
client_username|workstation|session_duration|access_point_name|start_date|
XX1@AD |Apple |1.55 |idf_1 |2019-06-01|
XX2@AD |Apple |30.12 |idf_2 |2019-06-04|
XX3@AD |Apple |78.25 |idf_3 |2019-06-02|
XX4@AD |Apple |0.45 |idf_1 |2019-06-02|
XX1@AD |Apple |23.11 |idf_1 |2019-06-02|
client_username - ID of the user in the domain
workstation - the user's workstation
session_duration - duration (in hours) of the active session (user logged on to his host)
access_point_name - the name of the access point that supplies the network to the user's host
start_date - the date the session started
I want to get a DataFrame like this:
client_username|workstation|session_duration|access_point_name|start_date|
XX1@AD |Apple |1.55 |idf_1 |2019-06-01|
XX2@AD |Apple |8 |idf_2 |2019-06-04|
XX2@AD |Apple |8 |idf_2 |2019-06-05|
XX3@AD |Apple |8 |idf_3 |2019-06-02|
XX3@AD |Apple |8 |idf_3 |2019-06-03|
XX3@AD |Apple |8 |idf_3 |2019-06-04|
XX3@AD |Apple |8 |idf_3 |2019-06-05|
XX4@AD |Apple |0.45 |idf_1 |2019-06-02|
XX1@AD |Apple |23.11 |idf_1 |2019-06-02|
The idea is as follows: if the session duration is more than 24 hours but less than 48 hours, I want to change this:
XX2@AD |Apple |30.12 |idf_2 |2019-06-04|
to this:
XX2@AD |Apple |8 |idf_2 |2019-06-04|
XX2@AD |Apple |8 |idf_2 |2019-06-05|
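The session duration changes to 8 hours, but the number of days grows to two (2019-06-04 and 2019-06-05). The same applies to durations of more than 48 hours (3 days), more than 72 hours (4 days), and so on.
I have just started learning pyspark. I tried using union or crossJoin on the DataFrame, but that is too complicated for me at the moment. I would like to solve this task with pyspark.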
Answer 0 (score: 1)
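You can try one of the following approaches.
Method-1: build the repeated values with string functions
1. Calculate the number of repeats: n = ceil(session_duration/24)
2. Create a string a that repeats the substring "8," n times, then use substring() or regexp_replace() to remove the trailing comma
3. Split a by comma, then posexplode it into rows of pos and session_duration
4. Adjust start_date by pos
5. Cast session_duration to double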
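Before looking at Spark, the rule behind these steps can be sketched in plain Python for a single row. This is only an illustration under stated assumptions: the helper name expand_session and its hours_per_day parameter are hypothetical and do not appear in the answer's code, and a session of 24 hours or less is left unchanged, matching the n > 1 condition used in the answer.

```python
from datetime import date, timedelta
from math import ceil


def expand_session(session_duration, start_date, hours_per_day=8.0):
    """Return the (session_duration, start_date) rows for one input session.

    Hypothetical helper that mirrors steps 1-5 for a single row.
    """
    # Step 1: number of days the session is spread over
    n = ceil(session_duration / 24)
    if n <= 1:
        # 24 hours or less: keep the row as it is
        return [(session_duration, start_date)]
    # Steps 2-5: one 8-hour row per day, with the date shifted by its position
    return [(hours_per_day, start_date + timedelta(days=pos)) for pos in range(n)]


for row in expand_session(30.12, date(2019, 6, 4)):
    print(row)
```

For the XX2@AD row (30.12 hours starting on 2019-06-04) this yields two 8-hour rows, dated 2019-06-04 and 2019-06-05.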
See the PySpark code example below:
from pyspark.sql import functions as F
# assume the columns in your dataframe are read with proper data types
# for example using inferSchema=True
df = spark.read.csv('/path/to/file', header=True, inferSchema=True)
df1 = df.withColumn('n', F.ceil(F.col('session_duration')/24).astype('int')) \
.withColumn('a', F.when(F.col('n')>1, F.expr('substring(repeat("8,",n),0,2*n-1)')).otherwise(F.col('session_duration')))
>>> df1.show()
+---------------+-----------+----------------+-----------------+-------------------+---+-------+
|client_username|workstation|session_duration|access_point_name| start_date| n| a|
+---------------+-----------+----------------+-----------------+-------------------+---+-------+
| XX1@AD| Apple| 1.55| idf_1|2019-06-01 00:00:00| 1| 1.55|
| XX2@AD| Apple| 30.12| idf_2|2019-06-04 00:00:00| 2| 8,8|
| XX3@AD| Apple| 78.25| idf_3|2019-06-02 00:00:00| 4|8,8,8,8|
| XX4@AD| Apple| 0.45| idf_1|2019-06-02 00:00:00| 1| 0.45|
| XX1@AD| Apple| 23.11| idf_1|2019-06-02 00:00:00| 1| 23.11|
+---------------+-----------+----------------+-----------------+-------------------+---+-------+
df_new = df1.select(
'client_username'
, 'workstation'
, F.posexplode(F.split('a', ',')).alias('pos', 'session_duration')
, 'access_point_name'
, F.expr('date_add(start_date, pos)').alias('start_date')
).drop('pos')
>>> df_new.show()
+---------------+-----------+----------------+-----------------+----------+
|client_username|workstation|session_duration|access_point_name|start_date|
+---------------+-----------+----------------+-----------------+----------+
| XX1@AD| Apple| 1.55| idf_1|2019-06-01|
| XX2@AD| Apple| 8| idf_2|2019-06-04|
| XX2@AD| Apple| 8| idf_2|2019-06-05|
| XX3@AD| Apple| 8| idf_3|2019-06-02|
| XX3@AD| Apple| 8| idf_3|2019-06-03|
| XX3@AD| Apple| 8| idf_3|2019-06-04|
| XX3@AD| Apple| 8| idf_3|2019-06-05|
| XX4@AD| Apple| 0.45| idf_1|2019-06-02|
| XX1@AD| Apple| 23.11| idf_1|2019-06-02|
+---------------+-----------+----------------+-----------------+----------+
The code above can also be written as a single chain:
df_new = df.withColumn('n'
, F.ceil(F.col('session_duration')/24).astype('int')
).withColumn('a'
, F.when(F.col('n')>1, F.expr('substring(repeat("8,",n),0,2*n-1)')).otherwise(F.col('session_duration'))
).select('client_username'
, 'workstation'
, F.posexplode(F.split('a', ',')).alias('pos', 'session_duration')
, 'access_point_name'
, F.expr('date_add(start_date, pos)').alias('start_date')
).withColumn('session_duration'
, F.col('session_duration').astype('double')
).drop('pos')
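To see why the length 2*n-1 in substring(repeat("8,",n),0,2*n-1) removes exactly the trailing comma, the same string manipulation can be mimicked in plain Python. This is a rough analogy rather than Spark code; the function name repeat_eights is made up for illustration.

```python
def repeat_eights(n):
    # repeat("8,", n) produces "8,8,...,8," which is 2*n characters long
    repeated = "8," * n
    # keeping the first 2*n-1 characters drops only the trailing comma
    trimmed = repeated[:2 * n - 1]
    # splitting on "," gives n items; enumerate plays the role of pos in posexplode
    return list(enumerate(trimmed.split(",")))


print(repeat_eights(4))  # [(0, '8'), (1, '8'), (2, '8'), (3, '8')]
```

Each position returned here corresponds to the pos value that is added to start_date, so the first row keeps the original date.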
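Method-2: similar to Method-1, but a is already an array, so there is no need to split a string into an array. This relies on array_repeat, which is available from Spark 2.4: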
df1 = df.withColumn('n', F.ceil(F.col('session_duration')/24).astype('int')) \
.withColumn('a', F.when(F.col('n')>1, F.expr('array_repeat(8,n)')).otherwise(F.array('session_duration')))
>>> df1.show()
+---------------+-----------+----------------+-----------------+-------------------+---+--------------------+
|client_username|workstation|session_duration|access_point_name| start_date| n| a|
+---------------+-----------+----------------+-----------------+-------------------+---+--------------------+
| XX1@AD| Apple| 1.55| idf_1|2019-06-01 00:00:00| 1| [1.55]|
| XX2@AD| Apple| 30.12| idf_2|2019-06-04 00:00:00| 2| [8.0, 8.0]|
| XX3@AD| Apple| 78.25| idf_3|2019-06-02 00:00:00| 4|[8.0, 8.0, 8.0, 8.0]|
| XX4@AD| Apple| 0.45| idf_1|2019-06-02 00:00:00| 1| [0.45]|
| XX1@AD| Apple| 23.11| idf_1|2019-06-02 00:00:00| 1| [23.11]|
+---------------+-----------+----------------+-----------------+-------------------+---+--------------------+
df_new = df1.select('client_username'
, 'workstation'
, F.posexplode('a').alias('pos', 'session_duration')
, 'access_point_name'
, F.expr('date_add(start_date, pos)').alias('start_date')
).drop('pos')