Add new rows to a pyspark dataframe based on values

Date: 2019-06-15 16:56:19

Tags: pyspark

I have a dataframe like this:

client_username|workstation|session_duration|access_point_name|start_date|
XX1@AD         |Apple      |1.55            |idf_1            |2019-06-01|
XX2@AD         |Apple      |30.12           |idf_2            |2019-06-04|
XX3@AD         |Apple      |78.25           |idf_3            |2019-06-02|
XX4@AD         |Apple      |0.45            |idf_1            |2019-06-02|
XX1@AD         |Apple      |23.11           |idf_1            |2019-06-02|

client_username - user id in the domain
workstation - user's workstation
session_duration - duration (in hours) of the active session (user logged on to their host)
access_point_name - name of the access point that supplies the network to the user's host
start_date - session start date

I want to transform it into a dataframe like this:

client_username|workstation|session_duration|access_point_name|start_date|
XX1@AD         |Apple      |1.55            |idf_1            |2019-06-01|
XX2@AD         |Apple      |8               |idf_2            |2019-06-04|
XX2@AD         |Apple      |8               |idf_2            |2019-06-05|
XX3@AD         |Apple      |8               |idf_3            |2019-06-02|
XX3@AD         |Apple      |8               |idf_3            |2019-06-03|
XX3@AD         |Apple      |8               |idf_3            |2019-06-04|
XX3@AD         |Apple      |8               |idf_3            |2019-06-05|
XX4@AD         |Apple      |0.45            |idf_1            |2019-06-02|
XX1@AD         |Apple      |23.11           |idf_1            |2019-06-02|

The idea is as follows: if a session duration is longer than 24 hours but shorter than 48 hours, I want to change this:

XX2@AD         |Apple      |30.12           |idf_2            |2019-06-04|

into this:

XX2@AD         |Apple      |8               |idf_2            |2019-06-04|
XX2@AD         |Apple      |8               |idf_2            |2019-06-05|

The session duration is changed to 8 hours, and the row is spread over two days (2019-06-04 and 2019-06-05). The same applies analogously to durations longer than 48 hours (3 days), 72 hours (4 days), and so on.
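In plain Python, the intended splitting rule can be sketched as follows (an illustration of the logic only, not the pyspark solution; the function name is made up for this example):

```python
import math
from datetime import date, timedelta

def split_session(start_date, duration):
    """Split a session longer than 24h into n = ceil(duration/24) rows,
    each lasting 8 hours, on consecutive start dates."""
    n = math.ceil(duration / 24)
    if n <= 1:
        # sessions up to 24h are kept unchanged
        return [(start_date, duration)]
    return [(start_date + timedelta(days=i), 8.0) for i in range(n)]

rows = split_session(date(2019, 6, 4), 30.12)
# -> [(date(2019, 6, 4), 8.0), (date(2019, 6, 5), 8.0)]
```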

I am just starting to learn pyspark. I tried using union and crossJoin on dataframes, but this is still too complicated for me at the moment. I would like to solve this task with pyspark.

1 Answer:

Answer 0: (score: 1)

You can try the following two methods:

Method 1: string functions repeat and substring

  1. Calculate the number of repetitions n = ceil(session_duration/24)
  2. Create a string a that repeats the substring "8," n times, then use substring() or regexp_replace() to remove the trailing comma
  3. Split a on "," and posexplode it into rows of pos and session_duration
  4. Adjust start_date by pos from the step above
  5. Cast the string session_duration to double
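The same transformation can be traced in plain Python to see what each step produces (illustration only; the actual solution uses Spark SQL functions):

```python
import math

duration = 30.12
n = math.ceil(duration / 24)                  # step 1: n = 2
a = ("8," * n)[:2 * n - 1]                    # step 2: repeat "8," n times, drop trailing comma -> "8,8"
exploded = list(enumerate(a.split(",")))      # steps 3-4: posexplode -> [(0, '8'), (1, '8')]
durations = [float(d) for _, d in exploded]   # step 5: cast to double -> [8.0, 8.0]
```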

See the code example below:

from pyspark.sql import functions as F

# assume the columns in your dataframe are read with proper data types
# for example using inferSchema=True
df = spark.read.csv('/path/to/file', header=True, inferSchema=True)

df1 = df.withColumn('n', F.ceil(F.col('session_duration')/24).astype('int')) \
        .withColumn('a', F.when(F.col('n')>1, F.expr('substring(repeat("8,",n),0,2*n-1)')).otherwise(F.col('session_duration')))

>>> df1.show()
+---------------+-----------+----------------+-----------------+-------------------+---+-------+
|client_username|workstation|session_duration|access_point_name|         start_date|  n|      a|
+---------------+-----------+----------------+-----------------+-------------------+---+-------+
|         XX1@AD|      Apple|            1.55|            idf_1|2019-06-01 00:00:00|  1|   1.55|
|         XX2@AD|      Apple|           30.12|            idf_2|2019-06-04 00:00:00|  2|    8,8|
|         XX3@AD|      Apple|           78.25|            idf_3|2019-06-02 00:00:00|  4|8,8,8,8|
|         XX4@AD|      Apple|            0.45|            idf_1|2019-06-02 00:00:00|  1|   0.45|
|         XX1@AD|      Apple|           23.11|            idf_1|2019-06-02 00:00:00|  1|  23.11|
+---------------+-----------+----------------+-----------------+-------------------+---+-------+

df_new = df1.select(
          'client_username'
        , 'workstation'
        , F.posexplode(F.split('a', ',')).alias('pos', 'session_duration')
        , 'access_point_name'
        , F.expr('date_add(start_date, pos)').alias('start_date')
    ).drop('pos')

>>> df_new.show()
+---------------+-----------+----------------+-----------------+----------+
|client_username|workstation|session_duration|access_point_name|start_date|
+---------------+-----------+----------------+-----------------+----------+
|         XX1@AD|      Apple|            1.55|            idf_1|2019-06-01|
|         XX2@AD|      Apple|               8|            idf_2|2019-06-04|
|         XX2@AD|      Apple|               8|            idf_2|2019-06-05|
|         XX3@AD|      Apple|               8|            idf_3|2019-06-02|
|         XX3@AD|      Apple|               8|            idf_3|2019-06-03|
|         XX3@AD|      Apple|               8|            idf_3|2019-06-04|
|         XX3@AD|      Apple|               8|            idf_3|2019-06-05|
|         XX4@AD|      Apple|            0.45|            idf_1|2019-06-02|
|         XX1@AD|      Apple|           23.11|            idf_1|2019-06-02|
+---------------+-----------+----------------+-----------------+----------+

The above code can also be written as one chain:

df_new = df.withColumn('n'
                , F.ceil(F.col('session_duration')/24).astype('int')
          ).withColumn('a'
                , F.when(F.col('n')>1, F.expr('substring(repeat("8,",n),0,2*n-1)')).otherwise(F.col('session_duration'))
          ).select('client_username'
                , 'workstation'
                , F.posexplode(F.split('a', ',')).alias('pos', 'session_duration')
                , 'access_point_name'
                , F.expr('date_add(start_date, pos)').alias('start_date')
          ).withColumn('session_duration'
                , F.col('session_duration').astype('double')
          ).drop('pos')

Method 2: array function array_repeat (pyspark 2.4+)

Similar to Method-1, but a is already an array, so there is no need to split a string into an array:

df1 = df.withColumn('n', F.ceil(F.col('session_duration')/24).astype('int')) \
        .withColumn('a', F.when(F.col('n')>1, F.expr('array_repeat(8,n)')).otherwise(F.array('session_duration')))

>>> df1.show()
+---------------+-----------+----------------+-----------------+-------------------+---+--------------------+
|client_username|workstation|session_duration|access_point_name|         start_date|  n|                   a|
+---------------+-----------+----------------+-----------------+-------------------+---+--------------------+
|         XX1@AD|      Apple|            1.55|            idf_1|2019-06-01 00:00:00|  1|              [1.55]|
|         XX2@AD|      Apple|           30.12|            idf_2|2019-06-04 00:00:00|  2|          [8.0, 8.0]|
|         XX3@AD|      Apple|           78.25|            idf_3|2019-06-02 00:00:00|  4|[8.0, 8.0, 8.0, 8.0]|
|         XX4@AD|      Apple|            0.45|            idf_1|2019-06-02 00:00:00|  1|              [0.45]|
|         XX1@AD|      Apple|           23.11|            idf_1|2019-06-02 00:00:00|  1|             [23.11]|
+---------------+-----------+----------------+-----------------+-------------------+---+--------------------+

df_new = df1.select('client_username'
            , 'workstation'
            , F.posexplode('a').alias('pos', 'session_duration')
            , 'access_point_name'
            , F.expr('date_add(start_date, pos)').alias('start_date')
       ).drop('pos')
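With array_repeat the string handling from Method-1 disappears; in plain Python the equivalent is simply list repetition (illustration only). Note that since the array elements are already numeric, Method-2 needs no final cast of session_duration to double:

```python
import math

duration = 78.25
n = math.ceil(duration / 24)              # n = 4
# array_repeat(8, n) when n > 1, otherwise keep the original duration in a 1-element array
a = [8.0] * n if n > 1 else [duration]    # -> [8.0, 8.0, 8.0, 8.0]
exploded = list(enumerate(a))             # posexplode -> [(0, 8.0), (1, 8.0), ...]
```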