Comparing strings and substrings in a PySpark data frame

Asked: 2018-06-21 04:29:14

Tags: python apache-spark pyspark

I have a data frame in pyspark as shown below.

df.show()

+---+----+----+------------+
| id|name|city|      ip_add|
+---+----+----+------------+
|  1| sam| Hyd|  191.10.0.1|
|  2| Tim| Mum|    10.0.0.1|
|  3| Jim| Mum|    10.0.0.1|
|  4| sam| SFO|222.19.18.15|
|  5|same| HOU| 12.10.12.07|
+---+----+----+------------+

I want to populate some new columns based on a few lists. The lists are as follows.

name_list = ['sam']
city_list = ['Mum']
ip_list = ['191.10', '10.0']

Conditions for populating the new columns:

  1. Populate column name_check with Y if name is in name_list (i.e. equals sam), otherwise N.
  2. Populate column city_check with Y if city is in city_list (i.e. equals Mum), otherwise N.
  3. Populate column ip_check with Y if the first two dot-separated parts of ip_add (e.g. 191.10 or 10.0) are in ip_list, otherwise N.

I have defined a function as below. I want to reuse the same function so that I don't have to repeat code.

from pyspark.sql.functions import when
def new_column(df, compare_list, column_to_add, column_to_check):
    final_df = df.withColumn(column_to_add, when(df[column_to_check].isin(compare_list), "Y").otherwise('N'))
    return final_df

Variables for the first column, name_check:

name_column_to_add = 'name_check'
name_column_to_check = 'name'

Calling the function:

name_df = new_column(df, name_list, name_column_to_add, name_column_to_check)
name_df.show()

+---+----+----+------------+----------+
| id|name|city|      ip_add|name_check|
+---+----+----+------------+----------+
|  1| sam| Hyd|  191.10.0.1|         Y|
|  2| Tim| Mum|    10.0.0.1|         N|
|  3| Jim| Mum|    10.0.0.1|         N|
|  4| sam| SFO|222.19.18.15|         Y|
|  5|same| HOU| 12.10.12.07|         N|
+---+----+----+------------+----------+

Variables for the second column, city_check:

city_column_to_add = 'city_check'
city_column_to_check = 'city'

Calling the function:

city_df = new_column(name_df, city_list, city_column_to_add, city_column_to_check)
city_df.show()

+---+----+----+------------+----------+----------+
| id|name|city|      ip_add|name_check|city_check|
+---+----+----+------------+----------+----------+
|  1| sam| Hyd|  191.10.0.1|         Y|         N|
|  2| Tim| Mum|    10.0.0.1|         N|         Y|
|  3| Jim| Mum|    10.0.0.1|         N|         Y|
|  4| sam| SFO|222.19.18.15|         Y|         N|
|  5|same| HOU| 12.10.12.07|         N|         N|
+---+----+----+------------+----------+----------+

Variables for the third column, ip_check:

ip_column_to_add = 'ip_check'
ip_column_to_check = 'ip_add'

Calling the function:

ip_df = new_column(city_df, ip_list, ip_column_to_add, ip_column_to_check)
ip_df.show()

+---+----+----+------------+----------+----------+--------+
| id|name|city|      ip_add|name_check|city_check|ip_check|
+---+----+----+------------+----------+----------+--------+
|  1| sam| Hyd|  191.10.0.1|         Y|         N|       N|
|  2| Tim| Mum|    10.0.0.1|         N|         Y|       N|
|  3| Jim| Mum|    10.0.0.1|         N|         Y|       N|
|  4| sam| SFO|222.19.18.15|         Y|         N|       N|
|  5|same| HOU| 12.10.12.07|         N|         N|       N|
+---+----+----+------------+----------+----------+--------+
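The ip_check column comes back all N because isin compares the full ip_add string against ip_list, and a full address never equals a two-part prefix. A minimal pure-Python sketch of the same mismatch (plain strings, no Spark):

```python
ip_list = ['191.10', '10.0']
ip_add = '191.10.0.1'

# exact membership test, like isin does per row: the full address
# never equals a two-part prefix, so this is False
print(ip_add in ip_list)

# what the check actually needs: only the first two dot-separated parts
prefix = '.'.join(ip_add.split('.')[:2])
print(prefix)            # 191.10
print(prefix in ip_list)
```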

Expected result:

+---+----+----+------------+----------+----------+--------+
| id|name|city|      ip_add|name_check|city_check|ip_check|
+---+----+----+------------+----------+----------+--------+
|  1| sam| Hyd|  191.10.0.1|         Y|         N|       Y|
|  2| Tim| Mum|    10.0.0.1|         N|         Y|       Y|
|  3| Jim| Mum|    10.0.0.1|         N|         Y|       Y|
|  4| sam| SFO|222.19.18.15|         Y|         N|       N|
|  5|same| HOU| 12.10.12.07|         N|         N|       N|
+---+----+----+------------+----------+----------+--------+

How can I achieve the expected result?

2 answers:

Answer 0 (score: 3)

Here is your code, modified to work:

name_list = ['sam']
city_list = ['Mum']
ip_list = ['191.10', '10.0']

from pyspark.sql import functions as f
def new_column(df, compare_list, column_to_add, column_to_check):
    final_df = df.withColumn(column_to_add, f.when(column_to_check.isin(compare_list), "Y").otherwise('N'))
    return final_df

name_column_to_add = 'name_check'
name_column_to_check = 'name'

name_df = new_column(df, name_list, name_column_to_add, f.col(name_column_to_check))

city_column_to_add = 'city_check'
city_column_to_check = 'city'

city_df = new_column(name_df, city_list, city_column_to_add, f.col(city_column_to_check))

ip_column_to_add = 'ip_check'
ip_column_to_check = 'ip_add'

ip_df = new_column(city_df, ip_list, ip_column_to_add, f.concat_ws('.', f.split(f.col(ip_column_to_check), '\\.')[0], f.split(f.col(ip_column_to_check), '\\.')[1]))

ip_df.show()

All you have to do is substring the ip address to get only the first two parts separated by a dot, using the split and concat_ws functions, and modify your new_column function to accept a Column as its last argument.
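Row by row, the f.concat_ws('.', f.split(...)[0], f.split(...)[1]) expression above computes the same thing as this plain-Python sketch:

```python
ip_add = '222.19.18.15'

# split on the dot; Spark's split takes a regex, hence the escaped '\\.'
parts = ip_add.split('.')

# concat_ws('.', parts[0], parts[1]) joins the first two parts back together
first_two = '.'.join([parts[0], parts[1]])
print(first_two)   # 222.19
```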

So you should now have:

+---+----+----+------------+----------+----------+--------+
| id|name|city|      ip_add|name_check|city_check|ip_check|
+---+----+----+------------+----------+----------+--------+
|  1| sam| Hyd|  191.10.0.1|         Y|         N|       Y|
|  2| Tim| Mum|    10.0.0.1|         N|         Y|       Y|
|  3| Jim| Mum|    10.0.0.1|         N|         Y|       Y|
|  4| sam| SFO|222.19.18.15|         Y|         N|       N|
|  5|same| HOU| 12.10.12.07|         N|         N|       N|
+---+----+----+------------+----------+----------+--------+

I hope this answer helps.

Answer 1 (score: 1)

You can use substring_index to compare against part of the ip address. Here is a slightly improved version of your code.

import pyspark.sql.functions as fn

# create sample data
data = [
  (1, "sam", "Hyd", "191.10.0.1"),
  (2, "Tim", "Mum", "10.0.0.1"),
  (3, "Jim", "Mum", "10.0.0.1"),
  (4, "sam", "SFO", "222.19.18.15"),
  (5, "same", "HOU", "12.10.12.07")
  ]

# create the dataframe
df = sc.parallelize(data).toDF(["id", "name", "city", "ip_add"])
df.show()

# compare lists
name_list = ['sam']
city_list = ['Mum']
ip_list = ['191.10', '10.0']

# checks to add; note the use of substring_index
# to get the first two parts of the ip address
checks = [
  (df.name, name_list, "name_check"),
  (df.city, city_list, "city_check"),
  (fn.substring_index(df.ip_add, '.', 2), ip_list, "ip_check")
]

# add the check columns to the original dataframe
for (col_to_check, col_check_list, col_add) in checks:
  df = df.withColumn(col_add, fn.when(col_to_check.isin(col_check_list), "Y").otherwise('N'))

Result:

df.show()
+---+----+----+------------+----------+----------+--------+
| id|name|city|      ip_add|name_check|city_check|ip_check|
+---+----+----+------------+----------+----------+--------+
|  1| sam| Hyd|  191.10.0.1|         Y|         N|       Y|
|  2| Tim| Mum|    10.0.0.1|         N|         Y|       Y|
|  3| Jim| Mum|    10.0.0.1|         N|         Y|       Y|
|  4| sam| SFO|222.19.18.15|         Y|         N|       N|
|  5|same| HOU| 12.10.12.07|         N|         N|       N|
+---+----+----+------------+----------+----------+--------+
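For reference, substring_index(str, delim, count) with a positive count returns everything before the count-th occurrence of delim (or the whole string when delim occurs fewer times). A rough pure-Python equivalent, shown only to illustrate the semantics for positive counts:

```python
def substring_index(s, delim, count):
    # mimic Spark's substring_index for positive counts:
    # everything before the count-th occurrence of delim,
    # or the whole string if delim occurs fewer than count times
    parts = s.split(delim)
    if count >= len(parts):
        return s
    return delim.join(parts[:count])

print(substring_index('191.10.0.1', '.', 2))   # 191.10
print(substring_index('10.0.0.1', '.', 2))     # 10.0
```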