I have a dataframe in PySpark, as shown below.
df.show()
+---+----+----+------------+
| id|name|city|      ip_add|
+---+----+----+------------+
|  1| sam| Hyd|  191.10.0.1|
|  2| Tim| Mum|    10.0.0.1|
|  3| Jim| Mum|    10.0.0.1|
|  4| sam| SFO|222.19.18.15|
|  5|same| HOU| 12.10.12.07|
+---+----+----+------------+
I want to populate some new columns based on certain lists. The lists are as follows.
name_list = ['sam']
city_list = ['Mum']
ip_list = ['191.10', '10.0']
Conditions for populating the new columns:
- name_check: if name equals 'sam', populate 'Y', otherwise 'N'.
- city_check: if city equals 'Mum', populate 'Y', otherwise 'N'.
- ip_check: if the first two groups of digits in ip_add equal '191.10' or '10.0', populate 'Y', otherwise 'N' (illustrated below).
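For example, here is the ip_check rule applied to row 1, sketched in plain Python (this is just an illustration of the intended logic, not Spark code):

# ip_check logic for a single value, in plain Python
ip_list = ['191.10', '10.0']
ip = '191.10.0.1'                         # row 1
prefix = '.'.join(ip.split('.')[:2])      # '191.10'
print('Y' if prefix in ip_list else 'N')  # Y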
I defined the function below. I want to use the same function for all three checks so I don't have to repeat the code.
from pyspark.sql.functions import when

def new_column(df, compare_list, column_to_add, column_to_check):
    # Flag 'Y' when the value in column_to_check is in compare_list, else 'N'
    final_df = df.withColumn(column_to_add, when(df[column_to_check].isin(compare_list), "Y").otherwise('N'))
    return final_df
Variables for the first column, name_check:
name_column_to_add = 'name_check'
name_column_to_check = 'name'
Calling the function:
name_df = new_column(df, name_list, name_column_to_add, name_column_to_check)
name_df.show()
+---+----+----+------------+----------+
| id|name|city|      ip_add|name_check|
+---+----+----+------------+----------+
|  1| sam| Hyd|  191.10.0.1|         Y|
|  2| Tim| Mum|    10.0.0.1|         N|
|  3| Jim| Mum|    10.0.0.1|         N|
|  4| sam| SFO|222.19.18.15|         Y|
|  5|same| HOU| 12.10.12.07|         N|
+---+----+----+------------+----------+
Variables for the second column, city_check:
city_column_to_add = 'city_check'
city_column_to_check = 'city'
Calling the function:
city_df = new_column(name_df, city_list, city_column_to_add, city_column_to_check)
city_df.show()
+---+----+----+------------+----------+----------+
| id|name|city|      ip_add|name_check|city_check|
+---+----+----+------------+----------+----------+
|  1| sam| Hyd|  191.10.0.1|         Y|         N|
|  2| Tim| Mum|    10.0.0.1|         N|         Y|
|  3| Jim| Mum|    10.0.0.1|         N|         Y|
|  4| sam| SFO|222.19.18.15|         Y|         N|
|  5|same| HOU| 12.10.12.07|         N|         N|
+---+----+----+------------+----------+----------+
Variables for the third column, ip_check:
ip_column_to_add = 'ip_check'
ip_column_to_check = 'ip_add'
Calling the function:
ip_df = new_column(city_df, ip_list, ip_column_to_add, ip_column_to_check)
ip_df.show()
+---+----+----+------------+----------+----------+--------+
| id|name|city|      ip_add|name_check|city_check|ip_check|
+---+----+----+------------+----------+----------+--------+
|  1| sam| Hyd|  191.10.0.1|         Y|         N|       N|
|  2| Tim| Mum|    10.0.0.1|         N|         Y|       N|
|  3| Jim| Mum|    10.0.0.1|         N|         Y|       N|
|  4| sam| SFO|222.19.18.15|         Y|         N|       N|
|  5|same| HOU| 12.10.12.07|         N|         N|       N|
+---+----+----+------------+----------+----------+--------+
Expected result:
+---+----+----+------------+----------+----------+--------+
| id|name|city|      ip_add|name_check|city_check|ip_check|
+---+----+----+------------+----------+----------+--------+
|  1| sam| Hyd|  191.10.0.1|         Y|         N|       Y|
|  2| Tim| Mum|    10.0.0.1|         N|         Y|       Y|
|  3| Jim| Mum|    10.0.0.1|         N|         Y|       Y|
|  4| sam| SFO|222.19.18.15|         Y|         N|       N|
|  5|same| HOU| 12.10.12.07|         N|         N|       N|
+---+----+----+------------+----------+----------+--------+
How do I get the desired result? I suspect isin is comparing the full ip_add string against the two-part prefixes in ip_list, which never match, so ip_check comes out all 'N'.
Answer 0 (score: 3)
Here is your code, modified and working:
name_list = ['sam']
city_list = ['Mum']
ip_list = ['191.10', '10.0']
from pyspark.sql import functions as f

def new_column(df, compare_list, column_to_add, column_to_check):
    # column_to_check is now a Column expression rather than a column name
    final_df = df.withColumn(column_to_add, f.when(column_to_check.isin(compare_list), "Y").otherwise('N'))
    return final_df
name_column_to_add = 'name_check'
name_column_to_check = 'name'
name_df = new_column(df, name_list, name_column_to_add, f.col(name_column_to_check))
city_column_to_add = 'city_check'
city_column_to_check = 'city'
city_df = new_column(name_df, city_list, city_column_to_add, f.col(city_column_to_check))
ip_column_to_add = 'ip_check'
ip_column_to_check = 'ip_add'
ip_df = new_column(
    city_df,
    ip_list,
    ip_column_to_add,
    f.concat_ws('.',
                f.split(f.col(ip_column_to_check), '\\.')[0],
                f.split(f.col(ip_column_to_check), '\\.')[1])
)
ip_df.show()
All you have to do is take a substring of the IP address so that you keep only the first two groups of digits separated by '.'; the split and concat_ws functions do exactly that. Then modify your new_column function to accept a Column as its last argument instead of a column name. You should now have:
+---+----+----+------------+----------+----------+--------+
| id|name|city|      ip_add|name_check|city_check|ip_check|
+---+----+----+------------+----------+----------+--------+
|  1| sam| Hyd|  191.10.0.1|         Y|         N|       Y|
|  2| Tim| Mum|    10.0.0.1|         N|         Y|       Y|
|  3| Jim| Mum|    10.0.0.1|         N|         Y|       Y|
|  4| sam| SFO|222.19.18.15|         Y|         N|       N|
|  5|same| HOU| 12.10.12.07|         N|         N|       N|
+---+----+----+------------+----------+----------+--------+
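If you want to see what the Column expression passed for ip_check evaluates to before using it, you can select it on its own (a quick sanity check against the same df; the ip_prefix alias is just for illustration):

from pyspark.sql import functions as f

# First two octets of ip_add, rejoined with '.'
ip_prefix = f.concat_ws('.',
                        f.split(f.col('ip_add'), '\\.')[0],
                        f.split(f.col('ip_add'), '\\.')[1])
df.select('ip_add', ip_prefix.alias('ip_prefix')).show()
# e.g. 191.10.0.1 -> 191.10, 10.0.0.1 -> 10.0, 222.19.18.15 -> 222.19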
I hope this answer helps.
Answer 1 (score: 1)
You can use substring_index to compare against just the relevant part of the IP address. Here is a slightly cleaner version of your code:
import pyspark.sql.functions as fn

# create sample data
data = [
    (1, "sam", "Hyd", "191.10.0.1"),
    (2, "Tim", "Mum", "10.0.0.1"),
    (3, "Jim", "Mum", "10.0.0.1"),
    (4, "sam", "SFO", "222.19.18.15"),
    (5, "same", "HOU", "12.10.12.07")
]

# create dataframe
df = sc.parallelize(data).toDF(["id", "name", "city", "ip_add"])
df.show()

# compare lists
name_list = ['sam']
city_list = ['Mum']
ip_list = ['191.10', '10.0']

# checks to apply; note the use of substring_index to get the
# first two dot-separated groups of the ip address
checks = [
    (df.name, name_list, "name_check"),
    (df.city, city_list, "city_check"),
    (fn.substring_index(df.ip_add, '.', 2), ip_list, "ip_check")
]

# add the check columns to the original dataframe
for (col_to_check, col_check_list, col_add) in checks:
    df = df.withColumn(col_add, fn.when(col_to_check.isin(col_check_list), "Y").otherwise('N'))
Result:
df.show()
+---+----+----+------------+----------+----------+--------+
| id|name|city|      ip_add|name_check|city_check|ip_check|
+---+----+----+------------+----------+----------+--------+
|  1| sam| Hyd|  191.10.0.1|         Y|         N|       Y|
|  2| Tim| Mum|    10.0.0.1|         N|         Y|       Y|
|  3| Jim| Mum|    10.0.0.1|         N|         Y|       Y|
|  4| sam| SFO|222.19.18.15|         Y|         N|       N|
|  5|same| HOU| 12.10.12.07|         N|         N|       N|
+---+----+----+------------+----------+----------+--------+
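As a style note, if you prefer not to reassign df by hand in a loop, the same checks can be folded in with functools.reduce from the standard library; this is equivalent to the loop above, just a different way of writing it:

from functools import reduce

# Fold each (column, list, name) check into the dataframe
df = reduce(
    lambda acc, check: acc.withColumn(
        check[2],
        fn.when(check[0].isin(check[1]), "Y").otherwise("N")
    ),
    checks,
    df,
)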