I am using Spark 2 with PySpark. The DataFrame looks like this:
a = [('n_a xxxx 1111',0), ('n_A xxsssxx 1211',0),('n_a 1111',0),('n_c xxxx 1111',0)]
a = spark.createDataFrame(a, ['des', 'id'])
a.show(10,False)
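I want to select the rows whose des column starts with 'n_a' (case-insensitive) and extract the first 4-digit number to build a new column. The result should look like this: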
b = [('n_a 1111', ), ('n_A 1211', ),('n_a 1111', )]
b = spark.createDataFrame(b, ['new_column'])
b.show(10, False)
Answer 0 (score: 2)
You can use regexp_extract:
from pyspark.sql.functions import col, concat_ws, regexp_extract, trim

# Raw strings keep the regex backslashes intact
r = (r"(?i)"             # Case insensitive
     r"^(n_a)"           # Leading n_a
     r"(?:\s\S+\s|\s)"   # Either whitespace string whitespace or whitespace
     r"([0-9]{4})")      # Four digit number

a.select("id", concat_ws(
    " ",
    regexp_extract("des", r, 1),  # n_a prefix
    regexp_extract("des", r, 2)   # number
).alias("new_column")).where(trim(col("new_column")) != "").show()
This gives:
+---+----------+
| id|new_column|
+---+----------+
|  0|  n_a 1111|
|  0|  n_A 1211|
|  0|  n_a 1111|
+---+----------+
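To see how the two capture groups in the pattern above behave without starting a Spark session, here is a minimal sketch using Python's standard `re` module. The helper name `extract` is hypothetical and only for illustration. It assumes the pattern behaves the same way in Python's regex engine as in the Java engine Spark uses, which holds for the constructs in this pattern.

```python
import re

# Same pattern as in the answer, compiled with Python's re module.
PATTERN = re.compile(
    r"(?i)"             # case insensitive
    r"^(n_a)"           # leading n_a
    r"(?:\s\S+\s|\s)"   # either whitespace + token + whitespace, or just whitespace
    r"([0-9]{4})"       # four-digit number
)

def extract(des):
    """Hypothetical helper: prefix and number joined by a space, or '' if no match."""
    m = PATTERN.search(des)
    if m is None:
        return ""
    return " ".join([m.group(1), m.group(2)])

samples = ["n_a xxxx 1111", "n_A xxsssxx 1211", "n_a 1111", "n_c xxxx 1111"]
results = [extract(s) for s in samples]
print(results)
```

In Spark, regexp_extract returns an empty string when the pattern does not match, so concat_ws(" ", "", "") yields a single space for the 'n_c' row. That is why the answer filters on trim(col("new_column")) != "" instead of comparing the untrimmed column.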
Answer 1 (score: 1)
Try the snippet below:
from pyspark.sql.functions import concat_ws, regexp_extract

list = ['n_a', 'n_A']  # accepted prefixes (note: this name shadows the built-in list)
a.where(a.des.substr(1, 3).isin(list)).select(
    concat_ws(
        ' ',
        regexp_extract('des', r'(\w\_\w).*(\d\d\d\d).*', 1),
        regexp_extract('des', r'(\w\_\w).*(\d\d\d\d).*', 2)
    ).alias('new_column')
).show(10, False)
It gives you:
+----------+
|new_column|
+----------+
|n_a 1111 |
|n_A 1211 |
|n_a 1111 |
+----------+
Hope this solves your problem.
It mainly relies on two pieces:
where(a.des.substr(1, 3).isin(list))
selects the rows whose column value starts with one of the given strings, and then the concat:
concat_ws(' ', regexp_extract('des', '(\w\_\w).*(\d\d\d\d).*', 1), regexp_extract('des', '(\w\_\w).*(\d\d\d\d).*', 2)).alias('new_column')
implements "extract the first 4-digit number to build a new column", using the regular expression:
(\w\_\w).*(\d\d\d\d).*
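One caveat about this regular expression: the `.*` between the two groups is greedy, so when a value contains more than one four-digit run, the second group captures the last run, not the first. The sketch below illustrates this with Python's `re` module on a made-up input string (the sample data in the question has only one number per row, so both variants agree there). The lazy variant with `.*?` is a suggested alternative, not part of the original answer, and the `\_` escape is dropped because an underscore needs no escaping.

```python
import re

GREEDY = re.compile(r"(\w_\w).*(\d\d\d\d).*")   # pattern from the answer
LAZY = re.compile(r"(\w_\w).*?(\d\d\d\d).*")    # assumed alternative: lazy quantifier

text = "n_a 1111 xx 2222"  # hypothetical value with two four-digit numbers

greedy_digits = GREEDY.search(text).group(2)  # last four-digit run
lazy_digits = LAZY.search(text).group(2)      # first four-digit run

print(greedy_digits, lazy_digits)
```

The same `.*?` syntax is supported by the Java regex engine behind regexp_extract, so the lazy pattern should carry over to the PySpark call if the first number is what you need.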
Answer 2 (score: 0)
Try this:
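import pyspark.sql.functions as f

# Drop the middle token (a non-whitespace run with whitespace on both sides)
a = a.withColumn('des_clean', f.regexp_replace('des', r"\s([^\s]+)\s", " "))
# Split the cleaned string on whitespace and pick out the two parts
a = a.withColumn('split', f.split(f.col('des_clean'), r"\s"))
a = a.withColumn('first', f.col('split').getItem(0))
a = a.withColumn('second', f.col('split').getItem(1))
# Keep only the rows whose first token is n_a in either case
a = a.filter("first in ('n_a', 'n_A')")
a.select('des_clean').show(10)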
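The replace, split, and filter steps above can be traced on the sample strings in plain Python. This is only a rough sketch that mirrors the logic with the `re` module. The helper name `clean` is hypothetical, and it assumes re.sub and re.split behave like regexp_replace and split for these simple patterns.

```python
import re

def clean(des):
    # Mirrors regexp_replace: replace " token " with a single space.
    return re.sub(r"\s([^\s]+)\s", " ", des)

samples = ["n_a xxxx 1111", "n_A xxsssxx 1211", "n_a 1111", "n_c xxxx 1111"]

cleaned = [clean(s) for s in samples]

# Mirrors split + getItem(0) + filter: keep rows whose first token is n_a or n_A.
kept = [c for c in cleaned if re.split(r"\s", c)[0] in ("n_a", "n_A")]

print(cleaned)
print(kept)
```

The value 'n_a 1111' passes through unchanged because its only token after the prefix is not followed by whitespace, so the replace pattern never matches it.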