PySpark: select column values that start with a specific string

Time: 2017-12-12 16:45:12

Tags: apache-spark pyspark

I am using Spark 2 with PySpark. My DataFrame looks like this:

a = [('n_a xxxx 1111',0), ('n_A xxsssxx 1211',0),('n_a 1111',0),('n_c xxxx 1111',0)]
a = spark.createDataFrame(a, ['des', 'id'])
a.show(10,False)

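I want to select the rows where the des column starts with 'n_a' (case-insensitive) and extract the first 4-digit number to build a new column. The result should look like this: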

b = [('n_a 1111', ), ('n_A 1211', ),('n_a 1111', )]
b = spark.createDataFrame(b, ['new_column'])
b.show(10, False)

3 answers:

Answer 0 (score: 2)

You can use regexp_extract:

from pyspark.sql.functions import col, concat_ws, regexp_extract, trim

# Raw strings keep the regex backslashes intact.
r = (r"(?i)"            # Case insensitive
     r"^(n_a)"          # Leading n_a
     r"(?:\s\S+\s|\s)"  # Either whitespace-token-whitespace or a single whitespace
     r"([0-9]{4})")     # Four-digit number

a.select("id", concat_ws(
    " ",
    regexp_extract("des", r, 1),  # n_a prefix
    regexp_extract("des", r, 2)   # number
).alias("new_column")).where(trim(col("new_column")) != "").show()

This gives:

+---+----------+
| id|new_column|
+---+----------+
|  0|  n_a 1111|
|  0|  n_A 1211|
|  0|  n_a 1111|
+---+----------+
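
The pattern can be sanity-checked outside Spark. The sketch below applies the same expression to the sample values with Python's re module. It assumes that Python and Spark's Java regex engine treat these particular constructs the same way, and build_new_column is a hypothetical helper, not part of the answer.

```python
import re

# Same pattern as in the answer, written as one raw string.
PATTERN = re.compile(r"(?i)^(n_a)(?:\s\S+\s|\s)([0-9]{4})")

def build_new_column(des):
    """Hypothetical helper: return 'prefix number' if des matches, else None."""
    m = PATTERN.search(des)
    return f"{m.group(1)} {m.group(2)}" if m else None

samples = ["n_a xxxx 1111", "n_A xxsssxx 1211", "n_a 1111", "n_c xxxx 1111"]
results = [build_new_column(s) for s in samples]
print(results)  # the 'n_c' row does not match and yields None
```

In Spark the non-matching row produces two empty strings instead of None, which is why the answer filters on the trimmed column being non-empty.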

Answer 1 (score: 1)

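Try the following snippet:

from pyspark.sql.functions import concat_ws, regexp_extract

prefixes = ['n_a', 'n_A']       # accepted spellings of the prefix
pattern = r'(\w_\w).*?(\d{4})'  # group 1: prefix, group 2: first four-digit run

a.where(a.des.substr(1, 3).isin(prefixes)).select(
    concat_ws(
        ' ',
        regexp_extract('des', pattern, 1),
        regexp_extract('des', pattern, 2)
    ).alias('new_column')
).show(10, False)

It gives:

+----------+
|new_column|
+----------+
|n_a 1111  |
|n_A 1211  |
|n_a 1111  |
+----------+

Hope this solves your problem.

It mainly uses two steps. The first is the filter:

where(a.des.substr(1, 3).isin(prefixes))

which handles the "starts with a specific string" part of the request. The list only covers 'n_a' and 'n_A'; add 'N_a' and 'N_A' if those spellings can occur.

The second is the concat:

concat_ws(' ', regexp_extract('des', pattern, 1), regexp_extract('des', pattern, 2)).alias('new_column')

which handles "extract the first 4-digit number to build a new column", using the regular expression:

(\w_\w).*?(\d{4})

The lazy .*? makes the second group capture the first run of four digits rather than the last one.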
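
Whether the .* in front of the digit group is greedy or lazy matters as soon as a value contains more than one four-digit run. The sketch below compares both forms with Python's re module. The sample value is made up for illustration, and the comparison assumes Spark's Java regex engine backtracks the same way for these constructs.

```python
import re

value = "n_a 2017 xxxx 1111"  # hypothetical value with two four-digit runs

# Greedy: .* consumes as much as possible, so group 2 is the LAST four digits.
greedy = re.search(r"(\w_\w).*(\d{4})", value)
# Lazy: .*? consumes as little as possible, so group 2 is the FIRST four digits.
lazy = re.search(r"(\w_\w).*?(\d{4})", value)

greedy_digits = greedy.group(2)
lazy_digits = lazy.group(2)
print(greedy_digits, lazy_digits)
```

On the sample data from the question both forms return the same result, because each value contains exactly one four-digit run.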

Answer 2 (score: 0)

Try this:

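import pyspark.sql.functions as f

# Drop the middle token, e.g. 'n_a xxxx 1111' -> 'n_a 1111'
a = a.withColumn('des_clean', f.regexp_replace('des', r"\s([^\s]+)\s", " "))
# Split the cleaned value into its prefix and its number
a = a.withColumn('split', f.split(f.col('des_clean'), r"\s"))
a = a.withColumn('first', f.col('split').getItem(0))
a = a.withColumn('second', f.col('split').getItem(1))
# Keep only the rows whose prefix is n_a or n_A
a = a.filter("first in ('n_a', 'n_A')")
a.select('des_clean').show(10)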
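
The answer does not show its output, so here is a rough plain-Python trace of the same replace, split, and filter steps on the sample values. It is only a sketch: re.sub and re.split stand in for regexp_replace and split, on the assumption that they behave the same way for these simple patterns.

```python
import re

samples = ["n_a xxxx 1111", "n_A xxsssxx 1211", "n_a 1111", "n_c xxxx 1111"]

kept = []
for des in samples:
    # Mirrors regexp_replace: drop a token that sits between two whitespace characters.
    des_clean = re.sub(r"\s([^\s]+)\s", " ", des)
    # Mirrors split + getItem(0): the first token is the prefix.
    parts = re.split(r"\s", des_clean)
    # Mirrors the filter on the 'first' column.
    if parts[0] in ("n_a", "n_A"):
        kept.append(des_clean)

print(kept)
```

A value with only two tokens, such as 'n_a 1111', has no middle token to remove and passes through unchanged.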