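I have a PySpark DataFrame and want to split column A into A1 and A2 using a regular expression, but my attempt does not work.
The desired result looks like this: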
A | A1 | A2
20-13-2012-monday | 20-13-2012 | monday
20-14-2012-tues | 20-14-2012 | tues
20-13-2012-wed | 20-13-2012 | wed
My code is as follows:
import re
from pyspark.sql.functions import regexp_extract
reg = r'^([\d]+-[\d]+-[\d]+)'
df=df.withColumn("A1",re.match(reg, df.select(['A'])).group())
df.show()
Answer 0 (score: 1)
You can apply the regular expression inside a UDF to get the desired output, as shown below:
>>> import re
>>> from pyspark.sql.types import *
>>> from pyspark.sql.functions import udf
>>> def get_date_day(a):
...     # re.split keeps the captured date; the remainder still starts with '-'
...     x, y = re.split(r'^([\d]+-[\d]+-[\d]+)', a)[1:]
...     return [x, y[1:]]  # drop the leading hyphen from the day part
...
>>> get_date_day('20-13-2012-monday')
['20-13-2012', 'monday']
>>> get_date_udf = udf(get_date_day, ArrayType(StringType()))
>>> df = sc.parallelize([('20-13-2012-monday',), ('20-14-2012-tues',), ('20-13-2012-wed',)]).toDF(['A'])
>>> df.show()
+-----------------+
|                A|
+-----------------+
|20-13-2012-monday|
|  20-14-2012-tues|
|   20-13-2012-wed|
+-----------------+
>>> df = df.withColumn("A12", get_date_udf('A'))
>>> df.show(truncate=False)
+-----------------+--------------------+
|A                |A12                 |
+-----------------+--------------------+
|20-13-2012-monday|[20-13-2012, monday]|
|20-14-2012-tues  |[20-14-2012, tues]  |
|20-13-2012-wed   |[20-13-2012, wed]   |
+-----------------+--------------------+
>>> df = df.withColumn("A1", udf(lambda x:x[0])('A12')).withColumn("A2", udf(lambda x:x[1])('A12'))
>>> df = df.drop('A12')
>>> df.show(truncate=False)
+-----------------+----------+------+
|A                |A1        |A2    |
+-----------------+----------+------+
|20-13-2012-monday|20-13-2012|monday|
|20-14-2012-tues  |20-14-2012|tues  |
|20-13-2012-wed   |20-13-2012|wed   |
+-----------------+----------+------+
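As an alternative to the UDF, the built-in regexp_extract function that the question already imports can probably do the same split without leaving Spark's own expression engine. The following is only a sketch under a few assumptions: the helper name split_date_and_day, the PATTERN constant, and the two-group pattern are mine, not part of the original answer, and the pattern assumes every value has the form digits-digits-digits-rest.

```python
import re

# Two capture groups: the leading d-d-d date, then everything after the next hyphen.
# Assumed pattern, written so that it is valid for both Python's re and Java regex.
PATTERN = r"^(\d+-\d+-\d+)-(.*)$"


def split_date_and_day(df, source="A"):
    """Add columns A1 (date part) and A2 (remainder) to a PySpark DataFrame.

    Hypothetical helper for illustration; `source` is the column to split.
    """
    # Imported inside the function so the pattern check below runs without Spark.
    from pyspark.sql.functions import regexp_extract

    return (
        df.withColumn("A1", regexp_extract(source, PATTERN, 1))
          .withColumn("A2", regexp_extract(source, PATTERN, 2))
    )


# Plain-Python check of the pattern on one of the sample values.
match = re.match(PATTERN, "20-13-2012-monday")
print(match.groups())
```

With the df built above, split_date_and_day(df).show(truncate=False) should give the same A, A1, A2 columns. One difference worth knowing: regexp_extract returns an empty string for rows that do not match, whereas get_date_day raises an error on such rows because the unpacking into x and y fails.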
Hope this helps!